Using redundant drives, keeping backups and so on helps keep your data safe. Still, staying ahead of the curve is better than just waiting for a hard drive to fail. So how can we improve the overall management of our drives, and how can we estimate when a drive is likely to fail? What few people know is that the disk itself provides a means (SMART) of getting advance warning that something is wrong. Let's go through it step by step:
How do I find out my harddrive specs, the drive's manufacturer and capabilities?
On Linux there are several tools that tell you, more or less verbosely, about the status, names and specs of your hard drives.
df -h
will give you a nice overview of the mounted drives, their device names (which we will need later on) and how much space is left on each of them.
A partition listing tool such as fdisk -l (run as root) will give more details about the partitions visible to the system.
hdparm -I /dev/sda (or your device's name, find out about it with df -h)
will give detailed data about your drive: the manufacturer, serial number, exact model name, S.M.A.R.T. capabilities and more. Quite handy on a remotely housed server you've never seen. hdparm is not installed by default on many systems, e.g. Debian. Installation is simple, though, with
apt-get install hdparm.
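The full hdparm -I report is long, so it can help to pull out just the identity lines. The following is a minimal sketch, assuming the report labels those lines "Model Number", "Serial Number" and "Firmware Revision"; the function name drive_identity is made up for this example and is not part of hdparm.

```shell
#!/bin/sh
# drive_identity (hypothetical helper): read `hdparm -I` output on stdin
# and print model, serial and firmware as key=value pairs.
# Assumption: the report uses "Label:   value" lines for these fields.
drive_identity() {
    awk -F: '
        /Model Number/      { gsub(/^[ \t]+|[ \t]+$/, "", $2); print "model=" $2 }
        /Serial Number/     { gsub(/^[ \t]+|[ \t]+$/, "", $2); print "serial=" $2 }
        /Firmware Revision/ { gsub(/^[ \t]+|[ \t]+$/, "", $2); print "firmware=" $2 }
    '
}

# Typical use (needs root and a real drive):
#   hdparm -I /dev/sda | drive_identity
```

On a server with several drives, running this once per device gives a short inventory you can keep next to your backup notes.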
How do I predict drive failures, test if my drive is producing errors or doing ok?
On a desktop system, which is booted quite often, a short diagnostic message on startup might be enough to keep track of your hard disk's general health. The noise the drive makes also gives you quite a good idea of whether your data is safe or not: clicking or other strange noises might alert you just in time. It is a totally different story on a remote server, where you can't inspect the system manually, can't hear it and can't do regular bootup checks.
That's where SMART comes in. Short for "Self-Monitoring, Analysis, and Reporting Technology", SMART comes as standard with most ATA, SATA, SATA-2 and even SCSI drives. It enables you to log errors, keep track of read/write errors, carry out different types of drive self-tests and more.
First, as the package is non-standard on Debian as well, install the smartmontools with
apt-get install smartmontools
The smartmontools package consists of a command line tool (smartctl) and an optional daemon (smartd). Show your drive's SMART health with
smartctl -H /dev/sda
Enable SMART on the drive, if it is not already enabled, with
smartctl -s on -d ata /dev/sda
and read out even more data with
smartctl -d ata -a /dev/sda
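Besides the printed report, smartctl also signals problems through its exit status, which the man page documents as a bitmask. The sketch below decodes that bitmask into readable messages so it can be used from a cron job or script; the function name describe_smartctl_status and the message wording are my own, only the bit meanings are taken from the man page.

```shell
#!/bin/sh
# describe_smartctl_status (hypothetical helper): decode the exit status
# bitmask described in the EXIT STATUS section of the smartctl man page.
# Prints one line per set bit; returns 0 only if the status is 0.
describe_smartctl_status() {
    status=$1
    if [ "$status" -eq 0 ]; then
        echo "ok"
        return 0
    fi
    [ $((status & 1))   -ne 0 ] && echo "command line did not parse"
    [ $((status & 2))   -ne 0 ] && echo "device open failed"
    [ $((status & 4))   -ne 0 ] && echo "a SMART command failed"
    [ $((status & 8))   -ne 0 ] && echo "DISK FAILING"
    [ $((status & 16))  -ne 0 ] && echo "prefail attribute at or below threshold"
    [ $((status & 32))  -ne 0 ] && echo "attribute was at or below threshold in the past"
    [ $((status & 64))  -ne 0 ] && echo "error log contains errors"
    [ $((status & 128)) -ne 0 ] && echo "self-test log contains errors"
    return 1
}

# Typical use (needs root and a real drive):
#   smartctl -H /dev/sda >/dev/null; describe_smartctl_status $?
```

Checking the exit status rather than searching the text output keeps a script independent of the exact wording of the report.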
smartctl can perform a number of tests. According to the man page, "SMART provides three basic categories of testing. The first category, called "online" testing, has no effect on the performance of the device. It is turned on by the '-s on' option." That is what we did above. "The second category of testing is called "offline" testing. This type of test can, in principle, degrade the device performance. The '-o on' option causes this offline testing to be carried out, automatically, on a regular scheduled basis." Finally, "the third category of testing (and the only category for which the word 'testing' is really an appropriate choice) is "self" testing."
Start a quick offline test with
smartctl -t offline /dev/sda
and read the error log with
smartctl -d ata -l error /dev/sda
Do a more intense short test with
smartctl -t short /dev/sda
and read out the self-test log with
smartctl -d ata -l selftest /dev/sda
and so on. Read the man page for more testing options using the -t switch.
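To check the result of the most recent self-test from a script, the self-test log can be filtered. This is a rough sketch that assumes the usual ATA log layout, where the newest entry starts with "# 1" and the columns are separated by two or more spaces; the function names are invented for the example, and other drive types may print a different layout.

```shell
#!/bin/sh
# last_selftest_status (hypothetical helper): read the output of
# `smartctl -l selftest` on stdin and print the Status column of the
# newest entry. Assumption: the newest entry starts with "# 1" and
# columns are separated by at least two spaces.
last_selftest_status() {
    awk -F'  +' '/^# 1 / { print $3; exit }'
}

# last_selftest_passed (hypothetical helper): succeed only if the newest
# entry reports "Completed without error".
last_selftest_passed() {
    [ "$(last_selftest_status)" = "Completed without error" ]
}

# Typical use (needs root and a real drive):
#   smartctl -l selftest /dev/sda | last_selftest_passed || echo "check the drive"
```

Remember that a self-test takes a while to finish, so read the log only after the time smartctl announced when the test was started.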
Don't be too confused by messages telling you "Device does not support Self Test logging" and similar. Most modern drives support SMART extensively; in most cases these messages just mean that you haven't enabled all of SMART's features in your configuration, or that smartctl is using the wrong device type to reach your SMART-capable drive.
If you get "Error Counter logging not supported" from the
smartctl -l error /dev/sda command, try this:
smartctl -d ata -l error /dev/sda, which explicitly tells smartctl which device type to use (ata here) to read out the data.
The difference between using smartctl manually and running the smartd daemon is that smartctl only checks when you start it by hand, while the daemon does so automatically at predefined intervals. From the man page: "smartd is a daemon that monitors SMART. smartd will attempt to enable SMART monitoring on ATA devices (equivalent to
smartctl -s on) and polls these and SCSI devices every 30 minutes (configurable), logging SMART errors and changes of SMART Attributes via the SYSLOG interface."
Read the notes in the smartd config file at /etc/smartd.conf and the man page if you opt for the automatic/scheduled version of SMART monitoring.
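As a starting point for the scheduled version, here is a small sketch that prints one candidate line for /etc/smartd.conf. The directives are the ones described in the smartd.conf man page: -a for the default set of checks, -o on for automatic offline testing, -S on for attribute autosave, -s for a self-test schedule (here a short test daily in the 02:00 hour and a long test on Saturdays in the 03:00 hour) and -m for the mail recipient. The function name smartd_entry, the schedule and the recipient are assumptions for illustration; adapt them to your system.

```shell
#!/bin/sh
# smartd_entry (hypothetical helper): print a candidate smartd.conf line
# for one device. $1 = device, $2 = mail recipient for warnings.
smartd_entry() {
    device=$1
    mailto=$2
    printf '%s -a -o on -S on -s (S/../.././02|L/../../6/03) -m %s\n' \
        "$device" "$mailto"
}

# Typical use: review the line, then add it to /etc/smartd.conf by hand.
#   smartd_entry /dev/sda root
```

Printing the line instead of writing the file directly leaves the existing configuration untouched until you have compared it with the notes in the config file.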
According to this research paper from Google employees, the most important stats in your SMART log are scan errors, reallocation counts and probational counts. After the first of any of these errors, the probability of drive survival declined significantly over the following 8 months. Other factors are not so helpful; for example, in the research, not one drive among Google's thousands of drives had "spin up retries".
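To keep an eye on these counters, the attribute table printed by smartctl -A can be filtered for non-zero raw values. The sketch below is a guess at a useful mapping, not something taken from the paper: it treats attribute 5 (Reallocated_Sector_Ct), 196 (Reallocated_Event_Ct), 197 (Current_Pending_Sector, roughly the "probational" sectors) and 198 (Offline_Uncorrectable) as the ones to watch, and it assumes the usual ten-column ATA table with the raw value in the last column. The function name is invented for the example.

```shell
#!/bin/sh
# nonzero_risk_attributes (hypothetical helper): read `smartctl -A`
# output on stdin and print NAME=RAW for watched attributes whose raw
# value is above zero.
# Assumptions: IDs 5, 196, 197 and 198 are the interesting ones, and the
# raw value is the 10th whitespace-separated column.
nonzero_risk_attributes() {
    awk '$1 ~ /^(5|196|197|198)$/ && $10 + 0 > 0 { print $2 "=" $10 }'
}

# Typical use (needs root and a real drive):
#   smartctl -A /dev/sda | nonzero_risk_attributes
```

An empty result means none of the watched counters has moved yet; any output is a good reason to check your backups and plan a replacement.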
Conclusion: "We find, for example, that after their first scan error, drives are 39 times more likely to fail within 60 days than drives with no such errors. First errors in reallocations, offline reallocations, and probational counts are also strongly correlated to higher failure probabilities. Despite those strong correlations, we find that failure prediction models based on SMART parameters alone are likely to be severely limited in their prediction accuracy, given that a large fraction of our failed drives have shown no SMART error signals whatsoever. This result suggests that SMART models are more useful in predicting trends for large aggregate populations than for individual components."
So take that as a note: don't expect too much from SMART. It can help you diagnose a running drive, but the data acquired from it will not tell you a drive's fate with 100% certainty. So use it, but do not rely solely on SMART!
More reading:
Starting the SMART daemon and running it (German).