What is SMART?
SMART (Self-Monitoring, Analysis, and Reporting Technology) is a built-in monitoring and reporting technology included in computer hard disk drives (HDDs) and solid-state drives (SSDs).
SMART was introduced in the 1990s and quickly adopted by all disk manufacturers to detect and report indicators of drive reliability. The intent was to diagnose failure modes in returned disk drives. Later, it was extended to help predict imminent disk failures.
While a whole range of SMART values is stored and reported by disk drives, only a few are associated with actual imminent failure of disks. These parameters, as reported in studies by Google and Backblaze, are:
For Flash media: (only)
There are other SMART parameters that some feel may be correlated with imminent failure, such as power-on events and shock events, but it is generally accepted that the above SMART parameters are the most critical for predicting disk failures.
SoftRAID uses SMART intelligently to help users understand when their drives show signs of failure, so the drives can be replaced before they fail completely.
SoftRAID adds “Disk I/O errors” to our predicted failure alerts. While not a “SMART” failure, an I/O error indicates either that the disk cannot read or write, or that there is a communication error, which could point to a hardware issue with the drive or its enclosure, or possibly a damaged volume directory.
SoftRAID runs SMART checks once a day and when you launch the SoftRAID application.
Failure mode: SMART – Fail
If you get an alert that a disk has failed the SMART test, it has already failed. A disk that fails SMART cannot come back to life and can cause data corruption on other drives in your enclosure. Immediately stop using any disk that shows a SMART failure. This applies to HDDs, SSDs, and NVMe drives.
There is a chance you may have time to copy data from the disk before it stops functioning, but it’s wise to immediately remove this disk from your enclosure and replace it. If your volume is RAID 4, RAID 5, or RAID 10, your volume should still function with a disk missing.*
*(See volume does not mount)
Predicted Failure Modes
Predicted failures include reallocated sectors, uncorrectable sectors, unreliable sectors, pending reallocations, and failed reallocations.
What is a sector and why are these parameters important?
A drive stores data in “chunks” called sectors. They are generally 512 bytes or 4K (4096 bytes) in size. A “checksum” is added to each sector to help reconstruct the data values stored in the sector in the event it cannot be read. When a drive can no longer reliably read a sector of data, this data is automatically moved to another location on the drive. This is called a “reallocated sector”.
In theory, if a reallocated sector is OK, the drive should be able to continue working as normal. Back in the early days of disk drives, reallocated sectors were very common, and drives could have hundreds of them. However, as disks became more reliable over time, a reallocated sector on a drive became rare.
IT managers suspected there might be a correlation between a drive getting a reallocated sector and the drive failing, and a Google study of 100,000 disks confirmed this was the case. Google found that a disk with even a single reallocated sector was 20-60 times more likely to fail in the next 60 days than a normal drive. Others have replicated this result since.
Thus, do not ignore this SMART parameter. It is a highly correlated indicator that your disk is likely to fail soon.
Here is what each of these predicted failure modes indicates and what you should do:
Reallocated sectors (HDDs only)
Once a disk has reallocated sectors, this is an indicator, according to multiple studies of thousands of disk drives, that the drive has started to fail. It may last another year, even more, or it may fail in the next 24 hours. It is like a diagnosis of congestive heart disease. You can live another decade or have a fatal stroke the next day.
If your disk has reallocated sectors, plan on replacing it immediately. If this “just” happened, and the reallocated sector count is low, you probably have time to order a new drive and replace it. However, if the reallocated sector count starts growing, or passes the mid-teens, you should remove the drive to avoid more serious data issues.
Sometimes a power outage, or powering up a drive that has not been powered up for a long time, may cause the drive to reallocate a sector or two. While that drive may last many months, it’s best to replace it immediately, and perhaps use it as a second or third backup drive.
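The guidance above on reallocated sector counts can be sketched as a small triage function. This is only an illustration, not how SoftRAID itself decides: the function name, the return strings, and the cutoff of 15 (one reading of “mid-teens”) are assumptions made for the example.

```python
def triage_reallocated(count, previous_count=None, high_count=15):
    """Suggest an action for a drive's reallocated sector count.

    high_count=15 is an assumed reading of "mid-teens"; it is not a
    documented SoftRAID threshold.
    """
    if count <= 0:
        return "ok"
    # "Growing" means the drive already had reallocated sectors at the
    # previous check and has more now. A first appearance is not growth.
    growing = (
        previous_count is not None
        and previous_count > 0
        and count > previous_count
    )
    if growing or count > high_count:
        return "remove now"
    # A low count that has only just appeared: order a replacement.
    return "order replacement"


if __name__ == "__main__":
    for now, before in [(0, None), (2, 0), (5, 2), (16, None)]:
        print(now, before, triage_reallocated(now, before))
```

Comparing against the previous reading matters because a rising count is treated as more urgent than a small, stable one.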
Failed reallocations
Failed reallocations indicate that the drive attempted to reallocate sectors but could not. This state generally means the drive has mere days or weeks before failure. Replace this drive immediately.
Pending reallocation
Pending reallocations appear when a drive had to retry reading a sector multiple times, used the checksum to retrieve the data from the sector, and then marked the sector for reallocation. It’s unclear whether this is a permanent failure mode or not, but it likely is.
What we recommend is you certify the disk.
If the disk passes a certify without reallocating these sectors, you can put it back into service.
Unreliable sectors
If a cable gets pulled unexpectedly, the Thunderbolt bus ejects or resets, or there is a power surge or brownout, the drive may have to retry reading the data. The disk then marks the affected sector as “unreliable”.
What we recommend is you certify the disk.
If the disk passes a certify without reallocating these sectors, you can put it back into service.
A disk with “unreliable” sectors often passes the certify and clears this error condition.
Excess Power on Hours on Disks
Hard drives have a limited service life. Most industry experts suggest replacing mission-critical drives after 20,000 to 25,000 hours. Once a drive has this many hours of use, it begins to fail at a statistically faster rate. For example, each drive in a set may have a 1.5% likelihood of failing each year. As the drives age, this number increases slowly, but by 3 years (20,000 to 25,000 hours), the chance of failure is at least 5-10% higher per year of use. Very few drives survive past 40,000 hours, and by 50,000 hours, 90% of drives have failed.
SoftRAID alerts when drives start to reach the point where the failure rate increases, so you can replace your drives before they fail.
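To put power-on hours in context, it can help to convert them to years of continuous use and compare them with the 20,000 to 25,000 hour range mentioned above. The sketch below is illustrative only. The function names and status labels are assumptions, and the thresholds SoftRAID actually uses for its alerts are not stated here.

```python
HOURS_PER_YEAR = 24 * 365  # continuous, always-on operation


def power_on_years(hours):
    """Convert SMART power-on hours to years of continuous use."""
    return hours / HOURS_PER_YEAR


def age_status(hours, replace_from=20_000, replace_by=25_000):
    """Classify a drive against the 20,000-25,000 hour replacement range.

    The labels are hypothetical; the defaults come from the range above.
    """
    if hours < replace_from:
        return "normal"
    if hours < replace_by:
        return "plan replacement"
    return "overdue"


if __name__ == "__main__":
    for hours in (10_000, 22_000, 40_000):
        print(hours, round(power_on_years(hours), 2), age_status(hours))
```

Note that a drive that is powered on around the clock passes 20,000 hours after a little more than two years.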
Media Wear SSD + Flash Media worn out
Flash media (SSD/NVMe drives) fail differently than mechanical hard drives.
Flash media is designed to constantly reallocate bad sectors. Thus, reallocation counts do not make any sense as a predictive mode of failure. However, flash media drives have a “limited” number of spare sectors. When spare sectors run low, the drive is likely to fail imminently.
SoftRAID alerts you when the drive is running out of spare sectors. When SoftRAID is able to obtain the “media wear” indicator, it shows you this number, which starts at 100% and gradually decreases to 90%, 80%, and so on.
Flash media drives should perform well down to 10%. At 10% wear (spare sectors) remaining, you need to replace the drive ASAP.
Media worn out:
When a flash drive runs out of spare sectors, the drive has failed. Flash drives are designed to keep working when they run out of spare sectors, but they are converted to “read only” and can no longer be written to. We have also seen drives simply stop working completely without notice. Never let a flash media drive get down to 0% media wear, as unpredictable behavior may occur.
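The media wear guidance can be summarized in a short sketch that maps the remaining percentage to an action. It is a minimal illustration, not SoftRAID’s implementation. The function name and status labels are hypothetical, and the input is assumed to be the “remaining” value that counts down from 100%.

```python
def media_wear_status(remaining_percent, replace_at=10):
    """Map remaining media wear (100 = new, 0 = worn out) to an action.

    replace_at=10 follows the 10% guidance above; labels are hypothetical.
    """
    if not 0 <= remaining_percent <= 100:
        raise ValueError("remaining_percent must be between 0 and 100")
    if remaining_percent == 0:
        # No spare sectors left: expect read-only or unpredictable behavior.
        return "worn out"
    if remaining_percent <= replace_at:
        return "replace now"
    return "ok"


if __name__ == "__main__":
    for pct in (100, 50, 10, 0):
        print(pct, media_wear_status(pct))
```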
I/O Errors
SoftRAID reports any time a disk has an I/O error. If possible, SoftRAID also saves the I/O error counters to the disks and volumes.
An I/O error means a read or a write “failed” to be completed. There are several reasons for this:
I/O errors should be treated seriously, but they need to be investigated. An I/O error does not mean your disks failed. It means communication failed, for one of the reasons above.
To clear the I/O error count on both disks and volumes, select the disk(s) or volume, choose “Clear I/O counters” in the SoftRAID utilities menu, and select errors only.
Look for a pattern if the error happens again. You can open a support case with SoftRAID, and save a SoftRAID tech support file, if you need help diagnosing this issue.
Note: if you get I/O errors on disks that have reallocated sectors, or that are predicted to fail, the errors are likely the result of disk failure, and the disks need to be replaced immediately.