Predicted, SMART Failure

This support section explains predicted failures, SMART failures, worn-out flash media, and failed SMART/predicted failure alerts on exempted drives, and gives troubleshooting steps for each.

What is SMART?
SMART (Self-Monitoring, Analysis, and Reporting Technology) is a monitoring and reporting technology built into computer hard disk drives (HDDs) and solid-state drives (SSDs).

SMART was introduced in the 1990s and was quickly adopted by all disk manufacturers to detect and report indicators of drive reliability. The original intent was to diagnose failure modes in returned disk drives; it was later extended to help predict imminent disk failures.

While disk drives store and report a whole range of SMART values, only a few are associated with the actual imminent failure of disks. These parameters, as reported in studies by Google and Backblaze, are:

  • SMART check (pass / fail)
  • Reallocated sectors
  • Failed reallocations
  • Pending reallocations
  • Unreliable sectors
For flash media only:

  • Media wear indicator (10% or less)
  • Media worn out (drive has been marked read only)

Some feel that other SMART parameters, such as power-on events and shock events, may also be correlated with imminent failure, but it is generally accepted that the first five SMART parameters above are the most critical for predicting disk failures.
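The critical parameters can be read as a simple checklist. The following Python sketch is only an illustration of that idea: `DiskReport` and `critical_findings` are hypothetical names invented for this example, not part of SoftRAID or of any SMART tool, and the values would have to come from whatever utility reads SMART data on your system.

```python
from dataclasses import dataclass


@dataclass
class DiskReport:
    """Hypothetical snapshot of the five critical SMART values for one disk."""
    smart_passed: bool
    reallocated_sectors: int = 0
    failed_reallocations: int = 0
    pending_reallocations: int = 0
    unreliable_sectors: int = 0


def critical_findings(report):
    """Return a list describing every critical parameter that is raised."""
    findings = []
    if not report.smart_passed:
        findings.append("SMART check failed")
    counters = {
        "Reallocated sectors": report.reallocated_sectors,
        "Failed reallocations": report.failed_reallocations,
        "Pending reallocations": report.pending_reallocations,
        "Unreliable sectors": report.unreliable_sectors,
    }
    # Any non-zero counter is reported; an empty list means nothing is raised.
    for name, count in counters.items():
        if count > 0:
            findings.append(f"{name}: {count}")
    return findings
```

An empty list means none of the five parameters is raised; anything else deserves attention.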

    SoftRAID uses SMART intelligently, to help users understand when their drives may be starting to fail, so they can replace them before they actually fail.

    We have added “Disk IO errors” to our predicted failure alerts. While not a “SMART” failure, IO errors indicate that either the disk cannot read or write, or there is a communication error. Either case could point to a hardware issue with the drive or its enclosure, or to a damaged volume directory.

    SoftRAID runs SMART checks once a day and when you launch the SoftRAID application.

    Here is what the various SMART alerts indicate:

    Failure mode: SMART – fail

    If you get an alert that a disk has failed the SMART test, it has already FAILED.

    There is a chance you may have time to copy data off it before it stops functioning totally, but it’s wise to immediately remove this disk from your enclosure and replace it. If your volume is RAID 4, RAID 5, or RAID 10, your volume should still function with a disk missing. * (See volume does not mount)

    A disk that fails SMART cannot come back to life, but it can cause data corruption on other drives in your enclosure, so immediately stop using any disk that shows a SMART failure.

    This includes HDDs, SSDs, and NVMe drives.

    Predicted Failure modes

    Predicted failures include reallocated sectors, uncorrectable sectors, unreliable sectors, pending reallocations, and failed reallocations.

    What is a sector and why are these parameters important?
    A drive stores data in “chunks” called sectors. They are generally 512 bytes or 4K (4096 bytes) in size. A “checksum” (an error-correcting code) is stored with each sector so the drive can reconstruct the data values in the sector if it cannot be read cleanly. When a drive can no longer reliably read a sector, its data is automatically moved to another location on the drive. This is called a “reallocated sector.”
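To make the idea of reallocation concrete, here is a toy Python model. It is a sketch under stated assumptions, not a description of real drive firmware: the `ToyDrive` class is hypothetical, and it assumes that a reallocation fails only when no spare sector is left, which is just one possible cause.

```python
class ToyDrive:
    """Toy model of sector reallocation (illustration only, not real firmware)."""

    def __init__(self, spare_sectors):
        self.spare_sectors = spare_sectors  # spare sectors still available
        self.remapped = {}                  # original sector -> spare slot index
        self.failed_reallocations = 0

    def reallocate(self, sector):
        """Move an unreadable sector to a spare; return True on success."""
        if sector in self.remapped:
            return True  # this sector was already moved earlier
        if self.spare_sectors == 0:
            # Assumption: the reallocation fails because no spare is left.
            self.failed_reallocations += 1
            return False
        self.remapped[sector] = len(self.remapped)
        self.spare_sectors -= 1
        return True

    @property
    def reallocated_sectors(self):
        """Number of sectors that now live in a spare location."""
        return len(self.remapped)
```

In this model, a drive created with `ToyDrive(spare_sectors=1)` reallocates one sector successfully and records a failed reallocation on the second attempt.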

    In theory, a reallocated sector is OK, and the drive should be able to continue working as normal. Back in the early days of disk drives, reallocated sectors were very common, and drives could have hundreds of them. However, as disks became more reliable over time, a reallocated sector on a drive became rare.

    IT managers had long suspected a correlation between a drive getting a reallocated sector and the drive failing, but it was the Google study of 100,000 disks that confirmed it. The study found that a disk with even a single reallocated sector was 20-60 times more likely to fail in the next 60 days (about 2 months) than a normal drive. Others have replicated this result.

    So do not ignore this SMART parameter. It is a highly correlated indicator that your disk is likely to fail very soon.

    Here is what each of these predicted failure modes indicates and what you should do:

    Reallocated sectors (HDDs only)
    Once a disk has reallocated sectors, this is an indicator, according to multiple studies of thousands of disk drives, that the drive has started to fail. It may last another year, even more, but it may fail in the next 24 hours. It is like a diagnosis of congestive heart disease. You can live another decade or have a fatal stroke the next day.

    If your disk has reallocated sectors, plan to replace it immediately. If this “just” happened, and the reallocated sector count is low, you probably have time to order a new drive and replace it. However, if the reallocated sector count starts growing, or passes the mid-teens, you should just pull out the drive to avoid more serious data issues.
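The rule of thumb above can be written down as a small decision function. This is a minimal sketch: `reallocated_sector_action` is a hypothetical helper, and the default threshold of 15 is an assumed stand-in for "the mid-teens", not a value taken from SoftRAID.

```python
def reallocated_sector_action(previous_count, current_count, pull_threshold=15):
    """Suggest an action from two successive reallocated-sector readings.

    pull_threshold=15 is an assumption standing in for "the mid-teens".
    """
    if current_count == 0:
        return "no action"
    # "Growing" means the drive already had reallocated sectors and now has more.
    growing = current_count > previous_count > 0
    if growing or current_count >= pull_threshold:
        return "pull the drive now"
    return "order a replacement drive"
```

For example, a first reading of 2 reallocated sectors gives "order a replacement drive", while a jump from 2 to 5 gives "pull the drive now".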

    Sometimes a power outage, or powering up a drive that has been unpowered for a long time, may cause the drive to reallocate a sector or two. While that drive may last many months, it is best to replace it immediately and repurpose the old drive as a second or third backup drive.

    Failed reallocations
    Drives with failed reallocations are a more serious issue. This indicates the drive attempted to reallocate sectors but failed. This state means the drive has mere days or weeks before failure. Replace this drive immediately.

    Pending reallocation
    This state occurs when a drive has to retry a sector multiple times and use the checksum to recover its data. The drive then marks the sector for reallocation. It is unclear whether this is a permanent failure mode, but it likely is.

    What we recommend is you certify the disk.

    If the disk passes a certify without reallocating these sectors, you can put it back into service.

    Unreliable sectors
    This is the state most likely to be correctable. If a cable gets pulled unexpectedly, the Thunderbolt bus ejects or resets, or there is a power surge or brownout, the drive may have to retry reading the data. The disk then marks the affected sector as “unreliable.”

    What we recommend is you certify the disk.

    If the disk passes a certify without reallocating these sectors, you can put it back into service.

    A disk with “unreliable” sectors often passes the certify and clears this error condition.
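The recommendations for the four predicted failure modes can be summarized as a lookup table plus one follow-up rule for certified disks. The sketch below is illustrative only: `RECOMMENDED_ACTION` and `after_certify` are hypothetical names, and treating a failed certify like a reallocated-sector alert is an assumption made for this example.

```python
# Recommended first step for each predicted failure mode, as described above.
RECOMMENDED_ACTION = {
    "reallocated sectors": "plan to replace the disk",
    "failed reallocations": "replace the disk immediately",
    "pending reallocation": "certify the disk",
    "unreliable sectors": "certify the disk",
}


def after_certify(certify_passed, sectors_reallocated_during_certify):
    """Follow-up for a disk that was certified (pending or unreliable sectors)."""
    if certify_passed and sectors_reallocated_during_certify == 0:
        return "return to service"
    # Assumption: any other outcome is handled like a reallocated-sector alert.
    return RECOMMENDED_ACTION["reallocated sectors"]
```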

    Excess Power on Hours on Disks
    Hard drives have a limited service life. Most industry experts suggest replacing mission-critical drives after 20,000 to 25,000 hours (about 3 years). Once drives have this many hours of use, they begin to fail at a statistically faster rate. For example, each drive in a set may have a 1.5% likelihood of failing in a given year. As the drives age, this number increases slowly, but by 20,000 to 25,000 hours it is much higher, at least 5-10% per year of use.

    Very few drives survive past 40,000 hours (about 4 and a half years), and by 50,000 hours (about 5 and a half years), 90% of drives have failed.

    SoftRAID alerts you when drives start to reach the point where the failure rate increases, so you can replace your drives before they fail.
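Power-on hours are easier to judge once converted to years of continuous operation (8,760 hours per year). The sketch below does that conversion and sorts a drive into rough age brackets using the hour figures quoted above. The function names are hypothetical, and the brackets are taken from this article's numbers, not from the thresholds SoftRAID actually uses for its alerts.

```python
HOURS_PER_YEAR = 24 * 365  # 8,760 hours of continuous operation


def power_on_years(hours):
    """Convert power-on hours to years of continuous use."""
    return hours / HOURS_PER_YEAR


def age_bracket(hours):
    """Rough age bracket based on the hour figures quoted in the text."""
    if hours < 20_000:
        return "normal"
    if hours < 40_000:
        return "elevated failure rate: plan replacement"
    return "few drives survive this long: replace"
```

With these definitions, 20,000 hours is about 2.3 years and 50,000 hours is about 5.7 years of continuous operation.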

    Media Wear SSD + Flash Media worn out
    Flash media (SSD/NVMe drives) do not fail in the same way as mechanical hard drives. It is still exceedingly difficult to predict in advance when a flash media drive will fail.

    Flash media is designed to constantly reallocate bad sectors, so reallocation counts do not make sense as a predictor of failure. However, flash media drives have a limited number of spare sectors, so when these run low, the drive is likely to fail imminently.

    SoftRAID alerts you when the drive is running out of spare sectors. Where SoftRAID can obtain the “media wear” indicator, it shows you this number, which starts at 100% and gradually goes down to 90%, 80%, and so on.

    Flash media drives should perform well down to 10% remaining. At 10% wear (spare sectors) remaining, you need to replace the drive as soon as possible.

    Media worn out:
    When a flash drive runs out of spare sectors, the drive has failed. Flash drives are designed to keep working when they run out of spare sectors, but they are converted to “read only” and can no longer be written to. We have also seen drives simply stop working completely without notice. Never let a flash media drive go all the way down to 0% media wear, as unpredictable behavior may occur.
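The media wear thresholds can be expressed as a short check. This is a sketch only: `media_wear_status` is a hypothetical helper that takes the remaining media wear percentage (100 for a new drive, 0 for a worn-out one) and applies the 10% rule described above.

```python
def media_wear_status(remaining_percent):
    """Classify a flash drive by its remaining media wear percentage."""
    if not 0 <= remaining_percent <= 100:
        raise ValueError("remaining_percent must be between 0 and 100")
    if remaining_percent == 0:
        # Out of spare sectors: the drive may be read-only or stop working.
        return "worn out"
    if remaining_percent <= 10:
        return "replace as soon as possible"
    return "ok"
```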

    Failed SMART/Predicted failure on Exempted Drives

    [INSERT DETAILS ON WHY/WHAT THIS MEANS]

    Have no idea what this is? Is this a Windows thing?

    IO Errors
    SoftRAID reports any time a disk has an IO error. If possible, SoftRAID also saves the IO error counters to the disks and volumes.

    An IO error means a read or a write “failed” to be completed. There are several reasons for this:

  • The disk failed, or is failing
  • The disk (or its enclosure) temporarily “hung”
  • The disk(s) ejected or were unplugged during an IO event (read or write)
  • A kernel panic occurred, where the Mac was unable to continue and crashed
  • The disk was asked to read or write to an impossible location (e.g., because of a damaged volume directory)

    IO errors should be treated seriously, but they need to be investigated. An IO error does not necessarily mean your disk has failed. It means a read or write did not complete, for one of the reasons above.

    To clear the IO error count on disks or volumes, select the disk(s) or volume, choose “Clear IO counters” from the SoftRAID Utilities menu, and select errors only.

    Look for a pattern to this if it happens again. You can open a support case with SoftRAID, and save a SoftRAID tech support file, if you need help diagnosing this issue.

    Note: if you get IO errors on disks that have reallocated sectors or are predicted to fail, the errors are likely the result of disk failure, and the disks need to be replaced immediately.
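The guidance on IO errors amounts to a two-step triage, sketched below. `io_error_triage` is a hypothetical helper written for this example; it is not a SoftRAID function, and the wording of the returned advice is a paraphrase of the text above.

```python
def io_error_triage(io_error_count, has_predicted_failure):
    """Suggest a next step for a disk based on its IO error count.

    has_predicted_failure: True if the disk has reallocated sectors or is
    otherwise predicted to fail.
    """
    if io_error_count == 0:
        return "no action"
    if has_predicted_failure:
        return "replace the disk immediately"
    # IO errors alone do not prove the disk failed, so investigate first.
    return "investigate the cause and watch for a pattern"
```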
