Ever since I got my server started, it’s gone through various changes. It started on OpenSolaris, eventually got stable on FreeNAS, and finally matured into a more permanent FreeBSD.

With that, there have been some hardware changes, but more so in the core guts of the machine. What hasn’t changed is the hard drives it has been running on. Unfortunately, they’re also getting a tad long in the tooth. Dumping out the Power_On_Hours line from smartctl gives me a range of 30998–43674 hours (3.54–4.99 years). Yup. They’ve been powered on for upwards of 5 years now.
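
If you want to pull the same numbers, a loop along these lines will dump them (just a sketch; adjust the device list to match your own drives):
# Power_On_Hours raw value is column 10 of smartctl -A; 8766 hours per year.
for d in /dev/ada0 /dev/ada1 /dev/ada2 /dev/ada3; do
  printf '%s: ' ${d}
  sudo smartctl -A ${d} | awk '/Power_On_Hours/ {printf "%d hours (%.2f years)\n", $10, $10/8766}'
done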

Most of the drives are doing fine (the Reallocated_Sector_Ct line reads 0 for half of them), but some are slowly accruing bad sectors (most are in the single digits, but I have two at 15 & 37). Unfortunately, these usually pop up overnight during the daily script FreeBSD runs from the smartmontools port, so by the morning, I’ve already gotten the email (example shown when you click through to the rest of the post) that the drive has been taken offline. Since these are all in a RAIDZ setup, a single drive loss is no big deal, but I do have to resolve the issue so the array does not remain degraded. After doing this over numerous incremental errors (out of a dozen read errors, I get maybe 1 or 2 reallocated sectors), I’ve semi-automated the process (although I need to write a better bash script to do this without intervention).
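
A similar loop gives a quick overview of the reallocation counts across the drives (again, just a sketch with an abbreviated device list):
# Raw Reallocated_Sector_Ct per drive (column 10); 0 means no remapped sectors yet.
for d in /dev/ada0 /dev/ada1 /dev/ada2 /dev/ada3; do
  printf '%s: ' ${d}
  sudo smartctl -A ${d} | awk '/Reallocated_Sector_Ct/ {print $10}'
done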

So this is typically what I get emailed by the maintenance scripts:

This message was generated by the smartd daemon running on:

host name: ********
DNS domain: ********

The following warning/error was logged by the smartd daemon:

Device: /dev/ada3, 2 Currently unreadable (pending) sectors

Device info:
ST31500341AS, S/N:********, WWN:*-******-*********, FW:****, 1.50 TB

For details see host’s SYSLOG.

You can also use the smartctl utility for further investigation.
No additional messages about this problem will be sent.

So, for this 1.5 TB Seagate Barracuda (I’ve got twelve of them in this server), I’m getting some unreadable sectors, which is eventually followed up with

This message was generated by the smartd daemon running on:

host name: ********
DNS domain: ********

The following warning/error was logged by the smartd daemon:

Device: /dev/ada3, 2 Offline uncorrectable sectors

Device info:
ST31500341AS, S/N:********, WWN:*-******-*********, FW:CC1H, 1.50 TB

For details see host’s SYSLOG.

You can also use the smartctl utility for further investigation.
No additional messages about this problem will be sent.

At this point, ZFS wants to dump the drive from the zpool (if it hasn’t already), and I need to resolve those sectors. To do this, I have to find the sectors that are throwing the errors and manually write to them, which forces the drive to either (a) successfully rewrite the sector in place, or (b) conclude the sector is bad and remap it to a spare, and then repeat this for every problematic sector, one at a time.
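
Before diving in, it’s worth eyeballing the relevant SMART counters on the affected drive so you have a baseline to compare against once you’re done; a quick check along these lines does the trick:
# The same counters smartd complains about in the emails above.
sudo smartctl -A /dev/ada3 | grep -E 'Reallocated_Sector_Ct|Current_Pending_Sector|Offline_Uncorrectable'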

So. First things first: safely remove the drive from the zpool (if it hasn’t been dropped already):
sudo zpool offline -t buffalo ada3
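It doesn’t hurt to confirm the pool actually sees the drive as offline (and is otherwise healthy enough to run degraded) before poking at it; the pool should show as DEGRADED with ada3 marked OFFLINE:
sudo zpool status buffalo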
Next, we need to find the problematic sector, which (typically) means running a long self-test from smartctl to locate the first one:
sudo smartctl -t long /dev/ada3
The time estimate for this on a 1.5 TB drive is about four hours, if I remember correctly. However, the test will stop as soon as it finds the first bad sector, so it may only take 5 minutes (probably bad news if the sectors are that early on the drive), or it may take nearly the full 4 hours (like most of mine; they’re in the last 10% of the drive). You’ll need to keep checking back every now & then for the test results to get that first problematic sector:
sudo smartctl -l selftest /dev/ada3
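Rather than manually re-checking every so often, you can also just poll it (a rough sketch; this assumes the drive reports the running test in its self-test log, otherwise smartctl -c shows the execution status instead):
# Poll every 10 minutes until the most recent self-test is no longer marked as in progress.
while sudo smartctl -l selftest /dev/ada3 | grep -q 'in progress'; do
  sleep 600
done
sudo smartctl -l selftest /dev/ada3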
Once you’ve gotten it, now comes the tediously repetitious part: writing to the sector, re-initiating the test, and repeating until the drive comes back clean. Waiting ~4 hours for each run is unfeasible, so thankfully smartctl can be told which sectors to test using the selective self-test option (smartctl -t select,<start>-<end>). With that in hand, I’ve got a handy single-line command to extract the sector from the selftest output, write to it, and restart the scanning from there:
SECTOR=`sudo smartctl -l selftest /dev/ada3 | grep '# 1' | awk '{print $10}'` && \
LAST=`sudo diskinfo /dev/ada3 | awk '{print $4}'` && \
sudo dd if=/dev/zero of=/dev/ada3 bs=512 seek=${SECTOR} count=1 && \
sudo smartctl -t select,${SECTOR}-${LAST} /dev/ada3

Various options with dd & smartctl will need to be modified for any drive(s) you may also use this on. My drives use 512-byte sectors, but if you have a more up-to-date 4k (“advanced format”) drive, you’ll instead be throwing bs=4096 at it. I’m only writing to a single sector at a time so that I can (a) track each sector & where it is, and (b) minimize the amount of data destroyed (yes, destroyed…this is destructive no matter how you do it). However, if these sectors cluster around each other (they often have for me), you could increase the count option to just barrel through larger swaths of sectors if you’re impatient. I’ll elaborate below on why that is probably not the smartest option. But carrying on…
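If you’re not sure what sector size your drive presents, diskinfo will tell you (the second field of its default output is the logical sector size in bytes), and that’s the size your bs= and the reported LBAs should line up with:
sudo diskinfo /dev/ada3 | awk '{print $2}'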
Lastly, you may also need to tweak the latter number in the select,<start_sector>-<last_sector> option: I use the total number of sectors on the disk (the second line above pulls it out of the drive’s diskinfo output into ${LAST}), so the test only scans from the bad sector onward to the end of the drive. This saves going over all the good sectors before it, and pretty often turns up the next bad sector within a minute or so (at least in my experience). Just keep repeating that all-in-one command line (in bash or sh, you can conveniently do this by just pressing the Up-arrow key followed by Enter) until you get through all the sectors.

Sometimes the gaps are big, so it’s probably a better idea to check the selftest results before running the all-in-one line (to ensure it’s reached the next bad sector), but if you’re doing this while multitasking something else (e.g. laundry, dishes, chores, other server maintenance, watching television, etc.), you could probably skip the selftest check.
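
This, incidentally, is the part begging for that better bash script I mentioned earlier. I haven’t polished one up, but a rough sketch of the loop would look something like the following (same assumptions as above: 512-byte sectors, the drive on ada3, the total sector count from diskinfo, and a drive that reports in-progress tests in its self-test log; and remember the dd writes are destructive):
#!/bin/sh
# Rough sketch: walk through a drive's bad sectors one at a time.
# Assumes an initial long (or selective) self-test has already been kicked off.
DEV=/dev/ada3
LAST=`sudo diskinfo ${DEV} | awk '{print $4}'`

while true; do
  # Wait for the current self-test to finish.
  while sudo smartctl -l selftest ${DEV} | grep -q 'in progress'; do
    sleep 60
  done
  # LBA of the first error from the most recent test (field 10); '-' means it came back clean.
  SECTOR=`sudo smartctl -l selftest ${DEV} | grep '# 1' | awk '{print $10}'`
  if [ "${SECTOR}" = "-" ]; then
    echo "No errors reported; drive looks clean."
    break
  fi
  echo "Rewriting sector ${SECTOR} on ${DEV}"
  sudo dd if=/dev/zero of=${DEV} bs=512 seek=${SECTOR} count=1
  # Resume scanning from that sector to the end of the drive.
  sudo smartctl -t select,${SECTOR}-${LAST} ${DEV}
  sleep 30
done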

Once the selective selftest finally completes cleanly, I like to do one more comprehensive selftest to ensure the drive is currently clear of bad sectors:
sudo smartctl -t long /dev/ada3
…upon which, if this also gives me a nice clean scan, then I bring the drive back online to the zpool:
sudo zpool online buffalo /dev/ada3
…at which point, ZFS will see that there has been some data loss, so it will re-silver the array, rebuilding the lost sectors’ data.
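The re-silver progress (and the return from DEGRADED to ONLINE once it finishes) is visible in the pool status; the -v flag will also call out any files that ended up with unrecoverable errors:
sudo zpool status -v buffalo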

This last point is why keeping the count option at 1 is probably safer than something like 1024 or whatever conveniently large number you feel like. The first time or two I got these errors, I just did the equivalent of a whole-disk format to force remapping of any/all bad sectors, the same way I used to run a regular format (not a “quick format”) on Windows machines to remap bad sectors on a drive (I forget exactly why I’d end up doing this for other people, but it did the trick). However, that approach becomes problematic when it’s time to re-silver the array.

Re-silvering will just repair/rewrite the missing data. If you wipe the entire drive, ZFS has to rebuild the whole drive from the others (the other 3 in my case, for my arrays). This can take hours (maybe even days), depending on how much data is on them and how large they are. What also becomes crucial is that during this time, if even a single other drive in the array runs into a similar read error/bad sector, that entire block/stripe of data is gone. Kaput. Obliviano. Whatever file it was part of is now toast, and you’ve lost data. In the example I laid out above, I wiped 7 individual sectors. That’s (up to) 7 array stripes to rebuild…they default to 128k & can go larger, but you only rebuild those individual stripes during the re-silver…not the entire drive. So the likelihood of running into another bad sector is reduced by several orders of magnitude, not to mention the re-silver time can be on the order of seconds to minutes before the array is safely operational (aka not in a degraded state) again.
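
To put some rough numbers on that (assuming the default 128k stripes and that each wiped sector lands in a different stripe; this is strictly back-of-the-envelope):
# ~7 stripes at 128 KiB each vs. the raw capacity of a wiped 1.5 TB drive, both in KiB:
echo $(( 7 * 128 ))
echo $(( 1500301910016 / 1024 ))
That works out to 896 KiB of stripes versus roughly 1.47 billion KiB of raw drive, around six orders of magnitude apart (in practice the re-silver only touches allocated data, but the point stands).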

So while this “fix” doesn’t guarantee the drive’s problems are gone for good, you’ve resolved the underlying bad sectors and moved the data elsewhere onto good ones. You can significantly prolong the life of the array by working through these bad sectors (the drives are over-provisioned with spare sectors to deal with a certain number of them), but it is only a temporary fix. Ultimately I’m going to need to start replacing these drives (not only due to age, but also because I’m nearing the limits of their storage), but I have some time.

Feedback from other fora regarding similar issues usually has the regulars very quickly jumping to “your drive is dying, replace it ASARP” (as soon as reasonably possible, but I’m paraphrasing here), but I feel that is a bit of an overreaction. Maybe I just need to suffer a massive disk failure to see things differently? But these drives aren’t blowing out bad sectors in huge bouts; they’re occurring a few at a time, so it seems to just be the age of the drives having an effect.

Worst case scenario…I’m remapping bad sectors maybe once a month, but once these drives pick up more & more bad sectors (#37-and-counting is first on my list), it’ll be time to finally drop a new one in its place, and cross my fingers that the re-silver doesn’t (completely) tank another drive (I can manage sector-by-sector losses…but an entire drive failing during a re-silver would kind of suck).
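
When that day comes, the swap itself is at least simple; something along these lines, where ada12 is just a placeholder name for wherever the new disk shows up:
sudo zpool replace buffalo ada3 ada12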
