What is the best way to monitor for bad sectors?

PeterSteele · #1 · 9th August 2008

We are developing an appliance based on FreeBSD 7.0. We will be running a custom FS on top of both UFS and raw partitions and will need a way to deal with bad sectors. Ideally we'd like to be able to subscribe to a system event and have FreeBSD alert us when a bad sector is detected. I don't think there is a mechanism to do this, though. We've noted that the system does seem to log a message in /var/log/dmesg when a bad sector is encountered. For example, these messages are occurring on one of our systems right now:

ad8: FAILURE - READ_DMA48 status=51<READY,DSC,ERROR> error=40<UNCORRECTABLE> LBA=343237347
g_vfs_done():ad8s1d[READ(offset=143525234688, length=2048)]error = 5
ad8: FAILURE - READ_DMA48 status=51<READY,DSC,ERROR> error=40<UNCORRECTABLE> LBA=343237347
g_vfs_done():ad8s1d[READ(offset=143525234688, length=2048)]error = 5
ad8: FAILURE - READ_DMA48 status=51<READY,DSC,ERROR> error=40<UNCORRECTABLE> LBA=343237347
g_vfs_done():ad8s1d[READ(offset=143525234688, length=2048)]error = 5

I assume these are due to bad sectors. So we thought one alternative could simply be to have a process that periodically scans /var/log/dmesg and reports back to our management system when one of these errors is detected.
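The log-scanning idea can be sketched roughly as follows. This is only an illustration in Python, not a tested monitoring tool: the regular expression is modelled on the `FAILURE` lines quoted above, the names `parse_line`, `scan` and `bad_lbas_by_device` are made up for this example, and other drivers or FreeBSD releases may well format these messages differently.

```python
import re
from typing import Dict, Iterable, Iterator, List, NamedTuple, Optional

# Pattern modelled on the ATA failure lines quoted in the post.
# search() is used so that a syslog-style prefix before the device
# name does not prevent a match.
ATA_FAILURE = re.compile(
    r"\b(?P<dev>\w+): FAILURE - (?P<cmd>\w+) "
    r"status=(?P<status>[0-9a-fA-F]+)<(?P<status_flags>[^>]*)> "
    r"error=(?P<error>[0-9a-fA-F]+)<(?P<error_flags>[^>]*)> "
    r"LBA=(?P<lba>\d+)"
)


class DiskError(NamedTuple):
    device: str       # e.g. "ad8"
    command: str      # e.g. "READ_DMA48"
    error_flags: str  # e.g. "UNCORRECTABLE"
    lba: int          # failing logical block address


def parse_line(line: str) -> Optional[DiskError]:
    """Return a DiskError for an ATA failure line, or None for any other line."""
    m = ATA_FAILURE.search(line)
    if m is None:
        return None
    return DiskError(m["dev"], m["cmd"], m["error_flags"], int(m["lba"]))


def scan(lines: Iterable[str]) -> Iterator[DiskError]:
    """Yield one DiskError per matching line, in input order."""
    for line in lines:
        err = parse_line(line)
        if err is not None:
            yield err


def bad_lbas_by_device(lines: Iterable[str]) -> Dict[str, List[int]]:
    """Collapse repeated reports into a sorted list of distinct LBAs per device."""
    seen: Dict[str, set] = {}
    for err in scan(lines):
        seen.setdefault(err.device, set()).add(err.lba)
    return {dev: sorted(lbas) for dev, lbas in seen.items()}
```

A poller would open whichever file the kernel messages end up in on the target system, pass the lines to `bad_lbas_by_device`, and report any device that appears in the result. Collapsing by LBA matters because, as in the excerpt above, the same sector is typically reported several times.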

Is this a reliable/reasonable approach? Is there a better way to monitor a system for bad sectors?

I should add that we do have SMART installed and operational, but our experience so far with SMART is that it will not alert us to a sector that goes bad without warning. If we're mistaken about this, we'd appreciate some feedback. Thanks.

PeterSteele · #2 · 10th August 2008

I should also point out that these are SATA drives. A preferred option, of course, would be to query the SATA driver directly to pull this kind of information. Is there an ioctl for this kind of query? I don't know for sure, but I assume we are using the generic FreeBSD SATA driver.

J65nko · #3 · 10th August 2008

You could have a look at the source code of http://www.freebsd.org/cgi/url.cgi?p...ools/pkg-descr . And/or the source of the OpenBSD atactl(8) utility.

If this doesn't get you any further, you could consider posting on one of the FreeBSD mailing lists, where the developers hang out.

For the reader who is interested in SMART see http://en.wikipedia.org/wiki/Self-Mo...ing_Technology

lvlamb · #4 · 10th August 2008

Yeah! In theory bad sectors are managed through S.M.A.R.T., but frankly, if bad sectors are a concern and have to be monitored, junk that bloody hard drive.
Once S.M.A.R.T. signals unrecoverable errors, the drive is already dead and begs for a sysutils/testdisk of last resort.

PeterSteele · #5 · 11th August 2008

I will have to investigate S.M.A.R.T further. We do have some monitoring currently in place but a recent case of bad sectors did not get detected by our SMART subsystem. There may be more that we can do here; I've just taken over this area of our code and I have some learning to do...
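As a starting point for that investigation, here is a rough Python sketch of polling the sector-related SMART counters. It assumes the smartmontools `smartctl` utility is what provides SMART on the system, and it relies on the attribute names smartmontools commonly prints (`Reallocated_Sector_Ct`, `Current_Pending_Sector`, `Offline_Uncorrectable`). Not every drive reports all three, and whether these counters would have caught the case described above is something to verify on the affected hardware. The function names are made up for this example.

```python
import subprocess
from typing import Dict

# Attribute names as smartmontools usually labels them; drives differ.
WATCHED = (
    "Reallocated_Sector_Ct",
    "Current_Pending_Sector",
    "Offline_Uncorrectable",
)


def parse_smart_attributes(text: str) -> Dict[str, int]:
    """Extract raw values of the watched attributes from `smartctl -A` output."""
    counts: Dict[str, int] = {}
    for line in text.splitlines():
        fields = line.split()
        # Attribute rows start with a numeric ID; the raw value is column 10.
        if len(fields) >= 10 and fields[0].isdigit() and fields[1] in WATCHED:
            if fields[9].isdigit():
                counts[fields[1]] = int(fields[9])
    return counts


def read_smart_attributes(device: str) -> Dict[str, int]:
    """Run `smartctl -A` on a device node such as /dev/ad8 and parse the result."""
    # smartctl's exit status is a bit mask, so a nonzero status is not
    # treated as a failure to produce output here.
    proc = subprocess.run(
        ["smartctl", "-A", device],
        capture_output=True, text=True, check=False,
    )
    return parse_smart_attributes(proc.stdout)


def has_bad_sectors(counts: Dict[str, int]) -> bool:
    """True if any watched counter is above zero."""
    return any(value > 0 for value in counts.values())
```

Comparing successive readings rather than only checking for nonzero values would also show whether a drive is getting worse over time.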

lvlamb · #6 · 11th August 2008

IDE drives have dedicated spare cylinder(s) to which bad sectors are remapped. When these cylinders are full, bad sectors can't be remapped anymore and just get flagged.
In theory, S.M.A.R.T. should issue warnings, if well implemented by both you and the manufacturer.

What I would do is run the manufacturer's utility disk. A warranty test or RMA test would compare the manufacturer's specs to the real status of the drive and possibly generate a report (or just a code number) of the defects found.
Sometimes, "zeroing" the drive (long process, writes patterns on the whole surface) can help revive the drive.
For a while. As long as new bad sectors aren't found.
You could still use the drive for testing purposes or non-critical backups, but not anymore for intensive read-writes.

In any case, bad sectors mean the drive isn't fit for production any longer.

PeterSteele · #7 · 12th August 2008

The issue is that these systems will be running at customer sites and we need our software to be able to detect failing drives and take corrective action. It's an HA system, so when a drive is deemed unusable another drive will have to take over. We can detect a completely failed drive; it's just these bad sectors we need to find a way to deal with. We ultimately want to treat a drive that we detect bad sectors on the same way as a drive that fails completely. Essentially, we want to be alerted as soon as a sector goes bad, or at least within a very short period of time...
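The policy itself (first bad sector means the drive is treated as failed) could look something like this Python sketch. `DriveMonitor` and its `on_fail` callback are hypothetical names for illustration; how bad-sector events reach `report_bad_sector`, and what the failover actually does, are left open.

```python
from typing import Callable, Dict, Set


class DriveMonitor:
    """Marks a drive as failed the first time a bad sector is reported on it."""

    def __init__(self, on_fail: Callable[[str, int], None]) -> None:
        self._on_fail = on_fail              # hypothetical failover hook
        self._failed: Set[str] = set()
        self.bad_lbas: Dict[str, Set[int]] = {}

    def report_bad_sector(self, device: str, lba: int) -> None:
        # Record every distinct LBA, but trigger failover only once per drive.
        self.bad_lbas.setdefault(device, set()).add(lba)
        if device not in self._failed:
            self._failed.add(device)
            self._on_fail(device, lba)

    def is_failed(self, device: str) -> bool:
        return device in self._failed
```

Firing the callback only once per drive keeps repeated reports of the same sector from triggering repeated failovers.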

phoenix · #8 · 13th August 2008

Sounds like you are trying to recreate RAID in software. Wouldn't it make more sense to invest in good quality hardware RAID controllers, that already have this kind of monitoring built-in, along with e-mail and/or SMS alerts?

We've been using AMCC/3Ware 9000-, 9500-, and 9600-series RAID controllers for just this reason, and the onboard e-mail alerts have allowed us to replace 4 drives so far, before they died completely (and still within the manufacturer's warranty period).

These do SMART monitoring along with a bunch of other stuff. If you don't need the RAID features, you can create "Single Disk" arrays that allow you to see/use each drive individually, but with all the onboard cache and monitoring features of the controller (we do this in our ZFS storage servers).

PeterSteele · #9 · 14th August 2008

Yes, it does seem like RAID would be the obvious solution but the nature of our product precludes this option...

phoenix · #10 · 16th August 2008

Even a low-profile, 4-port version?
