The FreeBSD Diary (TM)
Providing practical examples since 1998
RAID-5 drive failure
25 July 2006
Back in December 2004, I wrote about implementing hardware RAID-5. Yesterday, one of the drives in that array failed. Here is the output which shows a failed drive:
The key point is the 'Failed drive' status. I happen to have an identical spare drive on hand for exactly this event.
But before I do that, let us try a rebuild
Let's try a rebuild first:

    raidutil -a rebuild d0 d0b0t0d0

As I type this, the system is now in this state (I will show only a small extract from the full output):

    $ sudo raidutil -L all
    Password:
    RAIDUTIL  Version: 3.04  Date: 9/27/2000  FreeBSD CLI Configuration Utility
    Adaptec ENGINE  Version: 3.04  Date: 9/27/2000  Adaptec FreeBSD SCSI Engine

    #  b0 b1 b2  Controller  Cache  FW    NVRAM     Serial     Status
    ---------------------------------------------------------------------------
    d0 -- -- --  ADAP2400A   16MB   3A0L  CHNL 1.1  BF0B111Z0B4Optimal

    Physical View
    Address     Type               Manufacturer/Model  Capacity  Status
    ---------------------------------------------------------------------------
    d0b0t0d0    Disk Drive (DASD)  ST380011 A          76319MB   Optimal
    d0b1t0d0    Disk Drive (DASD)  ST380011 A          76319MB   Optimal
    d0b2t0d0    Disk Drive (DASD)  ST380011 A          76319MB   Optimal
    d0b3t0d0    Disk Drive (DASD)  ST380011 A          76319MB   Replaced Drive

    Logical View
    Address     Type               Manufacturer/Model  Capacity  Status
    ---------------------------------------------------------------------------
    d0b0t0d0    RAID 5 (Redundant  ADAPTEC RAID-5      228957MB  Reconstruct 3%
     d0b1t0d0   Disk Drive (DASD)  ST380011 A          76319MB   Optimal
     d0b0t0d0   Disk Drive (DASD)  ST380011 A          76319MB   Optimal
     d0b2t0d0   Disk Drive (DASD)  ST380011 A          76319MB   Optimal
     d0b3t0d0   Disk Drive (DASD)  ST380011 A          76319MB   Replaced Drive
    ...

As you can see, we are at about 3% on the rebuild. Time to wait and see. It is now 15:03 EDT.
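Rather than rereading the output by eye while waiting, the status column can be pulled out with a short script. The following is only a minimal sketch in Python: the sample text is copied from the Logical View extract above, while the regular expression and the function names (`parse_devices`, `rebuild_progress`) are my own assumptions about the layout, not anything raidutil provides. Other raidutil versions may format their output differently.

```python
import re

# Sample lines copied from the Logical View extract in the article.
SAMPLE = """\
d0b0t0d0 RAID 5 (Redundant ADAPTEC RAID-5 228957MB Reconstruct 3%
d0b1t0d0 Disk Drive (DASD) ST380011 A 76319MB Optimal
d0b0t0d0 Disk Drive (DASD) ST380011 A 76319MB Optimal
d0b2t0d0 Disk Drive (DASD) ST380011 A 76319MB Optimal
d0b3t0d0 Disk Drive (DASD) ST380011 A 76319MB Replaced Drive
"""

# Assumed layout: address, free-form description, capacity in MB, status text.
LINE_RE = re.compile(r"^\s*(d\d+b\d+t\d+d\d+)\s+(.*?)\s+(\d+)MB\s+(.+?)\s*$")


def parse_devices(text):
    """Return one dict per device line; lines that do not match are skipped."""
    devices = []
    for line in text.splitlines():
        match = LINE_RE.match(line)
        if match:
            address, description, capacity_mb, status = match.groups()
            devices.append({
                "address": address,
                "description": description,
                "capacity_mb": int(capacity_mb),
                "status": status,
            })
    return devices


def rebuild_progress(devices):
    """Return the rebuild percentage if any device reports 'Reconstruct N%'."""
    for device in devices:
        match = re.match(r"Reconstruct\s+(\d+)%", device["status"])
        if match:
            return int(match.group(1))
    return None


devices = parse_devices(SAMPLE)
for device in devices:
    print(device["address"], device["status"])
print("rebuild progress:", rebuild_progress(devices))
```

In practice the text would come from saved `raidutil -L all` output rather than a hard-coded sample; header and separator lines are ignored because they do not start with a device address.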
12 hours later...
Twelve hours later, NetSaint sent me this email:

    ***** NetSaint 0.0.7 *****

    Notification Type: RECOVERY
    Service: RAID
    Host: polo
    Address: polo.unixathome.org
    State: OK
    Date/Time: Tue Jul 25 03:49:49 EDT 2006
    Additional Info: Optimal

The rebuild took about 13 hours all up... I'm glad the machine was online during that time. The machine in question runs the FreshPorts BETA site and is my main development server.
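As a rough sanity check on that figure, the elapsed time and an approximate rebuild rate can be worked out from the two timestamps mentioned in this article. This is a sketch under stated assumptions: the start date of 24 July is inferred from the recovery email arriving early on 25 July, the rebuild was already at about 3% at 15:03 so the true start was somewhat earlier, and the rate treats one drive's capacity (76319MB) as the amount of data written.

```python
from datetime import datetime

# Assumed start: the rebuild was already at ~3% when this time was noted.
started = datetime(2006, 7, 24, 15, 3)
# Timestamp from the NetSaint RECOVERY email.
recovered = datetime(2006, 7, 25, 3, 49, 49)

elapsed = recovered - started
hours = elapsed.total_seconds() / 3600

# Per-drive capacity from the raidutil output; assumed to approximate
# the amount of data written to the replaced drive.
CAPACITY_MB = 76319
rate_mb_per_s = CAPACITY_MB / elapsed.total_seconds()

print(f"elapsed: {hours:.1f} hours")
print(f"approximate rebuild rate: {rate_mb_per_s:.2f} MB/s")
```

The result is a little under 13 hours, consistent with the estimate above. The rate is low for these drives, which is plausible given that the machine stayed in service during the rebuild.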
What caused the problem?
I don't know what caused the problem, but I know some of the symptoms.
After about 20 or 30 minutes of trying to get the system going, I rebooted the box. I wanted to avoid that, because a reboot would of course degrade the RAID array, but I saw no other option.
It was suggested that one drive may have been experiencing an error. A hard drive tries to recover from errors on its own, and that can take a long time. The RAID card sees this and just waits; no I/O occurs during this time. Western Digital has drives which are designed for RAID and feature TLER (Time Limited Error Recovery). Such features have been available on SCSI drives for quite some time. For what it's worth, the drives I'm planning to buy for the Dual Opteron server will have TLER.
Ideas? Suggestions? Comments?