The FreeBSD Diary (TM)
Providing practical examples since 1998
RAID-5 drive failure
25 July 2006
Back in December 2004, I wrote about implementing hardware RAID-5. Yesterday, one of the drives in that cluster failed. Here is the output which shows a failed drive:

    $ raidutil -L all
    RAIDUTIL  Version: 3.04  Date: 9/27/2000  FreeBSD CLI Configuration Utility
    Adaptec ENGINE  Version: 3.04  Date: 9/27/2000  Adaptec FreeBSD SCSI Engine

    #  b0 b1 b2  Controller  Cache  FW    NVRAM     Serial     Status
    ---------------------------------------------------------------------------
    d0 -- -- --  ADAP2400A   16MB   3A0L  CHNL 1.1  BF0B111Z0B4Optimal

    Physical View
    Address    Type               Manufacturer/Model   Capacity  Status
    ---------------------------------------------------------------------------
    d0b0t0d0   Disk Drive (DASD)  ST380011 A           76319MB   Optimal
    d0b1t0d0   Disk Drive (DASD)  ST380011 A           76319MB   Optimal
    d0b2t0d0   Disk Drive (DASD)  ST380011 A           76319MB   Optimal
    d0b3t0d0   Disk Drive (DASD)  ST380011 A           76319MB   Failed drive

    Logical View
    Address    Type               Manufacturer/Model   Capacity  Status
    ---------------------------------------------------------------------------
    d0b0t0d0   RAID 5 (Redundant  ADAPTEC RAID-5       228957MB  Degraded
    d0b1t0d0   Disk Drive (DASD)  ST380011 A           76319MB   Optimal
    d0b0t0d0   Disk Drive (DASD)  ST380011 A           76319MB   Optimal
    d0b2t0d0   Disk Drive (DASD)  ST380011 A           76319MB   Optimal
    d0b3t0d0   Disk Drive (DASD)  ST380011 A           76319MB   Failed drive

    Address    Max Speed  Actual Rate / Width
    ---------------------------------------------------------------------------
    d0b0t0d0   50 MHz     100 MB/sec  wide
    d0b1t0d0   50 MHz     100 MB/sec  wide
    d0b2t0d0   50 MHz     100 MB/sec  wide
    d0b3t0d0   10 MHz     100 MB/sec  wide

    Address    Manufacturer/Model  Write Cache Mode (HBA/Device)
    ---------------------------------------------------------------------------
    d0b0t0d0   ADAPTEC RAID-5      Write Back / --
    d0b1t0d0   ST380011 A          -- / Write Back
    d0b0t0d0   ST380011 A          -- / Write Back
    d0b2t0d0   ST380011 A          -- / Write Back
    d0b3t0d0   ST380011 A          -- / Write Back

    #  Controller  Cache  FW    NVRAM     BIOS  SMOR      Serial
    ---------------------------------------------------------------------------
    d0 ADAP2400A   16MB   3A0L  CHNL 1.1  1.62  1.12/79I  BF0B111Z0B4

    #  Controller  Status      Voltage  Current  Full Cap  Rem Cap  Rem Time
    ---------------------------------------------------------------------------
    d0 ADAP2400A   No battery

    Address    Manufacturer/Model  FW    Serial    123456789012
    ---------------------------------------------------------------------------
    d0b0t0d0   ST380011 A          3.06  5JVAYH4G  -X-XX--X-O--
    d0b1t0d0   ST380011 A          3.06  5JVB4AY9  -X-XX--X-O--
    d0b2t0d0   ST380011 A          3.06  3JV8XK0N  -X-XX--X-O--
    d0b3t0d0   ST380011 A          3.06  3JV8VS5K  -X-XX--X-O--

    Capabilities Map:  Column 1 = Soft Reset
                       Column 2 = Cmd Queuing
                       Column 3 = Linked Cmds
                       Column 4 = Synchronous
                       Column 5 = Wide 16
                       Column 6 = Wide 32
                       Column 7 = Relative Addr
                       Column 8 = SCSI II
                       Column 9 = S.M.A.R.T.
                       Column 0 = SCAM
                       Column 1 = SCSI-3
                       Column 2 = SAF-TE

    X = Capability Exists, - = Capability does not exist, O = Not Supported

The key point is the 'Failed drive'. I happen to have an identical drive here, just sitting around, for just this event.
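The 'Failed drive' status in the listing above is easy to spot by eye, but a script can pick it out too. What follows is a minimal Python sketch, not the check actually running on this machine. It assumes that every drive or array row starts with an address such as d0b3t0d0 and that the status is whatever follows the capacity column. The names ROW, non_optimal, plugin_result and SAMPLE are made up for illustration, and the exit codes (0 for OK, 2 for critical) follow the usual monitoring-plugin convention rather than anything stated in this article.

```python
import re

# Matches rows such as
#   d0b3t0d0 Disk Drive (DASD) ST380011 A 76319MB Failed drive
# Assumption: the status is whatever follows the "<digits>MB" capacity column.
ROW = re.compile(r"^\s*(d\d+b\d+t\d+d\d+)\s+(.+?)\s+(\d+)MB\s+(\S.*?)\s*$")


def non_optimal(report):
    """Return (address, status) pairs for rows whose status is not 'Optimal'."""
    problems = []
    for line in report.splitlines():
        match = ROW.match(line)
        if match and match.group(4) != "Optimal":
            pair = (match.group(1), match.group(4))
            if pair not in problems:  # a drive is listed in both views
                problems.append(pair)
    return problems


def plugin_result(report):
    """Map a report to an (exit_code, message) pair: 0 = OK, 2 = critical."""
    problems = non_optimal(report)
    if not problems:
        return 0, "Optimal"
    return 2, ", ".join(f"{addr}: {status}" for addr, status in problems)


# A few rows copied from the listing above, including one speed row
# that must be ignored because it has no capacity column.
SAMPLE = """\
d0b2t0d0   Disk Drive (DASD)  ST380011 A      76319MB   Optimal
d0b3t0d0   Disk Drive (DASD)  ST380011 A      76319MB   Failed drive
d0b0t0d0   RAID 5 (Redundant  ADAPTEC RAID-5  228957MB  Degraded
d0b3t0d0   10 MHz     100 MB/sec  wide
"""

print(plugin_result(SAMPLE))
```

To use it against a live controller, the text of `raidutil -L all` would be passed in as the report argument. How that text is captured (cron, sudo, a monitoring plugin wrapper) is left open here.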
But before I do that, let us try a rebuild
Let's try a rebuild first:

    raidutil -a rebuild d0 d0b0t0d0

As I type this, the system is now in this state (I will show only a small extract from the full output):

    $ sudo raidutil -L all
    Password:
    RAIDUTIL  Version: 3.04  Date: 9/27/2000  FreeBSD CLI Configuration Utility
    Adaptec ENGINE  Version: 3.04  Date: 9/27/2000  Adaptec FreeBSD SCSI Engine

    #  b0 b1 b2  Controller  Cache  FW    NVRAM     Serial     Status
    ---------------------------------------------------------------------------
    d0 -- -- --  ADAP2400A   16MB   3A0L  CHNL 1.1  BF0B111Z0B4Optimal

    Physical View
    Address    Type               Manufacturer/Model   Capacity  Status
    ---------------------------------------------------------------------------
    d0b0t0d0   Disk Drive (DASD)  ST380011 A           76319MB   Optimal
    d0b1t0d0   Disk Drive (DASD)  ST380011 A           76319MB   Optimal
    d0b2t0d0   Disk Drive (DASD)  ST380011 A           76319MB   Optimal
    d0b3t0d0   Disk Drive (DASD)  ST380011 A           76319MB   Replaced Drive

    Logical View
    Address    Type               Manufacturer/Model   Capacity  Status
    ---------------------------------------------------------------------------
    d0b0t0d0   RAID 5 (Redundant  ADAPTEC RAID-5       228957MB  Reconstruct 3%
    d0b1t0d0   Disk Drive (DASD)  ST380011 A           76319MB   Optimal
    d0b0t0d0   Disk Drive (DASD)  ST380011 A           76319MB   Optimal
    d0b2t0d0   Disk Drive (DASD)  ST380011 A           76319MB   Optimal
    d0b3t0d0   Disk Drive (DASD)  ST380011 A           76319MB   Replaced Drive
    ...

As you can see, we are at about 3% on the rebuild. Time to wait and see. It is now 15:03 EST.
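Rather than re-reading the full listing every so often, the rebuild percentage can be pulled out of the status column. Below is a minimal Python sketch under a couple of assumptions: the array reports its progress as "Reconstruct N%", as in the extract above, and the rebuild proceeds at a roughly steady pace, which it may well not on a busy machine. The names PROGRESS, rebuild_percent and eta_seconds are hypothetical.

```python
import re

# Assumption: a rebuilding array shows "Reconstruct N%" in its status column,
# as in the extract above; other firmware or tool versions may word it differently.
PROGRESS = re.compile(r"\bReconstruct\s+(\d{1,3})%")


def rebuild_percent(report):
    """Return the rebuild percentage as an int, or None if no rebuild is reported."""
    match = PROGRESS.search(report)
    return int(match.group(1)) if match else None


def eta_seconds(percent_then, percent_now, seconds_between):
    """Linearly extrapolate the time left from two progress samples.

    Returns None when no progress was made between the two samples.
    """
    gained = percent_now - percent_then
    if gained <= 0:
        return None
    return (100 - percent_now) * seconds_between / gained


line = "d0b0t0d0 RAID 5 (Redundant ADAPTEC RAID-5 228957MB Reconstruct 3%"
print(rebuild_percent(line))
```

Taking two samples some minutes apart and feeding them to eta_seconds gives a rough idea of how long the wait will be. The extrapolation is only as good as the assumption that the rebuild rate stays constant.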
12 hours later...
Twelve hours later, NetSaint sent me this email:

    ***** NetSaint 0.0.7 *****

    Notification Type: RECOVERY

    Service: RAID
    Host: polo
    Address: polo.unixathome.org
    State: OK

    Date/Time: Tue Jul 25 03:49:49 EDT 2006

    Additional Info:

    Optimal

The rebuild took about 13 hours all up... I'm glad the machine was online during that time. The machine in question runs the FreshPorts BETA site and is my main development server.
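For a rough idea of how fast the controller was rebuilding, the figures above can be turned into a rate. This is a back-of-the-envelope Python sketch with several assumptions: the 15:03 note was made on 24 July, both timestamps are on the same local clock, the array was at 3% at that point, and the NetSaint recovery time is close to the moment the rebuild actually finished. The variable names are mine.

```python
from datetime import datetime

# Figures taken from the article. The 24 July date is an assumption
# inferred from "yesterday"; the array was already at 3% at 15:03.
noted_at = datetime(2006, 7, 24, 15, 3, 0)
recovered_at = datetime(2006, 7, 25, 3, 49, 49)
drive_mb = 76319            # capacity of the replaced drive
percent_done_at_note = 3    # "Reconstruct 3%"

elapsed = recovered_at - noted_at
hours = elapsed.total_seconds() / 3600

# Only the remaining 97% of the drive was rebuilt during this window.
rebuilt_mb = drive_mb * (100 - percent_done_at_note) / 100
rate = rebuilt_mb / elapsed.total_seconds()

print(f"{hours:.1f} hours, about {rate:.2f} MB/s")
```

The result is well under 2 MB/s, which fits a controller rebuilding in the background while the machine stayed in service. Treat it as an estimate only.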
What caused the problem?
I don't know what caused the problem. I know some of the symptoms.
After about 20 or 30 minutes of trying to get the system going, I rebooted it. I knew this would degrade the RAID array, and I wanted to avoid that, but I saw no other option.

It was suggested that one drive may have been experiencing an error. Hard drives try to recover from errors on their own, and the attempt can take a long time. The RAID card sees this and simply waits; no I/O occurs during that time. Western Digital has drives which are designed for RAID and feature TLER (Time Limited Error Recovery). Such features have been available on SCSI drives for quite some time. For what it's worth, the drives I'm planning to buy for the Dual Opteron server will have TLER.

Ideas? Suggestions? Comments? Please use the comments link to the right.