The FreeBSD Diary (TM)

Providing practical examples since 1998

RAID-5 drive failure 25 July 2006

Back in December 2004, I wrote about implementing hardware RAID-5. Yesterday, one of the drives in that array failed.

Here is the raidutil output showing the failed drive:

$ raidutil -L all
RAIDUTIL  Version: 3.04  Date: 9/27/2000  FreeBSD CLI Configuration Utility
Adaptec ENGINE  Version: 3.04  Date: 9/27/2000  Adaptec FreeBSD SCSI Engine

#  b0 b1 b2  Controller     Cache  FW    NVRAM     Serial     Status
---------------------------------------------------------------------------
d0 -- -- --  ADAP2400A      16MB   3A0L  CHNL 1.1  BF0B111Z0B4Optimal

Physical View
Address    Type              Manufacturer/Model         Capacity  Status
---------------------------------------------------------------------------
d0b0t0d0   Disk Drive (DASD) ST380011 A                 76319MB   Optimal
d0b1t0d0   Disk Drive (DASD) ST380011 A                 76319MB   Optimal
d0b2t0d0   Disk Drive (DASD) ST380011 A                 76319MB   Optimal
d0b3t0d0   Disk Drive (DASD) ST380011 A                 76319MB   Failed drive

Logical View
Address       Type              Manufacturer/Model      Capacity  Status
---------------------------------------------------------------------------
d0b0t0d0      RAID 5 (Redundant ADAPTEC  RAID-5         228957MB  Degraded
 d0b1t0d0     Disk Drive (DASD) ST380011 A              76319MB   Optimal
 d0b0t0d0     Disk Drive (DASD) ST380011 A              76319MB   Optimal
 d0b2t0d0     Disk Drive (DASD) ST380011 A              76319MB   Optimal
 d0b3t0d0     Disk Drive (DASD) ST380011 A              76319MB   Failed drive


Address    Max Speed  Actual Rate / Width
---------------------------------------------------------------------------
d0b0t0d0   50 MHz     100 MB/sec    wide
d0b1t0d0   50 MHz     100 MB/sec    wide
d0b2t0d0   50 MHz     100 MB/sec    wide
d0b3t0d0   10 MHz     100 MB/sec    wide

Address    Manufacturer/Model        Write Cache Mode (HBA/Device)
---------------------------------------------------------------------------
d0b0t0d0   ADAPTEC  RAID-5           Write Back / --
 d0b1t0d0  ST380011 A                -- / Write Back
 d0b0t0d0  ST380011 A                -- / Write Back
 d0b2t0d0  ST380011 A                -- / Write Back
 d0b3t0d0  ST380011 A                -- / Write Back

#  Controller     Cache  FW    NVRAM     BIOS   SMOR      Serial
---------------------------------------------------------------------------
d0 ADAP2400A      16MB   3A0L  CHNL 1.1  1.62   1.12/79I  BF0B111Z0B4

#  Controller      Status     Voltage  Current  Full Cap  Rem Cap  Rem Time
---------------------------------------------------------------------------
d0 ADAP2400A       No battery

Address    Manufacturer/Model        FW          Serial        123456789012
---------------------------------------------------------------------------
d0b0t0d0   ST380011 A                3.06 5JVAYH4G             -X-XX--X-O--
d0b1t0d0   ST380011 A                3.06 5JVB4AY9             -X-XX--X-O--
d0b2t0d0   ST380011 A                3.06 3JV8XK0N             -X-XX--X-O--
d0b3t0d0   ST380011 A                3.06 3JV8VS5K             -X-XX--X-O--

Capabilities Map:  Column 1 = Soft Reset
                   Column 2 = Cmd Queuing
                   Column 3 = Linked Cmds
                   Column 4 = Synchronous
                   Column 5 = Wide 16
                   Column 6 = Wide 32
                   Column 7 = Relative Addr
                   Column 8 = SCSI II
                   Column 9 = S.M.A.R.T.
                   Column 0 = SCAM
                   Column 1 = SCSI-3
                   Column 2 = SAF-TE
   X = Capability Exists, - = Capability does not exist, O = Not Supported

The key point is the 'Failed drive' status on d0b3t0d0, which leaves the array 'Degraded'. I happen to have an identical drive sitting here for just such an event.
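
Spotting that line by eye works, but it is also easy to script. Here is a minimal sketch (not something from the original setup) that reads raidutil -L all output on standard input and prints the address of every drive in the Physical View whose status is not Optimal. The function name non_optimal_drives is made up for illustration, and the parsing assumes the column layout shown in the output above.

```sh
#!/bin/sh
# Sketch only: list physical drives whose status is anything but "Optimal".
# Input: the text produced by `raidutil -L all`, on stdin.
# Assumption: drive rows look like
#   d0b3t0d0   Disk Drive (DASD) ST380011 A   76319MB   Failed drive
non_optimal_drives() {
    awk '
        # Track whether we are inside the "Physical View" section.
        /^[ \t]*Physical View/ { in_phys = 1; next }
        /^[ \t]*Logical View/  { in_phys = 0 }
        # A drive row has "Disk" as its second field; report it unless
        # the line ends with the word Optimal.
        in_phys && $2 == "Disk" && $0 !~ /Optimal[ \t]*$/ { print $1 }
    '
}

# Usage on the affected host:
#   raidutil -L all | non_optimal_drives
```

Fed the output above, this should print d0b3t0d0. Note that it reports any non-Optimal status, so a drive that is being rebuilt would be listed as well.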

Before swapping it in, though, let's try a rebuild:

raidutil -a rebuild d0 d0b0t0d0

As I type this, the system is now in this state (I will show only a small extract from the full output):

 $ sudo raidutil -L all
 Password:
 RAIDUTIL  Version: 3.04  Date: 9/27/2000  FreeBSD CLI Configuration Utility
 Adaptec ENGINE  Version: 3.04  Date: 9/27/2000  Adaptec FreeBSD SCSI Engine
 
 #  b0 b1 b2  Controller     Cache  FW    NVRAM     Serial     Status
 ---------------------------------------------------------------------------
 d0 -- -- --  ADAP2400A      16MB   3A0L  CHNL 1.1  BF0B111Z0B4Optimal
 
 Physical View
 Address    Type              Manufacturer/Model         Capacity  Status
 ---------------------------------------------------------------------------
 d0b0t0d0   Disk Drive (DASD) ST380011 A                 76319MB   Optimal
 d0b1t0d0   Disk Drive (DASD) ST380011 A                 76319MB   Optimal
 d0b2t0d0   Disk Drive (DASD) ST380011 A                 76319MB   Optimal
 d0b3t0d0   Disk Drive (DASD) ST380011 A                 76319MB   Replaced Drive
 
 Logical View
 Address       Type              Manufacturer/Model      Capacity  Status
 ---------------------------------------------------------------------------
 d0b0t0d0     RAID 5 (Redundant ADAPTEC  RAID-5         228957MB  Reconstruct 3%
 d0b1t0d0     Disk Drive (DASD) ST380011 A              76319MB   Optimal
 d0b0t0d0     Disk Drive (DASD) ST380011 A              76319MB   Optimal
 d0b2t0d0     Disk Drive (DASD) ST380011 A              76319MB   Optimal
 d0b3t0d0     Disk Drive (DASD) ST380011 A              76319MB   Replaced Drive
 ...

As you can see, the rebuild is at about 3%. Time to wait and see. It is now 15:03 EDT.
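
Rather than rerunning the command by hand, the progress can be pulled out of the RAID 5 row in the Logical View. What follows is a rough sketch under the assumption that the status is whatever follows the capacity column, as in the output above. The helper name array_status is hypothetical, and the polling loop with its 10-minute interval is only a suggestion.

```sh
#!/bin/sh
# Sketch only: print the status of the first RAID array in the Logical View,
# for example "Degraded", "Reconstruct 3%" or "Optimal".
# Input: the text produced by `raidutil -L all`, on stdin.
array_status() {
    awk '
        /^[ \t]*Logical View/ { in_log = 1; next }
        # First row after the section header that has a device address in
        # column one and mentions RAID is taken to be the array itself.
        in_log && !found && /RAID/ && $1 ~ /^d[0-9]+b[0-9]+t[0-9]+d[0-9]+$/ {
            sub(/^.*[0-9]+MB[ \t]+/, "")   # drop everything up to the capacity
            sub(/[ \t]+$/, "")             # drop trailing whitespace
            print
            found = 1
        }
    '
}

# Hypothetical polling loop (commented out so nothing runs by accident):
#   while :; do
#       status=$(sudo raidutil -L all | array_status)
#       echo "$(date '+%H:%M') $status"
#       [ "$status" = "Optimal" ] && break
#       sleep 600
#   done
```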

12 hours later...

Twelve hours later, NetSaint sent me this email:

***** NetSaint 0.0.7 *****

Notification Type: RECOVERY

Service: RAID
Host: polo
Address: polo.unixathome.org
State: OK

Date/Time: Tue Jul 25 03:49:49 EDT 2006

Additional Info: Optimal

All up, the rebuild took about 13 hours. I'm glad the machine stayed online during that time. The machine in question runs the FreshPorts BETA site and is my main development server.
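
The notification above comes from a NetSaint service check whose plugin output was 'Optimal'. The plugin itself is not shown in this article, so the following is only a sketch of how such a check could turn the array status into a plugin result. It relies on the usual plugin convention of one line of text plus an exit code of 0 (OK), 1 (WARNING) or 2 (CRITICAL). The function name raid_state_code and the choice to treat a rebuild as WARNING are assumptions.

```sh
#!/bin/sh
# Sketch only: map an array status string to a NetSaint-style plugin result.
# $1 = status text taken from the RAID 5 row of `raidutil -L all`
#      (e.g. "Optimal", "Degraded", "Reconstruct 3%").
# Prints one line of plugin output and returns the matching exit code.
raid_state_code() {
    case "$1" in
        Optimal)
            echo "Optimal"
            return 0 ;;                 # OK
        Reconstruct*)
            echo "$1"
            return 1 ;;                 # WARNING: array is rebuilding
        "")
            echo "no array status found"
            return 2 ;;                 # CRITICAL: could not read the status
        *)
            echo "$1"
            return 2 ;;                 # CRITICAL: Degraded or anything else
    esac
}

# In a real plugin the last two lines would be something like:
#   raid_state_code "$status"
#   exit $?
```

With this mapping, a status of Optimal gives exit code 0 and the text 'Optimal', which matches the 'Additional Info' field in the email.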

What caused the problem?

I don't know what caused the problem, but I do know some of the symptoms:

The box did not respond to pings.
telnet to port 22 returned the standard SSH banner.
Attempts to ssh in failed; no login prompt appeared.
The console was sluggish.
Pressing ALT-F3 to switch to another virtual terminal did nothing at first. When I returned to the console some minutes later, it had switched (sluggish).
Attempts to log in on that terminal got no response, though it may just have been sluggish.
There were no entries in the logs.

After 20 or 30 minutes of trying to get the system going, I saw no other option and rebooted the box. I knew this would degrade the RAID array, which I had wanted to avoid.

It was suggested that one drive may have been experiencing an error. Hard drives try to recover from errors on their own, and the attempt can take a long time. The RAID card sees that the drive is busy and simply waits; no I/O occurs during this time. Western Digital has drives designed for RAID that feature TLER (Time Limited Error Recovery), which limits how long the drive spends on recovery. Such features have been available on SCSI drives for quite some time. For what it's worth, the drives I'm planning to buy for the Dual Opteron server will have TLER.
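
On many ATA drives this kind of recovery time limit is exposed through the SCT Error Recovery Control feature, which smartctl from smartmontools can query and set with its -l scterc option. The sketch below only builds the command line rather than running it. It was not part of the setup described here, the device name /dev/ada0 is a hypothetical example, and drives sitting behind a hardware RAID controller may not be reachable this way at all.

```sh
#!/bin/sh
# Sketch only: print (do not run) the smartctl command that would set the
# read and write error recovery limits on a drive.
# $1 = device node (hypothetical example: /dev/ada0)
# $2 = time limit in whole seconds
erc_command() {
    dev=$1
    secs=$2
    # smartctl takes the limits in tenths of a second, so 7 s becomes 70.
    tenths=$((secs * 10))
    echo "smartctl -l scterc,${tenths},${tenths} ${dev}"
}

# Usage:
#   erc_command /dev/ada0 7      # show the command for a 7 second limit
#   smartctl -l scterc /dev/ada0 # query the current setting, if supported
```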

Ideas? Suggestions? Comments? Please use the comments link to the right.
