The FreeBSD Diary (TM) | Providing practical examples since 1998
NetSaint plugin for 3Ware RAID card
18 August 2006
I recently set up a RAID10 array on a 9550SX-8LP controller donated by 3Ware. RAID is not a panacea. It is not a solve-all-your-problems, I-can-ignore-the-machine, backups-are-no-longer-required solution. You must monitor your RAID array[s] just like any other service on your machine.

I've been using NetSaint since 2001. Development of NetSaint is being continued under a new name: Nagios. I have no reason to move to Nagios, so I continue with NetSaint.

A previous article shows how I created a plug-in for another RAID card. I'll be using a similar approach for this plug-in. I have previously written about the 3Ware CLI and will be using the CLI as the foundation for this plug-in.
The plug-in components
I am assuming you already have NetSaint installed, configured, and operational. I documented my installation and that will help you get started. During the construction of this plug-in, we will deal with three main components of the NetSaint system. I have described the changes required in [brackets].
We have three units attached to this controller:

    # tw_cli info c0

    Unit  UnitType  Status         %Cmpl  Stripe  Size(GB)  Cache  AVerify  IgnECC
    ------------------------------------------------------------------------------
    u0    SPARE     OK             -      -       69.2404   -      OFF      -
    u1    SPARE     OK             -      -       69.2404   -      OFF      -
    u2    RAID-10   OK             -      64K     195.548   ON     OFF      OFF

    Port   Status           Unit   Size        Blocks        Serial
    ---------------------------------------------------------------
    p0     OK               u2     69.25 GB    145226112     WD-WMAKE23790
    p1     OK               u2     69.25 GB    145226112     WD-WMAKE23790
    p2     OK               u2     69.25 GB    145226112     WD-WMAKE23943
    p3     OK               u2     69.25 GB    145226112     WD-WMAKE23790
    p4     OK               u2     69.25 GB    145226112     WD-WMAKE23790
    p5     OK               u2     69.25 GB    145226112     WD-WMAKE23792
    p6     OK               u0     69.25 GB    145226112     WD-WMAKE23790
    p7     OK               u1     69.25 GB    145226112     WD-WMAKE23786

    Name  OnlineState  BBUReady  Status    Volt     Temp     Hours  LastCapTest
    ---------------------------------------------------------------------------
    bbu   On           Yes       OK        OK       OK       0      xx-xxx-xxxx

What I'd like to do is monitor the status of those three units. We could get all fancy and create a generic plug-in that would work with any number of units, ports, and drives. But I'm not going to do that here. I'm going to create a script that will monitor u0, u1, and u2.
Getting the information we need
There is a command that will produce very concise status output:

    # tw_cli info c0 u0 status
    /c0/u0 status = OK

All we need to do is capture the fourth field in this output. That can be done easily with awk:

    # tw_cli info c0 u0 status | awk '{print $4}'
    OK

The above outputs the bare minimum. Perhaps we want more. Such as this:

    # tw_cli info c0 u0

    Unit     UnitType  Status         %Cmpl  Port  Stripe  Size(GB)  Blocks
    -----------------------------------------------------------------------
    u0       SPARE     OK             -      p6    -       69.2404   145207680

A slightly different approach will give better results. Consider this output:

    # tw_cli info c0 unitstatus

    Unit  UnitType  Status         %Cmpl  Stripe  Size(GB)  Cache  AVerify  IgnECC
    ------------------------------------------------------------------------------
    u0    SPARE     OK             -      -       69.2404   -      OFF      -
    u1    SPARE     OK             -      -       69.2404   -      OFF      -
    u2    RAID-10   OK             -      64K     195.548   ON     OFF      OFF

With this command, we have all the information we need for all units. This is a better approach.
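As a minimal sketch of how the unitstatus output might be reduced to just the fields of interest, the following shell function reads that output on standard input and emits one "(unit,type,status)" tuple per unit. The name parse_units is hypothetical and is not part of tw_cli; the sketch assumes unit rows always begin with "u" followed by digits, as in the output shown above.

```shell
#!/bin/sh
# Sketch only: condense `tw_cli info c0 unitstatus` output (on stdin)
# into "(unit,type,status)" tuples on a single line.
# parse_units is a hypothetical helper name, not a tw_cli feature.
parse_units() {
    awk '
        # Unit rows start with "u" plus digits; the header row, the dashed
        # separator, and blank lines do not match and are skipped.
        $1 ~ /^u[0-9]+$/ { printf "(%s,%s,%s)", $1, $2, $3; found = 1 }
        # If no unit row was seen at all, say so instead of printing nothing.
        END { if (!found) printf "no units?"; printf "\n" }
    '
}
```

On the server it would be used as `tw_cli info c0 unitstatus | parse_units`. Matching on the first column avoids depending on how many header lines tw_cli prints.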
What about drive removal?
The above examples are from the normal situation. What happens if we remove a drive? Here is the output if I remove drive 6 (u0 - SPARE):

    # tw_cli info c0 unitstatus

    Unit  UnitType  Status         %Cmpl  Stripe  Size(GB)  Cache  AVerify  IgnECC
    ------------------------------------------------------------------------------
    u1    SPARE     OK             -      -       69.2404   -      OFF      -
    u2    RAID-10   OK             -      64K     195.548   ON     OFF      OFF

Unit u0 is gone. If we query the controller, we get more information:

    # tw_cli info c0

    Unit  UnitType  Status         %Cmpl  Stripe  Size(GB)  Cache  AVerify  IgnECC
    ------------------------------------------------------------------------------
    u1    SPARE     OK             -      -       69.2404   -      OFF      -
    u2    RAID-10   OK             -      64K     195.548   ON     OFF      OFF

    Port   Status           Unit   Size        Blocks        Serial
    ---------------------------------------------------------------
    p0     OK               u2     69.25 GB    145226112     WD-WMAKE23790
    p1     OK               u2     69.25 GB    145226112     WD-WMAKE23790
    p2     OK               u2     69.25 GB    145226112     WD-WMAKE23943
    p3     OK               u2     69.25 GB    145226112     WD-WMAKE23790
    p4     OK               u2     69.25 GB    145226112     WD-WMAKE23790
    p5     OK               u2     69.25 GB    145226112     WD-WMAKE23792
    p6     DRIVE-REMOVED    -      -           -             -
    p7     OK               u1     69.25 GB    145226112     WD-WMAKE23786

    Name  OnlineState  BBUReady  Status    Volt     Temp     Hours  LastCapTest
    ---------------------------------------------------------------------------
    bbu   On           Yes       OK        OK       OK       0      xx-xxx-xxxx

We can see that drive 6 has been removed. We must code our script accordingly. If I replace the drive, the status returns to normal. In addition, I found these entries in /var/log/messages:

    twa0: WARNING: (0x04: 0x0019): Drive removed: port=6
    twa0: INFO: (0x04: 0x001A): Drive inserted: port=6

That's OK for a spare. What about a drive from the RAID array? Let's try drive 4.

    # tw_cli info c0 unitstatus

    Unit  UnitType  Status         %Cmpl  Stripe  Size(GB)  Cache  AVerify  IgnECC
    ------------------------------------------------------------------------------
    u0    SPARE     OK             -      -       69.2404   -      OFF      -
    u1    SPARE     OK             -      -       69.2404   -      OFF      -
    u2    RAID-10   DEGRADED       -      64K     195.548   OFF    OFF      OFF

Good. That's what one would expect.
Further inquiries show that one of the hot spares has been taken into production:

    # tw_cli info c0

    Unit  UnitType  Status         %Cmpl  Stripe  Size(GB)  Cache  AVerify  IgnECC
    ------------------------------------------------------------------------------
    u1    SPARE     OK             -      -       69.2404   -      OFF      -
    u2    RAID-10   REBUILD-PAUSED 66     64K     195.548   OFF    OFF      OFF

    Port   Status           Unit   Size        Blocks        Serial
    ---------------------------------------------------------------
    p0     OK               u2     69.25 GB    145226112     WD-WMAKE23790
    p1     OK               u2     69.25 GB    145226112     WD-WMAKE23790
    p2     OK               u2     69.25 GB    145226112     WD-WMAKE23943
    p3     OK               u2     69.25 GB    145226112     WD-WMAKE23790
    p4     DRIVE-REMOVED    -      -           -             -
    p5     OK               u2     69.25 GB    145226112     WD-WMAKE23792
    p6     DEGRADED         u2     69.25 GB    145226112     WD-WMAKE23790
    p7     OK               u1     69.25 GB    145226112     WD-WMAKE23786

    Name  OnlineState  BBUReady  Status    Volt     Temp     Hours  LastCapTest
    ---------------------------------------------------------------------------
    bbu   On           Yes       OK        OK       OK       0      xx-xxx-xxxx

Checking a few minutes later, you can see that it's rebuilding:

    # tw_cli info c0 unitstatus

    Unit  UnitType  Status         %Cmpl  Stripe  Size(GB)  Cache  AVerify  IgnECC
    ------------------------------------------------------------------------------
    u1    SPARE     OK             -      -       69.2404   -      OFF      -
    u2    RAID-10   REBUILDING     69     64K     195.548   OFF    OFF      OFF

I'll just go plug drive 4 back in....
Now we see this in /var/log/messages:

    Aug 14 11:50:23 opti kernel: twa0: WARNING: (0x04: 0x0019): Drive removed: port=4
    Aug 14 11:50:23 opti kernel: twa0: ERROR: (0x04: 0x0002): Degraded unit: unit=2, port=4
    Aug 14 11:53:01 opti kernel: twa0: INFO: (0x04: 0x000B): Rebuild started: unit=2
    Aug 14 11:56:11 opti kernel: twa0: INFO: (0x04: 0x001A): Drive inserted: port=4

After the unit had finished rebuilding, the drive status looked like this:

    # tw_cli info c0 drivestatus

    Port   Status           Unit   Size        Blocks        Serial
    ---------------------------------------------------------------
    p0     OK               u2     69.25 GB    145226112     WD-WMAKE23790
    p1     OK               u2     69.25 GB    145226112     WD-WMAKE23790
    p2     OK               u2     69.25 GB    145226112     WD-WMAKE23943
    p3     OK               u2     69.25 GB    145226112     WD-WMAKE23790
    p4     OK               u?     69.25 GB    145226112     WD-WMAKE23790
    p5     OK               u2     69.25 GB    145226112     WD-WMAKE23792
    p6     OK               u2     69.25 GB    145226112     WD-WMAKE23790
    p7     OK               u1     69.25 GB    145226112     WD-WMAKE23786

Note that drive 4 is part of unit u?. I issued a rescan and then saw this:

    # tw_cli info c0 drivestatus

    Port   Status           Unit   Size        Blocks        Serial
    ---------------------------------------------------------------
    p0     OK               u2     69.25 GB    145226112     WD-WMAKE23790
    p1     OK               u2     69.25 GB    145226112     WD-WMAKE23790
    p2     OK               u2     69.25 GB    145226112     WD-WMAKE23943
    p3     OK               u2     69.25 GB    145226112     WD-WMAKE23790
    p4     OK               u0     69.25 GB    145226112     WD-WMAKE23790
    p5     OK               u2     69.25 GB    145226112     WD-WMAKE23792
    p6     OK               u2     69.25 GB    145226112     WD-WMAKE23790
    p7     OK               u1     69.25 GB    145226112     WD-WMAKE23786

Good, now it's back on u0. Let's look at the unit status:

    # tw_cli info c0 unitstatus

    Unit  UnitType  Status         %Cmpl  Stripe  Size(GB)  Cache  AVerify  IgnECC
    ------------------------------------------------------------------------------
    u0    RAID-10   INOPERABLE     -      64K     195.548   OFF    OFF      OFF
    u1    SPARE     OK             -      -       69.2404   -      OFF      -
    u2    RAID-10   OK             -      64K     195.548   ON     OFF      OFF

OK, well, that's not ideal, but it does reflect what was going on.
What we need to do is delete that unit and re-add it as a hot spare:

    # tw_cli
    //opti> /c0/u0 delete
    Error: (CLI:038) Invalid unit command.

    //opti> /c0/u0 del
    Deleting /c0/u0 will cause the data on the unit permanently loss.
    Do you want to continue ? Y|N [N]: y
    Deleting unit c0/u0 ...Done.

    //opti> /c0 add type=spare disk=4
    Creating new unit on controller /c0 ... Failed.
    Error: (API:0012) Disk is member of un-exported unit.

    //opti> /c0/p4 export
    Exporting /c0/p4 will take the disk offline.
    Do you want to continue ? Y|N [N]: yes
    Exporting port /c0/p4 ... Done.

    //opti>

Ummm, what? This was a problem. For several hours. Eventually, I upgraded the firmware. Then things started working as expected:

    //opti/c0> /c0 add type=spare disk=3
    Creating new unit on controller /c0 ... Done. The new unit is /c0/u1.

    //opti/c0> /c0 add type=spare disk=6
    Creating new unit on controller /c0 ... Done. The new unit is /c0/u2.

    //opti/c0>

You will note that in the above I used disks 3 and 6, and not 4 as in a previous example. This is because I did several disk removals and adds during the diagnosis of this problem. The full text of everything I did is available here (41KB).

It is interesting to note that as hot spares were taken up, units were renumbered. The RAID-10 array started as u2. It is now u0.
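The drive removal tests above show that a problem can appear in the port table (DRIVE-REMOVED, DEGRADED) even when a unit has simply vanished from the unit table. One way to code for that, shown here only as a sketch, is to list every port whose status is not OK. The function name bad_ports is hypothetical, and the sketch assumes the drivestatus layout shown above, with the port in the first column and its status in the second.

```shell
#!/bin/sh
# Sketch only: read `tw_cli info c0 drivestatus` output on stdin and
# print "port=status" for every port that is not reporting OK.
# bad_ports is a hypothetical helper name, not a tw_cli feature.
bad_ports() {
    awk '
        # Port rows start with "p" plus digits; headers and separators do not.
        $1 ~ /^p[0-9]+$/ && $2 != "OK" { printf "%s=%s\n", $1, $2 }
    '
}
```

An empty result means every listed port reported OK. Because units are renumbered as hot spares are taken up, checking ports may be a useful complement to checking fixed unit names.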
How to process the data
What use is it having the information if you don't know what to do with it? I wasn't sure how to use all this data. I looked at how the disk checking routine handled it. If you look at the output, it's pretty simple:

    sub raid3ware {
        my $controller = shift;
        my $unitlisting;

        my $command = "$commandlist{$os}{raid3ware}";
        $command =~ s/XXX/$controller/g;

        open(PROCOUT, "$command |") || die;
        # discard the first line of output
        $_ = <PROCOUT>;
        while ($_ = <PROCOUT>) {
            # unit rows: keep the unit, its type, and its status
            if (/^(u\S+)\s+(\S+)\s*(\S+)\s+(\S+)\s+(\S+)\s+(\S+)\s+(\S+)\s+(\S+)\s+(\S+)\s*/) {
                $unitlisting .= '(' . $1 . ',' . $2 . ',' . $3 .')';
            }
        }

        if (defined($unitlisting)) {
            print Client $unitlisting;
        } else {
            print Client "no units?";
        }

        $unitlisting = undef;
        close(PROCOUT);
    }

That fancy regex, which I spent considerable time playing with, can probably be replaced with a call to split(). If you want the patch for the above, it is based upon netsaint_statd_v2.15 and is available here.

This function is designed to sit on the RAID server (the one with the 3Ware card). It will be invoked by the netsaint_statd daemon. The script accepts one parameter: the controller id (usually c0 if you have just one controller). In the next section, I'll show you how I pass that parameter from NetSaint to the script on the server.

I wrote the above function while sitting in a Second Cup cafe, waiting for a couple of women to finish their massages next door at The Spa. What's unusual about that? Nothing in particular, except that I had no access to the server from that location (at least not without paying for WiFi, which I was not going to do).
So I faked the call to tw_cli by creating this perl script:

    $ cat /home/dan/src/netsaint_statd/test.pl
    #!/usr/bin/perl

    $ans = "
    Unit  UnitType  Status         %Cmpl  Stripe  Size(GB)  Cache  AVerify  IgnECC
    ------------------------------------------------------------------------------
    u0    SPARE     OK             -      -       69.2404   -      OFF      -
    u1    SPARE     OK             -      -       69.2404   -      OFF      -
    u2    RAID-10   OK             -      64K     195.548   ON     OFF      OFF
    ";

    print $ans;

With this handy little script, I was able to develop the NetSaint side of the script quickly and easily. I was also able to modify the Status fields without actually having to affect the server. Useful!

I will show you how to modify netsaint_statd later. Next, how do we modify NetSaint?
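The same stand-in idea can be sketched in shell for anyone testing an awk-based pipeline instead. The function fake_tw_cli and the variable FAKE_U2_STATUS are both hypothetical names invented for this illustration; the canned output is the unitstatus listing shown above, with the status of u2 made adjustable so that failure states can be simulated without touching the server.

```shell
#!/bin/sh
# Sketch only: a stand-in for `tw_cli info c0 unitstatus` that prints
# canned output. fake_tw_cli and FAKE_U2_STATUS are hypothetical names.
# Set FAKE_U2_STATUS (for example to DEGRADED) to simulate a failure.
fake_tw_cli() {
    cat <<EOF

Unit  UnitType  Status         %Cmpl  Stripe  Size(GB)  Cache  AVerify  IgnECC
------------------------------------------------------------------------------
u0    SPARE     OK             -      -       69.2404   -      OFF      -
u1    SPARE     OK             -      -       69.2404   -      OFF      -
u2    RAID-10   ${FAKE_U2_STATUS:-OK}  -  64K  195.548  ON  OFF  OFF
EOF
}
```

Column alignment on the u2 row is deliberately loose, since anything that splits on whitespace will not care.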
Getting the data from the server into netsaint_statd
netsaint_statd is the daemon that can be installed on remote systems. NetSaint talks to this daemon to extract information from the remote system. In this context, remote can mean on the same LAN/WAN, etc.

We need a script that will query the server and grab the 3Ware information from it. This is it:

    #!/usr/bin/perl
    #
    # See LICENSE for copyright information
    #
    # check_3wareraid.pl <host>
    #
    # NetSaint host script to get the 3ware RAID status from a client that is running
    # netsaint_statd.
    #

    require 5.003;
    BEGIN { $ENV{PATH} = '/bin' }
    use Socket;
    use POSIX;

    sub usage;

    my $TIMEOUT = 15;

    my %ERRORS = ('UNKNOWN', '-1', 'OK', '0', 'WARNING', '1', 'CRITICAL', '2');

    my $remote     = shift || &usage(%ERRORS);
    my $controller = shift || &usage(%ERRORS);
    my $unitarg    = shift || &usage(%ERRORS);
    my $port       = shift || 1040;

    my $remoteaddr = inet_aton("$remote");
    my $paddr = sockaddr_in($port, $remoteaddr) || die "Can't create info for connection: $!\n";
    my $proto = getprotobyname('tcp');
    socket(Server, PF_INET, SOCK_STREAM, $proto) || die "Can't create socket: $!";
    setsockopt(Server, SOL_SOCKET, SO_REUSEADDR, 1);
    connect(Server, $paddr) || die "Can't connect to server: $!";

    my $state  = "OK";
    my $answer = undef;

    # Just in case of problems, let's not hang NetSaint
    $SIG{'ALRM'} = sub {
        close(Server);
        select(STDOUT);
        print "No Answer from Client\n";
        exit $ERRORS{"UNKNOWN"};
    };
    alarm($TIMEOUT);

    #print "invoking Server with:raid3wareunits $controller\n";

    select(Server);
    $| = 1;
    print Server "raid3wareunits $controller\n";
    my ($servanswer) = <Server>;
    alarm(0);
    close(Server);
    select(STDOUT);

    chomp($servanswer);

    #print "REPLY: '$servanswer'\n";

    # the reply looks like (u0,SPARE,OK)(u1,SPARE,OK)...; split it into units
    $servanswer =~ s/\(//g;
    my @servanswer = split(/\)/,$servanswer);

    # assume the worst until the requested unit is found
    $answer = 'not found';
    $state  = 'CRITICAL';
    foreach $line (@servanswer) {
        my ($unit, $name, $status) = split(/,/, $line);
        if ($unit eq $unitarg) {
            if ($status =~ m%^REBUILDING%) {
                $state  = "WARNING";
                $answer = $status;
            } else {
                if ($status =~ m%^DEGRADED%) {
                    $state  = "CRITICAL";
                    $answer = $status;
                } else {
                    if ($status =~ m%^OK%) {
                        $state  = "OK";
                        $answer = $status;
                    } else {
                        $answer = $status;
                        $state  = "CRITICAL";
                    }
                }
            }
        }
    }

    print $answer;
    exit $ERRORS{$state};

    sub usage {
        print "Minimum arguments not supplied!\n";
        print "\n";
        print "Perl Check Users plugin for NetSaint\n";
        print "Copyright (c) 1999 Charlie Cook & Nick Reinking\n";
        print "Copyright (c) 2006 Dan Langille\n";
        print "\n";
        print "Usage: $0 <host> <controller> <unit>\n";
        print "\n";
        exit $ERRORS{"UNKNOWN"};
    }

Look for this line:

    print Server "raid3ware $controller\n";

That is the line that tells netsaint_statd to invoke the raid3ware command and pass it the $controller parameter.

This script is based upon one I found included with the NetSaint plug-ins. I used it as a base and went from there. I can't recall which script I started with, but they all have a very similar structure.

I placed this script at /usr/local/libexec/netsaint/netsaint_statd/ on my NetSaint server.
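The nested if/else in the script above boils down to a small lookup from unit status to NetSaint state. As an illustration only, the same mapping can be written as a shell case statement. The function name state_for_status is hypothetical; the return codes mirror the %ERRORS table in the Perl script (OK is 0, WARNING is 1, CRITICAL is 2).

```shell
#!/bin/sh
# Sketch only: map a 3Ware unit status to a NetSaint state name and
# return code, mirroring the Perl plug-in's logic.
# state_for_status is a hypothetical helper name.
state_for_status() {
    case "$1" in
        OK*)         echo "OK";       return 0 ;;
        REBUILDING*) echo "WARNING";  return 1 ;;
        # DEGRADED, INOPERABLE, REBUILD-PAUSED, a missing unit,
        # or anything unrecognised is treated as critical.
        *)           echo "CRITICAL"; return 2 ;;
    esac
}
```

Defaulting to CRITICAL means an unexpected status from the controller raises an alarm instead of being silently reported as healthy.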
Configuring NetSaint server to use the plug-in
In this section, I'll show you how I modified my NetSaint server installation to add monitoring support for the 3Ware plug-in. The files to be modified are:

    command[check_raid3ware.pl]=$USER1$/netsaint_statd/check_3wareraidunits.pl \
        $HOSTADDRESS$ $ARG1$ $ARG2$ $ARG3$ $ARG4$

In hosts.cfg, here are the entries that relate to monitoring the 3Ware RAID on the dual Opteron server:

    service[opti]=RAID spare 1;0;24x7;3;2;1;raid-admins;120;24x7;1;1;1;;check_raid3ware.pl! c0 u1
    service[opti]=RAID spare 2;0;24x7;3;2;1;raid-admins;120;24x7;1;1;1;;check_raid3ware.pl! c0 u2
    service[opti]=RAID array ;0;24x7;3;2;1;raid-admins;120;24x7;1;1;1;;check_raid3ware.pl! c0 u0

After issuing a /usr/local/etc/rc.d/netsaint.sh reload command, and waiting a short while for NetSaint to run its queries, I found the following in my NetSaint monitoring website:
Testing the plug-in
It is one thing to write a script, test it, and put it into production. It is an entirely different thing to test it in production. I tested this in production by removing drives from the server. This server has hot-swappable drives. By removing one, I can verify that the plug-in is working as expected.

Here is NetSaint after I removed a spare drive:

I then replaced the drive I had removed and issued this command:

    # tw_cli rescan
    Rescanning controller /c0 for units and drives ...Done.
    Found the following unit(s): [/c0/u2].
    Found the following drive(s): [none].

A short time later, NetSaint reported all was well.

Next, I removed a drive from the RAID cluster. NetSaint then displayed this:

At this point in time, NetSaint noticed that the RAID array was degraded because of the missing drive. The 3Ware controller had already pulled a hot spare into the array. NetSaint hadn't yet polled the spare. A short time later, NetSaint was reporting this:

The above shows that a hot spare has been pulled into the array and that the array is rebuilding. With a bit more work, the plugin could display the percentage completed.

Now it's time to put the pulled drive back into the system, delete the contents, and add it back in as a hot spare. I knew it was drive 1, so the following scan confirms what I knew:

    # tw_cli rescan
    Rescanning controller /c0 for units and drives ...Done.
    Found the following unit(s): [/c0/u1].
    Found the following drive(s): [none].
I can see what is rebuilding:

    # tw_cli info c0 u0

    Unit     UnitType  Status         %Cmpl  Port  Stripe  Size(GB)  Blocks
    -----------------------------------------------------------------------
    u0       RAID-10   REBUILDING     87     -     64K     195.548   410093568
    u0-0     RAID-1    REBUILDING     61     -     -       -         -
    u0-0-0   DISK      OK             -      p0    -       65.1826   136697856
    u0-0-1   DISK      DEGRADED       -      p6    -       65.1826   136697856
    u0-1     RAID-1    OK             -      -     -       -         -
    u0-1-0   DISK      OK             -      p2    -       65.1826   136697856
    u0-1-1   DISK      OK             -      p4    -       65.1826   136697856
    u0-2     RAID-1    OK             -      -     -       -         -
    u0-2-0   DISK      OK             -      p3    -       65.1826   136697856
    u0-2-1   DISK      OK             -      p5    -       65.1826   136697856

Drive 6 (p6 as shown above) has been pulled into the cluster. By issuing a unitstatus command, I can confirm that u1 is inoperable. That would be the drive I just removed and replaced.

    # tw_cli info c0 unitstatus

    Unit  UnitType  Status         %Cmpl  Stripe  Size(GB)  Cache  AVerify  IgnECC
    ------------------------------------------------------------------------------
    u0    RAID-10   REBUILDING     90     64K     195.548   OFF    OFF      OFF
    u1    RAID-10   INOPERABLE     -      64K     195.548   OFF    OFF      OFF
    u2    SPARE     OK             -      -       69.2404   -      OFF      -

Since this drive came from the array, it contains metadata that identifies it as part of the array. That metadata needs to be erased before the 3Ware controller will accept it as a hot spare. With the following commands, I delete the inoperable unit, add the drive (from that unit) back as a hot spare, and then list the unitstatus.

    # tw_cli /c0/u1 del
    Deleting /c0/u1 will cause the data on the unit permanently loss.
    Do you want to continue ? Y|N [N]: y
    Deleting unit c0/u1 ...Done.

    # tw_cli /c0 add type=spare disk=1
    Creating new unit on controller /c0 ... Done. The new unit is /c0/u1.

    # tw_cli info c0 unitstatus

    Unit  UnitType  Status         %Cmpl  Stripe  Size(GB)  Cache  AVerify  IgnECC
    ------------------------------------------------------------------------------
    u0    RAID-10   OK             -      64K     195.548   ON     OFF      OFF
    u1    SPARE     OK             -      -       69.2404   -      OFF      -
    u2    SPARE     OK             -      -       69.2404   -      OFF

NetSaint then displayed this status:

All is well.
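This section mentions that the plug-in could, with a bit more work, display the percentage completed during a rebuild. A rough sketch of that idea follows, assuming the unitstatus layout shown above where %Cmpl is the fourth column and holds "-" when nothing is in progress. The function name status_with_progress is hypothetical.

```shell
#!/bin/sh
# Sketch only: read `tw_cli info c0 unitstatus` output on stdin and print
# the status of one unit, with the %Cmpl value appended when present.
# status_with_progress is a hypothetical helper name.
# Usage: tw_cli info c0 unitstatus | status_with_progress u0
status_with_progress() {
    awk -v unit="$1" '
        $1 == unit {
            # %Cmpl is the fourth column; "-" means nothing is in progress.
            if ($4 != "-") printf "%s %s%%\n", $3, $4
            else           printf "%s\n", $3
            found = 1
        }
        # A unit that is missing from the listing is reported as such.
        END { if (!found) print "not found" }
    '
}
```

Fed the rebuilding output above, a call for u0 would yield the status followed by the completion figure, which could then be passed along as the plug-in's answer text.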
A final word
RAID is not a backup. Be sure to perform backups on your RAID arrays as you would for any computer without RAID. Monitor your RAID array and watch for problems. Fix them as soon as you can. Following this strategy, your RAID solution should provide you with less downtime as a result of any disk failure. Enjoy.