The FreeBSD Diary

Providing practical examples since 1998

NetSaint plugin for 3Ware RAID card 18 August 2006

I recently set up a RAID10 array on a 9550SX-8LP controller donated by 3Ware. RAID is not a panacea. It is not a solve-all-your-problems, I-can-ignore-the-machine, backups-are-no-longer-required solution. You must monitor your RAID array[s] just like any other service on your machine. I've been using NetSaint since 2001. Development of NetSaint is being continued under a new name: Nagios. I have no reason to move to Nagios, so I continue with NetSaint.

A previous article shows how I created a plug-in for another RAID card. I'll be using a similar approach for this plug-in. I have previously written about the 3Ware CLI interface and will be using the CLI as the foundation for this plug-in.

The plug-in components

I am assuming you already have NetSaint installed, configured, and operational. I documented my installation and that will help you get started. During the construction of this plug-in, we will deal with three main components of the NetSaint system. I have described the changes required in [brackets].

netsaint_statd - which provides remote monitoring of hosts [patch this code so it knows about the 3Ware script]
commands.cfg - specifies what services should be monitored by NetSaint [add the RAID service]
a new script, for use by netsaint_statd, which pulls data from the 3Ware CLI [write the script]

We have three units attached to this controller:

# tw_cli info c0

Unit  UnitType  Status         %Cmpl  Stripe  Size(GB)  Cache  AVerify  IgnECC
------------------------------------------------------------------------------
u0    SPARE     OK             -      -       69.2404   -      OFF      -
u1    SPARE     OK             -      -       69.2404   -      OFF      -
u2    RAID-10   OK             -      64K     195.548   ON     OFF      OFF

Port   Status           Unit   Size        Blocks        Serial
---------------------------------------------------------------
p0     OK               u2     69.25 GB    145226112     WD-WMAKE23790
p1     OK               u2     69.25 GB    145226112     WD-WMAKE23790
p2     OK               u2     69.25 GB    145226112     WD-WMAKE23943
p3     OK               u2     69.25 GB    145226112     WD-WMAKE23790
p4     OK               u2     69.25 GB    145226112     WD-WMAKE23790
p5     OK               u2     69.25 GB    145226112     WD-WMAKE23792
p6     OK               u0     69.25 GB    145226112     WD-WMAKE23790
p7     OK               u1     69.25 GB    145226112     WD-WMAKE23786

Name  OnlineState  BBUReady  Status    Volt     Temp     Hours  LastCapTest
---------------------------------------------------------------------------
bbu   On           Yes       OK        OK       OK       0      xx-xxx-xxxx

What I'd like to do is monitor the status of those three units. We could get all fancy and create a generic plug-in that would work with any number of units, ports, and drives. But I'm not going to do that here. I'm going to create a script that will monitor u0, u1, and u2.

Getting the information we need

There is a command that will produce very concise status output:

# tw_cli info c0 u0 status
/c0/u0 status = OK

#

All we need to do is capture the fourth field in this output. That can be done easily with awk:

# tw_cli info c0 u0 status | awk '{print $4}'
OK

#
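The same extraction can be done in Perl, which is the language the rest of the plug-in uses. This is only a minimal sketch: it assumes the line always looks like the one shown above, and the helper name unit_status_from_line is made up for illustration.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical helper: pull the status word out of a line such as
# "/c0/u0 status = OK". Returns undef if the line does not match.
sub unit_status_from_line {
    my ($line) = @_;
    return $line =~ m{^/c\d+/u\d+\s+status\s+=\s+(\S+)} ? $1 : undef;
}

# Sample line copied from the tw_cli output above.
my $status = unit_status_from_line("/c0/u0 status = OK");
print defined $status ? "$status\n" : "no status found\n";
```

Anchoring the pattern on the /cN/uN prefix means a blank line or an error message yields undef instead of a stray word.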

The above outputs the bare minimum. Perhaps we want more, such as this:

# tw_cli info c0 u0

Unit     UnitType  Status         %Cmpl  Port  Stripe  Size(GB)  Blocks
-----------------------------------------------------------------------
u0       SPARE     OK             -      p6    -       69.2404   145207680

#

A slightly different approach will give better results. Consider this output:

# tw_cli info c0 unitstatus

Unit  UnitType  Status         %Cmpl  Stripe  Size(GB)  Cache  AVerify  IgnECC
------------------------------------------------------------------------------
u0    SPARE     OK             -      -       69.2404   -      OFF      -
u1    SPARE     OK             -      -       69.2404   -      OFF      -
u2    RAID-10   OK             -      64K     195.548   ON     OFF      OFF

With this command, we have all the information we need for all units. This is a better approach.
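To give an idea of how little work it takes to use this listing, here is a rough sketch that turns the unitstatus text into a hash keyed by unit name. The function name parse_unitstatus is hypothetical, and the sketch assumes unit rows always begin with a lowercase u followed by digits, as in the listing above.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical helper: map each unit name to its type and status,
# given the text printed by "tw_cli info c0 unitstatus".
sub parse_unitstatus {
    my ($text) = @_;
    my %units;
    for my $line (split /\n/, $text) {
        # Unit rows start with a lowercase "u" followed by digits;
        # the header ("Unit ...") and the dashed rule do not.
        next unless $line =~ /^u\d+\s/;
        my ($unit, $type, $status) = split ' ', $line;
        $units{$unit} = { type => $type, status => $status };
    }
    return \%units;
}

# Sample text copied from the listing above.
my $sample = <<'END';

Unit  UnitType  Status         %Cmpl  Stripe  Size(GB)  Cache  AVerify  IgnECC
------------------------------------------------------------------------------
u0    SPARE     OK             -      -       69.2404   -      OFF      -
u1    SPARE     OK             -      -       69.2404   -      OFF      -
u2    RAID-10   OK             -      64K     195.548   ON     OFF      OFF
END

my $units = parse_unitstatus($sample);
for my $unit (sort keys %$units) {
    print "$unit $units->{$unit}{type} $units->{$unit}{status}\n";
}
```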

What about drive removal?

The above examples are from the normal situation. What happens if we remove a drive? Here is the output if I remove drive 6 (u0, a SPARE):

# tw_cli info c0 unitstatus

Unit  UnitType  Status         %Cmpl  Stripe  Size(GB)  Cache  AVerify  IgnECC
------------------------------------------------------------------------------
u1    SPARE     OK             -      -       69.2404   -      OFF      -
u2    RAID-10   OK             -      64K     195.548   ON     OFF      OFF

Unit u0 is gone. If we query the controller, we get more information:

# tw_cli info c0

Unit  UnitType  Status         %Cmpl  Stripe  Size(GB)  Cache  AVerify  IgnECC
------------------------------------------------------------------------------
u1    SPARE     OK             -      -       69.2404   -      OFF      -
u2    RAID-10   OK             -      64K     195.548   ON     OFF      OFF

Port   Status           Unit   Size        Blocks        Serial
---------------------------------------------------------------
p0     OK               u2     69.25 GB    145226112     WD-WMAKE23790
p1     OK               u2     69.25 GB    145226112     WD-WMAKE23790
p2     OK               u2     69.25 GB    145226112     WD-WMAKE23943
p3     OK               u2     69.25 GB    145226112     WD-WMAKE23790
p4     OK               u2     69.25 GB    145226112     WD-WMAKE23790
p5     OK               u2     69.25 GB    145226112     WD-WMAKE23792
p6     DRIVE-REMOVED    -      -           -             -
p7     OK               u1     69.25 GB    145226112     WD-WMAKE23786

Name  OnlineState  BBUReady  Status    Volt     Temp     Hours  LastCapTest
---------------------------------------------------------------------------
bbu   On           Yes       OK        OK       OK       0      xx-xxx-xxxx

We can see that drive 6 has been removed. We must code our script accordingly. If I replace the drive, the status returns to normal. In addition, I found these entries in /var/log/messages:

twa0: WARNING: (0x04: 0x0019): Drive removed: port=6
twa0: INFO: (0x04: 0x001A): Drive inserted: port=6
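Because a removed spare simply vanishes from the unit listing, one way to code for it is to compare the units we expect against the units actually reported. The sketch below is an illustration only: missing_units is a made-up name, and the list of expected units is something you would supply yourself.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical helper: return the expected units that do not appear
# in the text printed by "tw_cli info c0 unitstatus".
sub missing_units {
    my ($text, @expected) = @_;
    my %seen;
    for my $line (split /\n/, $text) {
        $seen{$1} = 1 if $line =~ /^(u\d+)\s/;
    }
    return grep { !$seen{$_} } @expected;
}

# Listing copied from above, taken after drive 6 was removed.
my $after_removal = <<'END';

Unit  UnitType  Status         %Cmpl  Stripe  Size(GB)  Cache  AVerify  IgnECC
------------------------------------------------------------------------------
u1    SPARE     OK             -      -       69.2404   -      OFF      -
u2    RAID-10   OK             -      64K     195.548   ON     OFF      OFF
END

my @missing = missing_units($after_removal, qw(u0 u1 u2));
print "missing: @missing\n" if @missing;
```

A unit that is absent should probably be reported as a problem rather than silently skipped.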

That's OK for a spare. What about a drive from the RAID array? Let's try drive 4.

# tw_cli info c0 unitstatus

Unit  UnitType  Status         %Cmpl  Stripe  Size(GB)  Cache  AVerify  IgnECC
------------------------------------------------------------------------------
u0    SPARE     OK             -      -       69.2404   -      OFF      -
u1    SPARE     OK             -      -       69.2404   -      OFF      -
u2    RAID-10   DEGRADED       -      64K     195.548   OFF    OFF      OFF

Good. That's what one would expect. Further inquiries show that one of the hot spares has been taken into production:

# tw_cli info c0

Unit  UnitType  Status         %Cmpl  Stripe  Size(GB)  Cache  AVerify  IgnECC
------------------------------------------------------------------------------
u1    SPARE     OK             -      -       69.2404   -      OFF      -
u2    RAID-10   REBUILD-PAUSED 66     64K     195.548   OFF    OFF      OFF

Port   Status           Unit   Size        Blocks        Serial
---------------------------------------------------------------
p0     OK               u2     69.25 GB    145226112     WD-WMAKE23790
p1     OK               u2     69.25 GB    145226112     WD-WMAKE23790
p2     OK               u2     69.25 GB    145226112     WD-WMAKE23943
p3     OK               u2     69.25 GB    145226112     WD-WMAKE23790
p4     DRIVE-REMOVED    -      -           -             -
p5     OK               u2     69.25 GB    145226112     WD-WMAKE23792
p6     DEGRADED         u2     69.25 GB    145226112     WD-WMAKE23790
p7     OK               u1     69.25 GB    145226112     WD-WMAKE23786

Name  OnlineState  BBUReady  Status    Volt     Temp     Hours  LastCapTest
---------------------------------------------------------------------------
bbu   On           Yes       OK        OK       OK       0      xx-xxx-xxxx

Checking a few minutes later, you can see that it's rebuilding:

# tw_cli info c0 unitstatus

Unit  UnitType  Status         %Cmpl  Stripe  Size(GB)  Cache  AVerify  IgnECC
------------------------------------------------------------------------------
u1    SPARE     OK             -      -       69.2404   -      OFF      -
u2    RAID-10   REBUILDING     69     64K     195.548   OFF    OFF      OFF
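The %Cmpl column is easy to pick up along with the status. As a sketch (the name status_with_progress is hypothetical, and it assumes %Cmpl is always the fourth whitespace-separated field of a unit row), the two can be combined like this:

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical helper: given one unit row from the unitstatus listing,
# return the status, with the %Cmpl figure appended when it is numeric.
sub status_with_progress {
    my ($line) = @_;
    my ($unit, $type, $status, $pct) = split ' ', $line;
    return $status unless defined $pct && $pct =~ /^\d+$/;
    return "$status $pct%";
}

# Rows copied from the listing above.
print status_with_progress('u2    RAID-10   REBUILDING     69     64K     195.548   OFF    OFF      OFF'), "\n";
print status_with_progress('u1    SPARE     OK             -      -       69.2404   -      OFF      -'), "\n";
```

A dash in the %Cmpl column leaves the status untouched, so healthy units still read as a plain OK.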

I'll just go plug drive 4 back in....

Now we see this in /var/log/messages:

Aug 14 11:50:23 opti kernel: twa0: WARNING: (0x04: 0x0019): Drive removed: port=4
Aug 14 11:50:23 opti kernel: twa0: ERROR: (0x04: 0x0002): Degraded unit: unit=2, port=4
Aug 14 11:53:01 opti kernel: twa0: INFO: (0x04: 0x000B): Rebuild started: unit=2
Aug 14 11:56:11 opti kernel: twa0: INFO: (0x04: 0x001A): Drive inserted: port=4

After the unit has finished rebuilding, the drive status looked like this:

# tw_cli info c0 drivestatus
Port   Status           Unit   Size        Blocks        Serial
---------------------------------------------------------------
p0     OK               u2     69.25 GB    145226112     WD-WMAKE23790
p1     OK               u2     69.25 GB    145226112     WD-WMAKE23790
p2     OK               u2     69.25 GB    145226112     WD-WMAKE23943
p3     OK               u2     69.25 GB    145226112     WD-WMAKE23790
p4     OK               u?     69.25 GB    145226112     WD-WMAKE23790
p5     OK               u2     69.25 GB    145226112     WD-WMAKE23792
p6     OK               u2     69.25 GB    145226112     WD-WMAKE23790
p7     OK               u1     69.25 GB    145226112     WD-WMAKE23786

Note that drive 4 is part of unit u?. I issued a rescan and then saw this:

# tw_cli info c0 drivestatus

Port   Status           Unit   Size        Blocks        Serial
---------------------------------------------------------------
p0     OK               u2     69.25 GB    145226112     WD-WMAKE23790
p1     OK               u2     69.25 GB    145226112     WD-WMAKE23790
p2     OK               u2     69.25 GB    145226112     WD-WMAKE23943
p3     OK               u2     69.25 GB    145226112     WD-WMAKE23790
p4     OK               u0     69.25 GB    145226112     WD-WMAKE23790
p5     OK               u2     69.25 GB    145226112     WD-WMAKE23792
p6     OK               u2     69.25 GB    145226112     WD-WMAKE23790
p7     OK               u1     69.25 GB    145226112     WD-WMAKE23786

Good, now it's back on u0. Let's look at the unit status:

# tw_cli info c0 unitstatus

Unit  UnitType  Status         %Cmpl  Stripe  Size(GB)  Cache  AVerify  IgnECC
------------------------------------------------------------------------------
u0    RAID-10   INOPERABLE     -      64K     195.548   OFF    OFF      OFF
u1    SPARE     OK             -      -       69.2404   -      OFF      -
u2    RAID-10   OK             -      64K     195.548   ON     OFF      OFF

OK, well, that's not ideal, but it does reflect what was going on. What we need to do is delete that unit and re-add it as a hot spare:

# tw_cli
//opti> /c0/u0 delete

Error: (CLI:038) Invalid unit command.
//opti> /c0/u0 del
Deleting /c0/u0 will cause the data on the unit permanently loss.
Do you want to continue ? Y|N [N]: y
Deleting unit c0/u0 ...Done.


//opti> /c0 add type=spare disk=4
Creating new unit on controller /c0 ...  Failed.

Error: (API:0012) Disk is member of un-exported unit.
//opti> /c0/p4 export
Exporting /c0/p4 will take the disk offline.
Do you want to continue ? Y|N [N]: yes
Exporting port /c0/p4 ... Done.


//opti>

Ummm, what? This was a problem. For several hours. Eventually, I upgraded the firmware. Then things started working as expected:

#
//opti/c0> /c0 add type=spare disk=3
Creating new unit on controller /c0 ...  Done. The new unit is /c0/u1.

//opti/c0> /c0 add type=spare disk=6
Creating new unit on controller /c0 ...  Done. The new unit is /c0/u2.

//opti/c0>

You will note that in the above I used disks 3 and 6, and not 4 as in a previous example. This is because I did several disk removals and adds during the diagnosis of this problem. The full text of everything I did is available here (41KB). It is interesting to note that as hot spares were taken up, units were renumbered. The RAID-10 array started as u2. It is now u0.
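Since unit numbers can change, a check that is tied to a fixed unit number may end up watching a different unit after a spare is taken up. One possible way around that, sketched here as an idea and not something the plug-in in this article does, is to look units up by type. The name units_of_type is hypothetical, and the sample rows are in the same format as the earlier listings, arranged to match the renumbered state just described.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical helper: return the names of all units of a given
# UnitType found in the text of "tw_cli info c0 unitstatus".
sub units_of_type {
    my ($text, $type) = @_;
    my @found;
    for my $line (split /\n/, $text) {
        my ($unit, $unit_type) = split ' ', $line;
        push @found, $unit
            if defined $unit_type && $unit =~ /^u\d+$/ && $unit_type eq $type;
    }
    return @found;
}

# Illustrative rows: the RAID-10 array is now u0, the spares u1 and u2.
my $renumbered = <<'END';

Unit  UnitType  Status         %Cmpl  Stripe  Size(GB)  Cache  AVerify  IgnECC
------------------------------------------------------------------------------
u0    RAID-10   OK             -      64K     195.548   ON     OFF      OFF
u1    SPARE     OK             -      -       69.2404   -      OFF      -
u2    SPARE     OK             -      -       69.2404   -      OFF      -
END

print "RAID-10 units: ", join(' ', units_of_type($renumbered, 'RAID-10')), "\n";
print "spare units:   ", join(' ', units_of_type($renumbered, 'SPARE')), "\n";
```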

How to process the data

What use is it having the information if you don't know what to do with it? I wasn't sure how to use all this data. I looked at how the disk checking routine handled it. If you look at the output, it's pretty simple:

sub raid3ware {
   my $controller = shift;

   my $unitlisting;
   # the command template contains XXX where the controller id goes
   my $command = "$commandlist{$os}{raid3ware}";
   $command =~ s/XXX/$controller/g;

   open(PROCOUT, "$command |") || die "Can't run $command: $!";
   # skip the first line of output (blank in the listings above)
   $_ = <PROCOUT>;
   while($_ = <PROCOUT>) {
      # unit rows have nine whitespace-separated fields; keep unit, type, and status
      if (/^(u\S+)\s+(\S+)\s+(\S+)\s+(\S+)\s+(\S+)\s+(\S+)\s+(\S+)\s+(\S+)\s+(\S+)\s*/) {
         $unitlisting .= '(' . $1 . ','  . $2 . ',' . $3 .')';
      }
   }
   if (defined($unitlisting)) {
      print Client $unitlisting;
   } else {
      print Client "no units?";
   }

   $unitlisting = undef;
   close(PROCOUT);
}

That fancy regex, which I spent considerable time playing with, can probably be replaced with a call to split(). If you want the patch for the above, it is based upon netsaint_statd_v2.15 and is available here.
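For what it's worth, here is roughly what the split() version might look like. Treat it as a sketch, not a drop-in replacement: format_unit_listing is a made-up name, it works on a list of lines instead of reading from tw_cli, and it accepts any row with at least three fields whose first field looks like a unit name, which is slightly looser than the nine-field regex.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical helper: build the "(unit,type,status)" string from the
# lines of a unitstatus listing, using split() instead of a long regex.
sub format_unit_listing {
    my (@lines) = @_;
    my $listing = '';
    for my $line (@lines) {
        my @f = split ' ', $line;
        # keep only rows whose first field looks like u0, u1, ...
        next unless @f >= 3 && $f[0] =~ /^u\d+$/;
        $listing .= "($f[0],$f[1],$f[2])";
    }
    return $listing eq '' ? 'no units?' : $listing;
}

# Sample text in the same format tw_cli prints.
my $sample = <<'END';

Unit  UnitType  Status         %Cmpl  Stripe  Size(GB)  Cache  AVerify  IgnECC
------------------------------------------------------------------------------
u0    SPARE     OK             -      -       69.2404   -      OFF      -
u1    SPARE     OK             -      -       69.2404   -      OFF      -
u2    RAID-10   OK             -      64K     195.548   ON     OFF      OFF
END

print format_unit_listing(split /\n/, $sample), "\n";
```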

The raid3ware function is designed to sit on the RAID server (the one with the 3Ware card). It will be invoked by the netsaint_statd daemon. It accepts one parameter: the controller id (usually c0 if you have just one controller). In the next section, I'll show you how I pass that parameter from NetSaint to the function on the server.

I wrote the above function while sitting in a Second Cup cafe, waiting for a couple of women to finish their massages next door at The Spa. What's unusual about that? Nothing in particular, except that I had no access to the server from that location (at least not without paying for WiFi, which I was not going to do). So I faked the call to tw_cli by creating this perl script:

$ cat /home/dan/src/netsaint_statd/test.pl
#!/usr/bin/perl

$ans = "
Unit  UnitType  Status         %Cmpl  Stripe  Size(GB)  Cache  AVerify  IgnECC
------------------------------------------------------------------------------
u0    SPARE     OK             -      -       69.2404   -      OFF      -
u1    SPARE     OK             -      -       69.2404   -      OFF      -
u2    RAID-10   OK             -      64K     195.548   ON     OFF      OFF

";

print $ans;

With this handy little script, I was able to develop the NetSaint side of the script quickly and easily. I was also able to modify the Status fields without actually having to affect the server. Useful!
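If editing the Status fields by hand gets tedious, the stub can build its output from a small table. This is only a sketch of that idea: fake_unitstatus is a hypothetical name, and the rows are the three units from the listing above.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical stub: return fake "unitstatus" text, replacing the
# Status column for any unit named in the arguments.
sub fake_unitstatus {
    my (%override) = @_;
    my @rows = (
        [qw(u0 SPARE   OK - -   69.2404 -  OFF -  )],
        [qw(u1 SPARE   OK - -   69.2404 -  OFF -  )],
        [qw(u2 RAID-10 OK - 64K 195.548 ON OFF OFF)],
    );
    my $out = "\n"
        . "Unit  UnitType  Status         %Cmpl  Stripe  Size(GB)  Cache  AVerify  IgnECC\n"
        . ('-' x 78) . "\n";
    for my $row (@rows) {
        my @f = @$row;
        $f[2] = $override{$f[0]} if exists $override{$f[0]};
        # column widths match the header line above
        $out .= sprintf("%-6s%-10s%-15s%-7s%-8s%-10s%-7s%-9s%s\n", @f);
    }
    return $out . "\n";
}

# Pretend the array is degraded without touching the real server.
print fake_unitstatus(u2 => 'DEGRADED');
```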

I will show you how to modify netsaint_statd later. Next, how do we modify NetSaint?

Getting the data from the server into netsaint_statd

netsaint_statd is the daemon that can be installed on remote systems. NetSaint talks to this daemon to extract information from the remote system. In this context, remote can mean on the same LAN/WAN, etc.

We need a script that will query the server and grab the 3Ware information from it. This is it:

#!/usr/bin/perl
#
# See LICENSE for copyright information
#
# check_3wareraid.pl <host> <controller> <unit> [port]
#
# NetSaint host script to get the 3ware RAID status from a client that is running
# netsaint_statd.
#

require 5.003;
BEGIN { $ENV{PATH} = '/bin' }
use Socket;
use POSIX;

sub usage;

my $TIMEOUT = 15;

my %ERRORS = ('UNKNOWN', '-1',
		'OK', '0',
		'WARNING', '1',
		'CRITICAL', '2');
my $remote     = shift || &usage(%ERRORS);
my $controller = shift || &usage(%ERRORS);
my $unitarg    = shift || &usage(%ERRORS);
my $port       = shift || 1040;

my $remoteaddr = inet_aton("$remote");
my $paddr      = sockaddr_in($port, $remoteaddr) || die "Can't create info for connection: $!\n";
my $proto      = getprotobyname('tcp');

socket(Server, PF_INET, SOCK_STREAM, $proto) || die "Can't create socket: $!";
setsockopt(Server, SOL_SOCKET, SO_REUSEADDR, 1);
connect(Server, $paddr) || die "Can't connect to server: $!";

my $state = "OK";
my $answer = undef;

# Just in case of problems, let's not hang NetSaint
$SIG{'ALRM'} = sub { 
     close(Server);
     select(STDOUT);
     print "No Answer from Client\n";
     exit $ERRORS{"UNKNOWN"};
};
alarm($TIMEOUT);

#print "invoking Server with:raid3wareunits $controller\n";

select(Server);
$| = 1;

print Server "raid3wareunits $controller\n";
my ($servanswer) = <Server>;
alarm(0);
close(Server);
select(STDOUT);

chomp($servanswer);

#print "REPLY: '$servanswer'\n";

$servanswer =~ s/\(//g;
my @servanswer = split(/\)/,$servanswer);

$answer = 'not found';
$state  = 'CRITICAL';

foreach my $line (@servanswer) {
	my ($unit, $name, $status) = split(/,/, $line);
	next unless $unit eq $unitarg;

	# whatever the unit reports becomes the text shown in NetSaint
	$answer = $status;
	if ($status =~ m%^REBUILDING%) {
		$state = "WARNING";
	} elsif ($status =~ m%^OK%) {
		$state = "OK";
	} else {
		# DEGRADED and anything unrecognized are treated as CRITICAL
		$state = "CRITICAL";
	}
}


print $answer;
exit $ERRORS{$state};

sub usage {
	print "Minimum arguments not supplied!\n";
	print "\n";
	print "Perl 3Ware RAID unit plugin for NetSaint\n";
	print "Copyright (c) 1999 Charlie Cook & Nick Reinking\n";
	print "Copyright (c) 2006 Dan Langille\n";
	print "\n";
	print "Usage: $0 <host> <controller> <unit> [port]\n";
	print "\n";
	exit $ERRORS{"UNKNOWN"};
}

Look for this line:

print Server "raid3wareunits $controller\n";

That is the line that tells netsaint_statd to invoke the raid3wareunits command and pass it the $controller parameter.

This script is based upon one I found included with the NetSaint plug-ins. I used it as a base and went from there. I can't recall which script I started with, but they all have a very similar structure.

I placed this script at /usr/local/libexec/netsaint/netsaint_statd/ on my NetSaint server.
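The part of the script that matters most is the mapping from the server's reply to a NetSaint state. Here is that logic pulled out into a small standalone function so it can be tried without a network connection. It is a sketch and not the script itself: classify_reply and %STATE_FOR are hypothetical names, and it matches status words exactly where the script above matches on prefixes.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Status words with a known meaning; anything else is CRITICAL.
my %STATE_FOR = (
    OK         => 'OK',
    REBUILDING => 'WARNING',
    DEGRADED   => 'CRITICAL',
);

# Hypothetical helper: given a reply such as "(u0,SPARE,OK)(u1,SPARE,OK)"
# and a unit name, return the NetSaint state and the text to display.
sub classify_reply {
    my ($reply, $wanted) = @_;
    while ($reply =~ /\(([^,()]*),([^,()]*),([^,()]*)\)/g) {
        my ($unit, $status) = ($1, $3);
        next unless $unit eq $wanted;
        my $state = exists $STATE_FOR{$status} ? $STATE_FOR{$status} : 'CRITICAL';
        return ($state, $status);
    }
    # a unit that is not listed at all is treated as a failure
    return ('CRITICAL', 'not found');
}

my ($state, $answer) =
    classify_reply('(u0,SPARE,OK)(u1,SPARE,OK)(u2,RAID-10,REBUILDING)', 'u2');
print "$state: $answer\n";
```

Defaulting to CRITICAL means an unfamiliar status such as INOPERABLE still gets attention.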

Configuring NetSaint server to use the plug-in

In this section, I'll show you how I modified my NetSaint server installation to add monitoring support for the 3Ware plug-in. The files to be modified are:

/usr/local/etc/netsaint/commands.cfg - add the new commands
/usr/local/etc/netsaint/hosts.cfg - add the new host and services to be monitored

In commands.cfg, I added this line to the end of the file (in the netsaint_statd remote commands section). The line is shown below split into two lines to make it easier to read; remove the \ and put it all on one line before pasting it into your commands.cfg file.

command[check_raid3ware.pl]=$USER1$/netsaint_statd/check_3wareraidunits.pl \
     $HOSTADDRESS$ $ARG1$ $ARG2$ $ARG3$ $ARG4$

In hosts.cfg, here are the entries that relate to monitoring the 3Ware RAID on the dual opteron server:

service[opti]=RAID spare 1;0;24x7;3;2;1;raid-admins;120;24x7;1;1;1;;check_raid3ware.pl! c0 u1
service[opti]=RAID spare 2;0;24x7;3;2;1;raid-admins;120;24x7;1;1;1;;check_raid3ware.pl! c0 u2
service[opti]=RAID array  ;0;24x7;3;2;1;raid-admins;120;24x7;1;1;1;;check_raid3ware.pl! c0 u0
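To see how the pieces fit together, it helps to picture the command line NetSaint ends up running. The sketch below imitates the macro substitution in plain Perl. It is an illustration of the idea, not NetSaint's own code: expand_command is a made-up name, the value of $USER1$ is assumed to be /usr/local/libexec/netsaint, and 192.0.2.10 is a placeholder address.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical helper: expand $MACRO$ names in a command template.
# The text after each "!" in the check command becomes $ARG1$, $ARG2$...
sub expand_command {
    my ($template, $check, %macro) = @_;
    my ($name, @args) = split /!/, $check;
    for my $i (1 .. 4) {
        $macro{"ARG$i"} = defined $args[$i - 1] ? $args[$i - 1] : '';
    }
    $template =~ s/\$(\w+)\$/exists $macro{$1} ? $macro{$1} : ''/ge;
    # split on whitespace, the way a shell would split plain words
    return split ' ', $template;
}

my @argv = expand_command(
    '$USER1$/netsaint_statd/check_3wareraidunits.pl $HOSTADDRESS$ $ARG1$ $ARG2$ $ARG3$ $ARG4$',
    'check_raid3ware.pl! c0 u1',
    USER1       => '/usr/local/libexec/netsaint',   # assumed value
    HOSTADDRESS => '192.0.2.10',                    # placeholder address
);
print join(' ', @argv), "\n";
```

Note that " c0 u1" arrives as a single $ARG1$, yet the script still sees the controller and the unit as two separate arguments.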

After issuing a /usr/local/etc/rc.d/netsaint.sh reload command, and waiting a short while for NetSaint to run its queries, I found the following in my NetSaint monitoring website:

Testing the plug-in

It is one thing to write a script, test it, and put it into production. It is an entirely different thing to test it in production. I tested this in production by removing drives from the server. This server has hot-swappable drives. By removing one, I can verify that the plug-in is working as expected.

Here is NetSaint after I removed a spare drive:

I then replaced the drive I had removed and issued this command:

# tw_cli rescan
Rescanning controller /c0 for units and drives ...Done.
Found the following unit(s): [/c0/u2].
Found the following drive(s): [none].

A short time later, NetSaint reported all was well.

Next, I removed a drive from the RAID cluster. NetSaint then displayed this:

At this point in time, NetSaint noticed that the RAID array was degraded because of the missing drive. The 3Ware controller had already pulled a hot spare into the array. NetSaint hadn't yet polled the spare. A short time later, NetSaint was reporting this:

The above shows that a hot spare has been pulled into the array and that the array is rebuilding. With a bit more work, the plug-in could display the percentage completed. Now it's time to put the pulled drive back into the system, delete the contents, and add it back in as a hot spare. I knew it was drive 1, so the following scan confirms what I knew:

# tw_cli rescan
Rescanning controller /c0 for units and drives ...Done.
Found the following unit(s): [/c0/u1].
Found the following drive(s): [none].

I can see what is rebuilding:

# tw_cli info c0 u0

Unit     UnitType  Status         %Cmpl  Port  Stripe  Size(GB)  Blocks
-----------------------------------------------------------------------
u0       RAID-10   REBUILDING     87     -     64K     195.548   410093568
u0-0     RAID-1    REBUILDING     61     -     -       -         -
u0-0-0   DISK      OK             -      p0    -       65.1826   136697856
u0-0-1   DISK      DEGRADED       -      p6    -       65.1826   136697856
u0-1     RAID-1    OK             -      -     -       -         -
u0-1-0   DISK      OK             -      p2    -       65.1826   136697856
u0-1-1   DISK      OK             -      p4    -       65.1826   136697856
u0-2     RAID-1    OK             -      -     -       -         -
u0-2-0   DISK      OK             -      p3    -       65.1826   136697856
u0-2-1   DISK      OK             -      p5    -       65.1826   136697856

Drive 6 (p6 as shown above) has been pulled into the cluster. By issuing a unitstatus command, I can confirm that u1 is inoperable. That would be the drive I just removed and replaced.

# tw_cli info c0 unitstatus

Unit  UnitType  Status         %Cmpl  Stripe  Size(GB)  Cache  AVerify  IgnECC
------------------------------------------------------------------------------
u0    RAID-10   REBUILDING     90     64K     195.548   OFF    OFF      OFF
u1    RAID-10   INOPERABLE     -      64K     195.548   OFF    OFF      OFF
u2    SPARE     OK             -      -       69.2404   -      OFF      -

Since this drive came from the array, it contains metadata that identifies it as part of the array. That metadata needs to be erased before the 3Ware controller will accept it as a hot spare. With the following commands, I delete the inoperable unit, add the drive (from that unit) back as a hot spare, and then list the unitstatus.

# tw_cli /c0/u1 del
Deleting /c0/u1 will cause the data on the unit permanently loss.
Do you want to continue ? Y|N [N]: y
Deleting unit c0/u1 ...Done.


# tw_cli /c0 add type=spare disk=1
Creating new unit on controller /c0 ...  Done. The new unit is /c0/u1.

# tw_cli info c0 unitstatus
 
Unit  UnitType  Status         %Cmpl  Stripe  Size(GB)  Cache  AVerify  IgnECC
------------------------------------------------------------------------------
u0    RAID-10   OK             -      64K     195.548   ON     OFF      OFF
u1    SPARE     OK             -      -       69.2404   -      OFF      -
u2    SPARE     OK             -      -       69.2404   -      OFF
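Spotting the leftover unit can be automated even if the cleanup itself is best done by hand. The following sketch lists INOPERABLE units and prints the delete command as a reminder; it does not run anything. The name inoperable_units is hypothetical, the controller is assumed to be c0, and the sample text is the listing taken before the cleanup.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical helper: return the names of units whose status is
# INOPERABLE in the text of "tw_cli info c0 unitstatus".
sub inoperable_units {
    my ($text) = @_;
    my @units;
    for my $line (split /\n/, $text) {
        my ($unit, $type, $status) = split ' ', $line;
        push @units, $unit
            if defined $status && $unit =~ /^u\d+$/ && $status eq 'INOPERABLE';
    }
    return @units;
}

# Listing copied from above, taken before the cleanup.
my $before_cleanup = <<'END';

Unit  UnitType  Status         %Cmpl  Stripe  Size(GB)  Cache  AVerify  IgnECC
------------------------------------------------------------------------------
u0    RAID-10   REBUILDING     90     64K     195.548   OFF    OFF      OFF
u1    RAID-10   INOPERABLE     -      64K     195.548   OFF    OFF      OFF
u2    SPARE     OK             -      -       69.2404   -      OFF      -
END

# Print a reminder only; deleting a unit destroys its data.
for my $unit (inoperable_units($before_cleanup)) {
    print "$unit is INOPERABLE; it could be removed with: tw_cli /c0/$unit del\n";
}
```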

NetSaint then displayed this status:

All is well.

A final word

RAID is not a backup. Be sure to back up your RAID arrays as you would any computer without RAID. Monitor your RAID array and watch for problems. Fix them as soon as you can. Following this strategy, your RAID solution should give you less downtime when a disk fails.

Enjoy.
