NetSaint - creating a plug-in for RAID monitoring
11 August 2006
This article originally appeared in OnLAMP. In a previous article I talked about my RAID-5 installation. It has been up and running for a few days now, and I'm pleased with the result. However, RAID can fail. When it fails, you need to take action before the next failure: two failures close together, no matter how rare that may be, will mean a complete reinstall[1]. I have been using NetSaint since first writing about it back in 2001. You will notice that NetSaint development has continued under a new name, Nagios. For me, I continue to use NetSaint; it does what I need. The monitoring consists of three main components:

- raidutil, from the sysutils/asr-utils port, which queries the Adaptec controller
- netsaint_statd, running on the RAID machine and patched to report the array status
- NetSaint itself, with the service, command, and contact definitions for RAID alerts
With these simple tools, you'll be able to monitor your RAID array.

[1] For my setup at least. I'm sure that you might know of RAID setups that allow for multiple failures, but mine does not.
Monitoring the array
Monitoring the health of your RAID array is vital to the health of your system. Fortunately, Adaptec has a tool for this; it is available within the FreeBSD sysutils/asr-utils port. After installing the port, it took me a while to figure out what to use and how to use it. The problem was compounded by a run-time error which took me down a little tangent before I could get it running. I will show you how to integrate this utility into your NetSaint configuration. My first few attempts at running the monitoring tool failed:

# /usr/local/sbin/raidutil -L all
After some Googling, I found this reference. The problem was shared memory. It seems that with PostgreSQL running, raidutil could not get the shared memory it wanted. These are the relevant kernel options:

# grep SHM /usr/src/sys/i386/conf/LINT
options         SYSVSHM                 # include support for shared memory
options         SHMMAXPGS=1025          # max amount of shared memory pages (4k on i386)
options         SHMALL=1025             # max number of shared memory pages system wide
options         SHMMAX="(SHMMAXPGS*PAGE_SIZE+1)"
options         SHMMIN=2                # min shared memory segment size (bytes)
options         SHMMNI=33               # max number of shared memory identifiers
options         SHMSEG=9                # max shared memory segments per process

These kernel options are also available as sysctl values:

$ sysctl -a | grep shm
kern.ipc.shmmax: 33554432
kern.ipc.shmmin: 1
kern.ipc.shmmni: 192
kern.ipc.shmseg: 128
kern.ipc.shmall: 8192
kern.ipc.shm_use_phys: 0
kern.ipc.shm_allow_removed: 0
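If you would rather raise the limits themselves, the runtime-tunable values can go in /etc/sysctl.conf. This is only a sketch with example numbers; some of these knobs are boot-time tunables that must be set in the kernel config or loader.conf instead, depending on your FreeBSD release:

# /etc/sysctl.conf -- example values only; size these for your own workload
kern.ipc.shmmax=67108864    # largest shared memory segment, in bytes (64 MB)
kern.ipc.shmall=16384       # total shared memory pages system wide (4k pages)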
I started playing with PostgreSQL to free up shared memory, signalling the postmaster:

kill -HUP `cat /usr/local/pgsql/data/postmaster.pid`
Now that raidutil runs, here is the full output it produces:

$ sudo raidutil -L all
RAIDUTIL Version: 3.04 Date: 9/27/2000 FreeBSD CLI Configuration Utility
Adaptec ENGINE Version: 3.04 Date: 9/27/2000 Adaptec FreeBSD SCSI Engine

#  b0 b1 b2 Controller  Cache FW   NVRAM    Serial       Status
---------------------------------------------------------------------------
d0 -- -- -- ADAP2400A   16MB  3A0L CHNL 1.1 BF0B111Z0B4  Optimal

Physical View
Address    Type                Manufacturer/Model   Capacity  Status
---------------------------------------------------------------------------
d0b0t0d0   Disk Drive (DASD)   ST380011 A           76319MB   Optimal
d0b1t0d0   Disk Drive (DASD)   ST380011 A           76319MB   Optimal
d0b2t0d0   Disk Drive (DASD)   ST380011 A           76319MB   Replaced Drive
d0b3t0d0   Disk Drive (DASD)   ST380011 A           76319MB   Optimal

Logical View
Address    Type                Manufacturer/Model   Capacity  Status
---------------------------------------------------------------------------
d0b0t0d0   RAID 5 (Redundant   ADAPTEC RAID-5       228957MB  Reconstruct 94%
d0b0t0d0   Disk Drive (DASD)   ST380011 A           76319MB   Optimal
d0b1t0d0   Disk Drive (DASD)   ST380011 A           76319MB   Optimal
d0b2t0d0   Disk Drive (DASD)   ST380011 A           76319MB   Replaced Drive
d0b3t0d0   Disk Drive (DASD)   ST380011 A           76319MB   Optimal

Address    Max Speed   Actual Rate / Width
---------------------------------------------------------------------------
d0b0t0d0   50 MHz      100 MB/sec wide
d0b1t0d0   50 MHz      100 MB/sec wide
d0b2t0d0   50 MHz      100 MB/sec wide
d0b3t0d0   10 MHz      100 MB/sec wide

Address    Manufacturer/Model   Write Cache Mode (HBA/Device)
---------------------------------------------------------------------------
d0b0t0d0   ADAPTEC RAID-5       Write Back / --
d0b0t0d0   ST380011 A           -- / Write Back
d0b1t0d0   ST380011 A           -- / Write Back
d0b2t0d0   ST380011 A           -- / Write Back
d0b3t0d0   ST380011 A           -- / Write Back

#  Controller  Cache FW   NVRAM    BIOS  SMOR      Serial
---------------------------------------------------------------------------
d0 ADAP2400A   16MB  3A0L CHNL 1.1 1.62  1.12/79I  BF0B111Z0B4

#  Controller  Status      Voltage  Current  Full Cap  Rem Cap  Rem Time
---------------------------------------------------------------------------
d0 ADAP2400A   No battery

Address    Manufacturer/Model   FW    Serial    123456789012
---------------------------------------------------------------------------
d0b0t0d0   ST380011 A           3.06  1ABW6AY1  -X-XX--X-O--
d0b1t0d0   ST380011 A           3.06  1ABEYH4P  -X-XX--X-O--
d0b2t0d0   ST380011 A           3.06  1ABRWK0E  -X-XX--X-O--
d0b3t0d0   ST380011 A           3.06  1ABRDS5E  -X-XX--X-O--

Capabilities Map:
Column  1 = Soft Reset
Column  2 = Cmd Queuing
Column  3 = Linked Cmds
Column  4 = Synchronous
Column  5 = Wide 16
Column  6 = Wide 32
Column  7 = Relative Addr
Column  8 = SCSI II
Column  9 = S.M.A.R.T.
Column  0 = SCAM
Column  1 = SCSI-3
Column  2 = SAF-TE
X = Capability Exists, - = Capability does not exist, O = Not Supported

The output shows the controller itself, the physical drives, the logical RAID-5 array, bus speeds, write cache settings, firmware revisions, battery status, and each drive's capabilities.
It is a subset of this information which we will use to determine whether or not all is well with the RAID array. Our next task will be experimentation to determine what raidutil reports when the array is in various states.
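For example, the status of the logical array can be pulled out on its own. The sed expression here is my own shortcut, keyed to the capacity column, not something supplied by the asr-utils port:

# /usr/local/bin/raidutil -L logical | sed -n 's/.*MB[[:space:]]*//p'
Reconstruct 94%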
Know your RAID
I'm sure that each RAID utility will have different responses to different situations. I am about to investigate what raidutil reports for each state of my array.
Normal
Here is what raidutil reports when the array is healthy:

# /usr/local/bin/raidutil -L logical
Address    Type                Manufacturer/Model   Capacity  Status
---------------------------------------------------------------------------
d0b0t0d0   RAID 5 (Redundant   ADAPTEC RAID-5       228957MB  Optimal

Degraded
I shut down the system, removed the power from one drive, then rebooted. Here is what raidutil reported:

# /usr/local/bin/raidutil -L logical
Address    Type                Manufacturer/Model   Capacity  Status
---------------------------------------------------------------------------
d0b0t0d0   RAID 5 (Redundant   ADAPTE RAID-5        228957MB  Degraded

This is the normal situation when a disk has died or, in this case, been removed from the array.
After I add the disk back in, the array needs to be resynchronized.

Reconstruction
You can also use raidutil to start the rebuilding process. This will sync up the degraded drive with the rest of the array. This can be a lengthy process, but it is vital.
The rebuilding can be started with this command:

/usr/local/bin/raidutil -a rebuild d0 d0b0t0d0

Where d0 is the controller and d0b0t0d0 is the address of the logical RAID-5 array (see the Logical View above).
After rebuilding has started, this is what raidutil reports:

# /usr/local/bin/raidutil -L logical
Address    Type                Manufacturer/Model   Capacity  Status
---------------------------------------------------------------------------
d0b0t0d0   RAID 5 (Redundant   ADAPTE RAID-5        228957MB  Reconstruct 0%

The percentage will slowly creep up until all disks are resynced.
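If you want to keep an eye on the rebuild from a shell, a quick ad-hoc loop does the job; the five-minute interval is arbitrary:

#!/bin/sh
# Poll the logical array status until interrupted with Ctrl-C.
while :; do
    /usr/local/bin/raidutil -L logical
    sleep 300
done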
Using netsaint_statd
The scripts for monitoring remote machines are supplied with netsaint_statd, which should be installed on every machine you wish to monitor. I downloaded the netsaint_statd tarball and untarred it into the directory /usr/local/libexec/netsaint/netsaint_statd on my RAID machine. Strictly speaking, the check_*.pl scripts do not need to be on the RAID machine, only netsaint_statd itself; you can remove them if you want. I have them only on the NetSaint machine.
I use the following script to start it up at boot time:

$ less /usr/local/etc/rc.d/netsaint_statd.sh
#!/bin/sh
case "$1" in
start)
        /usr/local/libexec/netsaint/netsaint_statd/netsaint_statd
        ;;
esac
exit 0

Then I started up the script:

# /usr/local/etc/rc.d/netsaint_statd.sh start
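To confirm the daemon actually came up and is listening, you can look for it with sockstat; matching on perl is an assumption that works because the daemon is a Perl script:

# sockstat -4l | grep perl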
The RAID machine now has the daemon it needs. This post by RevDigger is the basis for what I did to set up the NetSaint side of things. I installed the check_*.pl scripts on the NetSaint machine. Now that NetSaint has the tools, you need to tell it about them. I added this to the end of my NetSaint command configuration:

# netsaint_statd remote commands
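The command definitions themselves follow NetSaint's command[name]=command_line format. As a rough sketch only (the script names, paths, and arguments below are my assumptions; match them to whatever your copy of netsaint_statd shipped with):

command[check_rload]=/usr/local/libexec/netsaint/netsaint_statd/check_load.pl $HOSTADDRESS$ $ARG1$
command[check_rusers]=/usr/local/libexec/netsaint/netsaint_statd/check_users.pl $HOSTADDRESS$ $ARG1$
command[check_rall_disks]=/usr/local/libexec/netsaint/netsaint_statd/check_all_disks.pl $HOSTADDRESS$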
Here are the entries I added to the service configuration for host polo:

service[polo]=LOAD;0;24x7;3;2;1;freebsd-admins;120;24x7;1;1;1;;check_rload!3
service[polo]=PROCS;0;24x7;3;2;1;freebsd-admins;120;24x7;1;1;1;;check_rprocs!
service[polo]=USERS;0;24x7;3;2;1;freebsd-admins;120;24x7;1;1;1;;check_rusers!4
service[polo]=DISKSALL;0;24x7;3;2;1;freebsd-admins;120;24x7;1;1;1;;check_rall_disks

Then I restarted NetSaint:

/usr/local/etc/rc.d/netsaint.sh restart

After the restart, I started to see those services on my NetSaint web page. This is great!
RAID Notification overview
Getting NetSaint to monitor my RAID array was not as simple as getting NetSaint to monitor a regular disk. I was already using netsaint_statd to monitor remote machines; I have them all set up so I can see load, process count, users, and disk space usage. I will extend that setup to report on the RAID array as well. This additional feature involves several distinct steps:

- write a Perl script, based on one of the existing check_*.pl scripts, that turns the raidutil status into a NetSaint result
- patch netsaint_statd on the RAID machine so the daemon can answer the new query
- add the RAID service, contact group, and contacts to the NetSaint configuration
RAID Perl script
As the basis for the Perl script, I used check_users.pl as supplied with netsaint_statd. If you look at the resulting script, you'll see that we're looking for the three major status values:

if ($servanswer =~ m%^Reconstruct%) {
    $state = "WARNING";
    $answer = $servanswer;
} else {
    if ($servanswer =~ m%^Degraded%) {
        $state = "CRITICAL";
        $answer = $servanswer;
    } else {
        if ($servanswer =~ m%^Optimal%) {
            $state = "OK";
            $answer = $servanswer;
        } else {
            $answer = $servanswer;
            $state = "CRITICAL";
        }
    }
}

I have decided that Degraded and unknown results will be CRITICAL, Optimal will be OK, and that Reconstruction will be a WARNING.
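The same logic reads a little more naturally as an elsif chain; this is just a restatement, with no change in behaviour:

if ($servanswer =~ m%^Optimal%) {
    $state  = "OK";
    $answer = $servanswer;
} elsif ($servanswer =~ m%^Reconstruct%) {
    $state  = "WARNING";
    $answer = $servanswer;
} elsif ($servanswer =~ m%^Degraded%) {
    $state  = "CRITICAL";
    $answer = $servanswer;
} else {
    # anything we do not recognize is treated as a failure
    $state  = "CRITICAL";
    $answer = $servanswer;
}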
The next step is to modify netsaint_statd so that it can answer a RAID status query.

netsaint_statd patch
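I won't walk through the patch line by line, but the idea is simple: when asked for the RAID status, the daemon runs raidutil and hands back the status field of the logical array. The subroutine below is only my own illustration of that idea, not the actual patch; the name and the parsing are assumptions:

# Illustration only: extract the logical array status for the daemon to return.
sub raid_status {
    my $raidutil = '/usr/local/bin/raidutil';
    foreach my $line (`$raidutil -L logical 2>&1`) {
        # The logical view line ends with the status, e.g.
        # "d0b0t0d0  RAID 5 (Redundant  ADAPTEC RAID-5  228957MB  Optimal"
        if ($line =~ m%(\d+)MB\s+(.+?)\s*$%) {
            return $2;    # "Optimal", "Degraded", or "Reconstruct NN%"
        }
    }
    return 'UNKNOWN';
}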
The patch adds the RAID check to the daemon. Apply it in the daemon's directory on the RAID machine:

cd /usr/local/libexec/netsaint/netsaint_statd
patch < path.to.patch.you.downloaded

Now that you have modified the daemon, you need to kill it and restart it:

# ps auwx | grep netsaint_statd
root 28778  0.0  0.5  3052  2460  ??  Ss  6:56PM  0:00.32 /usr/bin/perl /usr/local/libexec/netsaint/netsaint_statd/netsaint_statd
# kill -TERM 28778
# /usr/local/etc/rc.d/netsaint_statd.sh start

Add RAID to the services monitored by NetSaint
Now we have the remote RAID box ready to tell us all about the RAID status. Now it's time to test it:

# cd /usr/local/libexec/netsaint/netsaint_statd
# perl check_adptraid.pl polo
Reconstruct 85%

That looks right to me! Now I'll show you what I added to NetSaint to use this new tool.
First, I'll add the service definition:

service[polo]=RAID;0;24x7;3;2;1;raid-admins;120;24x7;1;1;1;;check_adptraid.pl

I have set up a new contact group (raid-admins) because I want to be notified via text message to my cellphone when the RAID array has a problem. The contact group I created was:

contactgroup[raid-admins]=RAID Administrators;danphone,dan

In this case, I want contacts danphone and dan to be notified. Here are the contacts which relate to the above contact group (the lines below may be wrapped, but in NetSaint there should be only two lines):

contact[dan]=Dan Langille;24x7;24x7;1;1;0;1;1;0;notify-by-email;host-notify-by-email;dan;
contact[danphone]=Dan Langille;24x7;24x7;1;1;0;1;1;0;notify-xtrashort;notify-xtrashort;dan;6135551212@pcs.example.com;

This shows that I will be emailed and that an email will also be sent to my cellphone. After restarting NetSaint, I was able to see the new RAID service on my web page. If your RAID is really important to you, then you will definitely want to test the notification via cellphone. I did. I know it works. But I hope it never has to be used.
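The notify-xtrashort command referenced above is an ordinary NetSaint notification command kept very short so it fits in a text message; its definition is not shown here. A sketch of what such a command might look like, with the macros and wording being my assumptions:

command[notify-xtrashort]=/usr/bin/printf "%b" "$SERVICESTATE$ $HOSTALIAS$/$SERVICEDESC$" | /usr/bin/mail -s "netsaint" $CONTACTEMAIL$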
Got monitor?
I've said it before, and you're going to hear it again: RAID must be monitored if you are to achieve its full benefits. By using NetSaint and the above scripts, you should get plenty of time to replace a dead drive before the array is destroyed. That notification alone could save you several hours. Happy RAIDing.