The FreeBSD Diary (TM)
Providing practical examples since 1998
3Ware Nagios plugin
3 September 2010
I use Nagios to monitor my servers and workstations. If something goes wrong, Nagios usually tells me before I notice the problem myself. A week or so back, I noticed a rather odd RAID problem, which was eventually solved by upgrading the firmware on the controller. In the meantime, I had located and installed a Nagios 3ware plugin. I like it, and I'm using it on more than one server. However, now that I have turned on AUTO-VERIFY, I've found a spot where I can improve the plugin.
Verifying...!
Earlier today, I turned on AUTO-VERIFY for this controller. Tonight, Nagios is reporting:

    Status: UNKNOWN
    Status Information: UNKNOWN: /c0/u0 RAID-10 VERIFYING - 56% 64K 195.548 ON ON - /c0/u1 SPARE VERIFYING - 0% - 69.2404 - ON - /c0/u2 SPARE VERIFYING - 0% - 69.2404 - ON -

If I look at the status output, I see:

    $ sudo /usr/local/sbin/tw_cli info c0 u0
    Password:

    Unit     UnitType  Status     %RCmpl  %V/I/M  Port  Stripe  Size(GB)
    ------------------------------------------------------------------------
    u0       RAID-10   VERIFYING  -       62%     -     64K     195.548
    u0-0     RAID-1    VERIFYING  62%     -       -     -       -
    u0-0-0   DISK      OK         -       -       p0    -       65.1826
    u0-0-1   DISK      OK         -       -       p2    -       65.1826
    u0-1     RAID-1    VERIFYING  62%     -       -     -       -
    u0-1-0   DISK      OK         -       -       p6    -       65.1826
    u0-1-1   DISK      OK         -       -       p5    -       65.1826
    u0-2     RAID-1    VERIFYING  63%     -       -     -       -
    u0-2-0   DISK      OK         -       -       p3    -       65.1826
    u0-2-1   DISK      OK         -       -       p4    -       65.1826
    u0/v0    Volume    -          -       -       -     -       195.548

Now I'd rather have something other than UNKNOWN. Fortunately, I have the source.
The patch!
This is the patch:

    --- /usr/local/libexec/nagios/check_3ware.sh  2010-08-27 02:34:55.000000000 +0100
    +++ /home/dan/bin/check_3ware.sh  2010-09-02 01:08:39.000000000 +0100
    @@ -66,6 +66,12 @@
                 MSG="$MSG $STATUS -"
                 PREEXITCODE=1
                 ;;
    +        VERIFYING)
    +            CHECKUNIT=`$TWCLI info $i unitstatus | ${GREP} -E "${UNIT[$COUNT]}" | ${AWK} '{print $1,$3,$5}'`
    +            STATUS="/$i/$CHECKUNIT"
    +            MSG="$MSG $STATUS -"
    +            PREEXITCODE=1
    +            ;;
             DEGRADED)
                 CHECKUNIT=`$TWCLI info $i unitstatus | ${GREP} -E "${UNIT[$COUNT]}" | ${AWK} '{print $1,$3}'`
                 STATUS="/$i/$CHECKUNIT"

This is what it outputs:

    $ sudo ~/bin/check_3ware.sh
    WARNING: /c0/u0 VERIFYING 89% - /c0/u1 VERIFYING 0% - /c0/u2 VERIFYING 0% -

After replacing the original script, I get this output when testing it from the command line on the Nagios server:

    $ /usr/local/libexec/nagios/check_nrpe2 -H supernews-vpn -c check_3ware.sh
    WARNING: /c0/u0 VERIFYING 99% - /c0/u1 VERIFYING 1% - /c0/u2 VERIFYING 0% -

I now see this on my Nagios webpage:

    Status: WARNING
    Status Information: WARNING: /c0/u0 VERIFYING 99% - /c0/u1 VERIFYING 1% - /c0/u2 VERIFYING 0% -
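The new VERIFYING branch depends on the field positions in the unitstatus listing: field 1 is the unit, field 3 the status, and field 5 the verify percentage. As a rough sketch, the snippet below runs the same awk extraction against a sample line copied from the UNKNOWN status message earlier in this article, so it can be tried without a controller. The function name `summarize_unit` and the variables `sample` and `controller` are made up for illustration and are not part of the plugin.

```shell
#!/bin/sh
# Hypothetical helper, not part of check_3ware.sh.
# Reads one unitstatus-style line on stdin and prints "unit status percent",
# mirroring the awk '{print $1,$3,$5}' call used in the patch.
summarize_unit() {
    awk '{print $1, $3, $5}'
}

# Sample line taken from the Nagios status information quoted in the article.
sample='u0 RAID-10 VERIFYING - 56% 64K 195.548 ON ON'

# Assumed controller name, matching the article's output.
controller='c0'

summary=$(printf '%s\n' "$sample" | summarize_unit)

# Build the message fragment the same way the patch does: /controller/unit ...
printf 'WARNING: /%s/%s -\n' "$controller" "$summary"
# prints: WARNING: /c0/u0 VERIFYING 56% -
```

If the percentage ever shows up in a different column on another firmware or tw_cli version, only the awk field list would need to change.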
Other ideas
Tonight I started a battery test. The status immediately went to CRITICAL. That got me thinking about this patch:

    $ diff -ruN /usr/local/libexec/nagios/check_3ware.sh ~/bin/check_3ware.sh
    --- /usr/local/libexec/nagios/check_3ware.sh  2010-09-02 01:08:39.000000000 +0100
    +++ /home/dan/bin/check_3ware.sh  2010-09-02 02:52:39.000000000 +0100
    @@ -100,7 +100,7 @@
         # Check BBU's
         BBU=(`$TWCLI info $i |${GREP} -E "^bbu"|${AWK} '{print $1,$2,$3,$4,$5}'`)
         if [ "${BBU[0]}" = "bbu" ]; then
    -        if [ "${BBU[1]}" != "On" ] || [ "${BBU[2]}" != "Yes" ] || [ "${BBU[3]}" != "OK" ] || [ "${BBU[4]}" != "OK" ]; then
    +        if [ "${BBU[1]}" != "On" ] || [ "${BBU[2]}" != "Yes" ] || { [ "${BBU[3]}" != "OK" ] && [ "${BBU[3]}" != "Testing" ]; } || [ "${BBU[4]}" != "OK" ]; then
                 BBUEXITCODE=2
                 BBUERROR="BBU on $i failed"
             fi

I may also change the status for VERIFYING from WARNING to OK, because really, everything IS OK. The controller is merely running VERIFY.

FYI: I sent an email to the plugin author before I published this.
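The BBU idea is that a status of Testing should be accepted alongside OK. One way to write that in POSIX shell is to group two separate `[` tests with braces, since `&&` cannot appear inside a single `[ ... ]`. The sketch below isolates that condition in a function so it can be tried without a controller. The name `bbu_exit_code` and its argument order are assumptions for illustration, and the article does not show what values the other BBU fields take during a battery test.

```shell
#!/bin/sh
# Hypothetical helper, not part of check_3ware.sh.
# Arguments mirror the four BBU fields the plugin compares, in order
# (field meanings are assumed): online state, ready flag, status, voltage.
# Returns 0 when the BBU looks healthy or is testing, 2 (Nagios CRITICAL) otherwise.
bbu_exit_code() {
    online=$1 ready=$2 status=$3 volt=$4
    if [ "$online" != "On" ] || [ "$ready" != "Yes" ] ||
       { [ "$status" != "OK" ] && [ "$status" != "Testing" ]; } ||
       [ "$volt" != "OK" ]; then
        return 2
    fi
    return 0
}

# Try a few status values with the other fields held at their healthy values.
for s in OK Testing Fault; do
    rc=0
    bbu_exit_code On Yes "$s" OK || rc=$?
    echo "BBU status $s -> exit code $rc"
done
```

The braces matter because `||` and `&&` have equal precedence in the shell and are evaluated left to right; without the grouping, the Testing comparison would combine with the whole chain before it rather than only with the status check.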