Monitoring Dell Hardware with Nagios
We use the excellent Nagios network, host, and service monitoring software at the office to track the status of our servers, routers, and network devices and connections. The program works great and we love it. However, the one area we had not been able to track was the hardware status of our Dell PowerEdge servers, particularly those running Windows Server 2003. We’ve installed Dell’s OpenManage software on all the boxes and that works great, but we were not getting notified when something on a server failed (a power supply, a fan, or a disk in an array).
The status of a server can be retrieved over SNMP from OpenManage, so I knew it could be done; I just didn’t want to reinvent the wheel. I did some searching and came across three plugins. The first is simply called check_dell.pl. It checks the overall health of both the system and the array, and if either is non-OK it gives a warning. It is simple, quick, and effective, but I wanted additional reporting so that I would know which component was actually faulty.
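As a rough illustration of what reading a status over SNMP looks like, the sketch below pulls the value out of one line of snmpget output, the same way the script further down does with awk. The sample line is made up for illustration (shaped like net-snmp's default output), and the helper name snmp_value is hypothetical.

```shell
#!/bin/bash
# Hypothetical helper: print the last whitespace-separated field of each
# input line. For net-snmp's default "OID = TYPE: value" output, that
# field is the value itself.
snmp_value() {
    awk '{print $NF}'
}

# Illustrative sample line, not captured from a real server.
sample='SNMPv2-SMI::enterprises.674.10893.1.1.130.1.1.5.1 = INTEGER: 1'

value=$(printf '%s\n' "$sample" | snmp_value)
echo "controller state code: $value"
```

Against a live server, the same helper would sit at the end of a pipeline such as `snmpget -v1 -c <community> <host> <oid> | snmp_value`.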
The second plugin is called check_om.py and it checks the overall chassis status. If it is non-OK, it will then check other status indicators in order to create an error message that indicates where the problem lies. It has the ability to check for power supply, voltage, cooling device, temperature, memory, and intrusion issues. It works great, and we now use it!
Now I needed a way to report on the status of the drive arrays, because check_om.py doesn’t do that. I found a couple of plugins that would check the RAID controller locally or would do it for Linux servers. Then I finally found this check_win_perc plugin posted on a Dell mailing list site. It has a number of really good features, like telling which drive in the RAID array is having problems, but it also has some quirks. For one thing, it stores baseline information in a temp file that must be manually deleted. To work in our environment it needed some cleanup and modification.
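The baseline idea can be sketched in a few lines, assuming the drive states are already in hand as a list of numeric codes. The function name compare_to_baseline, the file name, and the sample codes are all hypothetical; this is a simplified illustration of the approach, not the plugin's actual code.

```shell
#!/bin/bash
# Sketch of the baseline idea: the first run records the drive states,
# later runs compare against that record until the file is deleted.
# Usage (hypothetical helper): compare_to_baseline FILE STATE...
compare_to_baseline() {
    local file=$1
    shift
    local current=("$@")

    # No baseline yet: write one state code per line and warn.
    if [ ! -e "$file" ]; then
        printf '%s\n' "${current[@]}" > "$file"
        echo "WARNING - wrote first baseline to $file"
        return 1
    fi

    # Read the stored baseline back, one code per line.
    local previous=()
    local line
    while IFS= read -r line; do
        previous+=("$line")
    done < "$file"

    # Report every position whose state differs from the baseline.
    local changed=0 i
    for ((i = 0; i < ${#current[@]}; i++)); do
        if [ "${current[i]}" != "${previous[i]:-}" ]; then
            echo "CRITICAL - drive $((i + 1)) is ${current[i]}, was ${previous[i]:-missing}"
            changed=1
        fi
    done
    [ "$changed" -eq 0 ] && echo "OK - all drives match baseline"
    return $((changed * 2))
}

# Demo with made-up state codes (3 = Online, 1 = Ready, 2 = Failed).
demo_dir=$(mktemp -d)
baseline="$demo_dir/host_1_SERIAL.txt"
first=0;  compare_to_baseline "$baseline" 3 3 3 1 || first=$?
second=0; compare_to_baseline "$baseline" 3 3 3 1 || second=$?
third=0;  compare_to_baseline "$baseline" 3 2 3 1 || third=$?
echo "exit codes: $first $second $third"
rm -rf "$demo_dir"
```

The three calls show the life cycle: the first writes the baseline and warns, the second matches it, and the third stays critical until the baseline file is removed.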
I modified the plugin to better handle passing of SNMP community strings. As originally written, it reported all the disks and their status, no matter which array controller they were attached to. I modified the code so that you can select which of two controllers you want to monitor and report on only that controller’s disks. Because my coding skills are non-existent, it still has some unresolved quirks: for example, the number of Global Hot Spares it reports is still counted across all controllers, which is wrong.
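One possible way to address the hot spare quirk, sketched under assumptions: if each disk's controller number and spare state were available side by side, the count could be limited to the chosen controller. The two-column input format and the name count_global_spares are hypothetical; the value 3 for a global hot spare is taken from the script's own case statement. The sketch also compares numbers exactly rather than by regular expression, so controller 1 would not match a value such as 10.

```shell
#!/bin/bash
# count_global_spares CONTROLLER reads "controller spare_state" pairs on
# stdin (a hypothetical format, one disk per line) and counts the disks
# on that controller whose spare state is 3 (global hot spare).
count_global_spares() {
    awk -v ctl="$1" '$1 == ctl && $2 == 3 {n++} END {print n + 0}'
}

# Made-up sample: three disks on controller 1, two on controller 2.
sample='1 1
1 1
1 3
2 1
2 3'

spares_c1=$(printf '%s\n' "$sample" | count_global_spares 1)
spares_c2=$(printf '%s\n' "$sample" | count_global_spares 2)
echo "controller 1: $spares_c1, controller 2: $spares_c2"
```

Building the pairs from a live server would mean lining up the two snmpwalk results the script already fetches, which only works if both tables list the disks in the same order. That is an assumption I have not verified.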
My modified code is listed below. Please use at your own risk! If you make any modifications or enhancements please let me know.
#!/bin/bash
#
# Script to check the Windows Dell-PERC for current status
#
# Original by: Lewis Getschel
# Modified by: Ken Nerhood
# Date: 05/11/2005
# Parameters: 1 - the IP address of the system to check
# 2 - snmp community string
# 3 - controller num (from .1.3.6.1.4.1.674.10893.1.1.130.1.1.1)
#
# Version History:
# 12/29/2004 Keeping a temp file seemed the best way to go on this. This
# LG allows seeing changes. I initially didn't show the number of
# Global/Dedicated HotSpares, but I realized that since each
# "at-that-time-purchased" group had different standards for how
# they were configured I needed to see the actual numbers of spares
#
# Notes: The "baseline" (the temp file) is never actually replaced
# anywhere in this code. If a new baseline is desired, then
# simply delete the appropriate temp file. This routine will
# create a NEW baseline (/tmp) file, and use that onward.
#
# Additional note:
# Whenever something changes on the array (ready to offline, etc)
# 2 things happen:
# 1) Nagios goes to critical state
# 2) Nagios will STAY that way until you delete (or rename) the
# 'baseline' file in /tmp
# I just leave it that way until the new drive arrives, then
# I delete the file. I let the "new config" be the Warning
# state for the 1st check, that way it shows up better in the
# event log.
#
#
# 05/11/2005 Added additional parameters to allow for easier configuration.
# KBN You need to specify which array controller you want to monitor,
# currently the script will only handle 2 controllers.
# The script will now return a warning state if the controller
# reports a severity level different than OK. This is to handle
# the case where the baseline matches, but controller is not yet
# OK (i.e. when rebuilding)
#
#
# =================================== Script starts below ================================
#
systemdifferences=0
hostnam=$1
communitystring=$2
arraynum=$3
# echo $1 >> /tmp/nagios_event_debug.txt
# echo --- `date` --- >> /tmp/nagios_event_debug.txt
if [ "$#" -lt "3" ]; then
echo "Usage: check_win_perc host community arraynumber"
exit 3
fi
# these system statuses don't hold after a reboot!
currentsystemstatus=`/usr/bin/snmpget -v1 -c $communitystring $hostnam\:161 1.3.6.1.4.1.674.10893.1.1.130.1.1.5.$arraynum | awk '{print $NF}'`
# NOTE: the original repeated the query above, so both values were always equal; neither is used further down.
previoussystemstatus=$currentsystemstatus
system_serial_number=`snmpwalk -v 1 -c $communitystring $hostnam .1.3.6.1.4.1.674.10892.1.300.10.1.11 | awk '{print $NF}' | sed 's/\"//g'`
if [ $arraynum -eq "1" ]; then
contl1severity=`/usr/bin/snmpwalk -v1 -c $communitystring $hostnam\:161 1.3.6.1.4.1.674.10893.1.1.130.1.1.6.$arraynum | awk '{print $NF}'`
contl1drives=`/usr/bin/snmpwalk -v1 -c $communitystring $hostnam\:161 1.3.6.1.4.1.674.10893.1.1.130.5.1.7 | awk '{print $NF}' | awk 'BEGIN {x=0} /'$arraynum'/ {++x} END {print x}'`
contl1name=`/usr/bin/snmpget -v1 -c $communitystring $hostnam\:161 1.3.6.1.4.1.674.10893.1.1.130.1.1.2.$arraynum | awk -F\" '{print $2}'`
for ((a=1; a <= contl1drives ; a++)) # C-style loop; variable names need no "$" inside (( )).
do
current_disks_state[${a}]=`/usr/bin/snmpget -v1 -c $communitystring $hostnam\:161 1.3.6.1.4.1.674.10893.1.1.130.4.1.4.${a} | awk '{print $NF}'`
done # A construct borrowed from 'ksh93'.
# === if there is a previousdata file for previous run, read it in.
if [ -e /tmp/${hostnam}_${arraynum}_$system_serial_number.txt ]; then
for ((a=1; a <= contl1drives ; a++)) # C-style loop; variable names need no "$" inside (( )).
do
previous_disks_state[${a}]=`/bin/sed -ne ${a}p /tmp/${hostnam}_${arraynum}_$system_serial_number.txt`
done
previousdata=1
else # no previous file data, make it now from current (or should I make it manually as 4 3 3 3 1 ..??)
currentdrive=1
previousdata=0
/bin/touch /tmp/${hostnam}_${arraynum}_$system_serial_number.txt
while [ $currentdrive -le $contl1drives ]
do
echo ${current_disks_state[$currentdrive]} >> /tmp/${hostnam}_${arraynum}_$system_serial_number.txt
currentdrive=`expr $currentdrive + 1`
done
echo "WARNING - PERC array wrote first status file for /tmp/${hostnam}_${arraynum}_$system_serial_number"
exit 1
fi
totalhotspares=`/usr/bin/snmpwalk -c $communitystring -v 1 $hostnam 1.3.6.1.4.1.674.10893.1.1.130.4.1.22 | awk '{print $NF}'| awk 'BEGIN {x=0} /3/ {++x} END {print x}'`
#totaldedicatedspares=`/usr/bin/snmpwalk -c $communitystring -v 1 $hostnam 1.3.6.1.4.1.674.10893.1.1.130.4.1.22 | awk '{print $NF}'| awk 'BEGIN {x=0} /4/ {++x} END {print x}'`
# ========= If current status != previous status then it's Broken, figure out where =============
# except for the FIRST time this script runs, this code only runs because of a mismatch in states
# it seems safe to assume that I should check each array position for where the problem is.
currentdrive=1
while [ $currentdrive -le $contl1drives ]
do
if [ ${current_disks_state[$currentdrive]} -ne ${previous_disks_state[$currentdrive]} ]; then
systemdifferences=1
echo -n `/usr/bin/snmpget -v1 -c $communitystring $hostnam\:161 1.3.6.1.4.1.674.10893.1.1.130.4.1.2.$currentdrive | awk -F\" '{print $2}'`" "
case "${current_disks_state[$currentdrive]}" in
"0" )
echo -n "Unknown";;
"1" )
echo -n "Ready"
case "`/usr/bin/snmpget -v1 -c $communitystring $hostnam\:161 1.3.6.1.4.1.674.10893.1.1.130.4.1.22.$currentdrive | awk '{print $NF}'`" in
"1" )
echo -n "-member of virtual disk.";;
"2" )
echo -n "-member of disk group.";;
"3" )
echo -n "-global hot spare.";;
"4" )
echo -n "-dedicated hot spare.";;
* )
echo -n "Bad_ERROR_Code.";;
esac;;
"2" )
echo -n "Failed";;
"3" )
echo -n "Online";;
"4" )
echo -n "Offline";;
"6" )
echo -n "Degraded";;
"7" )
echo -n "Recovering";;
"11" )
echo -n "Removed";;
"15" )
echo -n "Resyncing";;
"24" )
echo -n "Rebuild";;
"25" )
echo -n "No Media";;
"26" )
echo -n "Formatting";;
"28" )
echo -n "Diagnostics";;
"35" )
echo -n "Initializing";;
* )
echo -n "Bad_ERROR_Code";;
esac
echo -n " Was: "
case "${previous_disks_state[$currentdrive]}" in
"0" )
echo -n "Unknown. ";;
"1" )
echo -n "Ready. ";;
"2" )
echo -n "Failed. ";;
"3" )
echo -n "Online. ";;
"4" )
echo -n "Offline. ";;
"6" )
echo -n "Degraded. ";;
"7" )
echo -n "Recovering. ";;
"11" )
echo -n "Removed. ";;
"15" )
echo -n "Resyncing. ";;
"24" )
echo -n "Rebuild. ";;
"25" )
echo -n "No Media. ";;
"26" )
echo -n "Formatting. ";;
"28" )
echo -n "Diagnostics. ";;
"35" )
echo -n "Initializing. ";;
* )
echo -n "Bad_ERROR_Code. ";;
esac
fi
currentdrive=`expr $currentdrive + 1`
done
if [ $systemdifferences -eq 0 ];
then
case $contl1severity in
"0" )
echo "OK - $contl1name Drives=$contl1drives, Global HotSpares=$totalhotspares"; exit 0;;
"1" )
echo "Warning - $contl1name Controller"; exit 1;;
"2" )
echo "Error - $contl1name Controller"; exit 2;;
"3" )
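echo "Failure - $contl1name Controller"; exit 2;;
* )
echo "UNKNOWN - $contl1name Controller reported unexpected severity: $contl1severity"; exit 3;;
esac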
else
echo ""
exit 2
fi
else
contl2severity=`/usr/bin/snmpwalk -v1 -c $communitystring $hostnam\:161 1.3.6.1.4.1.674.10893.1.1.130.1.1.6.$arraynum | awk '{print $NF}'`
contl2name=`/usr/bin/snmpget -v1 -c $communitystring $hostnam\:161 1.3.6.1.4.1.674.10893.1.1.130.1.1.2.$arraynum | awk -F\" '{print $2}'`
contl1drives=`/usr/bin/snmpwalk -v1 -c $communitystring $hostnam\:161 1.3.6.1.4.1.674.10893.1.1.130.5.1.7 | awk '{print $NF}' | awk 'BEGIN {x=0} /1/ {++x} END {print x}'`
contl2drives=`/usr/bin/snmpwalk -v1 -c $communitystring $hostnam\:161 1.3.6.1.4.1.674.10893.1.1.130.5.1.7 | awk '{print $NF}' | awk 'BEGIN {x=0} /'$arraynum'/ {++x} END {print x}'`
d=$contl1drives
for ((a=1; a <= contl2drives ; a++)) # C-style loop; variable names need no "$" inside (( )).
do
let d=$contl1drives+$a
current_disks_state[${a}]=`/usr/bin/snmpget -v1 -c $communitystring $hostnam\:161 1.3.6.1.4.1.674.10893.1.1.130.4.1.4.$d | awk '{print $NF}'`
done # A construct borrowed from 'ksh93'.
# === if there is a previousdata file for previous run, read it in.
if [ -e /tmp/${hostnam}_${arraynum}_$system_serial_number.txt ]; then
for ((a=1; a <= contl2drives ; a++)) # C-style loop; variable names need no "$" inside (( )).
do
previous_disks_state[${a}]=`/bin/sed -ne ${a}p /tmp/${hostnam}_${arraynum}_$system_serial_number.txt`
done
previousdata=1
else # no previous file data, make it now from current (or should I make it manually as 4 3 3 3 1 ..??)
currentdrive=1
previousdata=0
/bin/touch /tmp/${hostnam}_${arraynum}_$system_serial_number.txt
while [ $currentdrive -le $contl2drives ]
do
echo ${current_disks_state[$currentdrive]} >> /tmp/${hostnam}_${arraynum}_$system_serial_number.txt
currentdrive=`expr $currentdrive + 1`
done
echo "WARNING - PERC array wrote first status file for /tmp/${hostnam}_${arraynum}_$system_serial_number"
exit 1
fi
totalhotspares=`/usr/bin/snmpwalk -c $communitystring -v 1 $hostnam 1.3.6.1.4.1.674.10893.1.1.130.4.1.22 | awk '{print $NF}'| awk 'BEGIN {x=0} /3/ {++x} END {print x}'`
#totaldedicatedspares=`/usr/bin/snmpwalk -c $communitystring -v 1 $hostnam 1.3.6.1.4.1.674.10893.1.1.130.4.1.22 | awk '{print $NF}'| awk 'BEGIN {x=0} /4/ {++x} END {print x}'`
currentdrive=1
while [ $currentdrive -le $contl2drives ]
do
if [ ${current_disks_state[$currentdrive]} -ne ${previous_disks_state[$currentdrive]} ]; then
systemdifferences=1
let c2currentdrive=$contl1drives+$currentdrive
echo -n `/usr/bin/snmpget -v1 -c $communitystring $hostnam\:161 1.3.6.1.4.1.674.10893.1.1.130.4.1.2.$c2currentdrive | awk -F\" '{print $2}'`" "
case "${current_disks_state[$currentdrive]}" in
"0" )
echo -n "Unknown";;
"1" )
echo -n "Ready"
case "`/usr/bin/snmpget -v1 -c $communitystring $hostnam\:161 1.3.6.1.4.1.674.10893.1.1.130.4.1.22.$c2currentdrive | awk '{print $NF}'`" in
"1" )
echo -n "-member of virtual disk.";;
"2" )
echo -n "-member of disk group.";;
"3" )
echo -n "-global hot spare.";;
"4" )
echo -n "-dedicated hot spare.";;
* )
echo -n "Bad_ERROR_Code.";;
esac;;
"2" )
echo -n "Failed";;
"3" )
echo -n "Online";;
"4" )
echo -n "Offline";;
"6" )
echo -n "Degraded";;
"7" )
echo -n "Recovering";;
"11" )
echo -n "Removed";;
"15" )
echo -n "Resyncing";;
"24" )
echo -n "Rebuild";;
"25" )
echo -n "No Media";;
"26" )
echo -n "Formatting";;
"28" )
echo -n "Diagnostics";;
"35" )
echo -n "Initializing";;
* )
echo -n "Bad_ERROR_Code";;
esac
echo -n " Was: "
case "${previous_disks_state[$currentdrive]}" in
"0" )
echo -n "Unknown. ";;
"1" )
echo -n "Ready. ";;
"2" )
echo -n "Failed. ";;
"3" )
echo -n "Online. ";;
"4" )
echo -n "Offline. ";;
"6" )
echo -n "Degraded. ";;
"7" )
echo -n "Recovering. ";;
"11" )
echo -n "Removed. ";;
"15" )
echo -n "Resyncing. ";;
"24" )
echo -n "Rebuild. ";;
"25" )
echo -n "No Media. ";;
"26" )
echo -n "Formatting. ";;
"28" )
echo -n "Diagnostics. ";;
"35" )
echo -n "Initializing. ";;
* )
echo -n "Bad_ERROR_Code. ";;
esac
fi
currentdrive=`expr $currentdrive + 1`
done
if [ $systemdifferences -eq 0 ];
then
case $contl2severity in
"0" )
echo "OK - $contl2name Drives=$contl2drives, Global HotSpares=$totalhotspares"; exit 0;;
"1" )
echo "Warning - $contl2name Controller"; exit 1;;
"2" )
echo "Error - $contl2name Controller"; exit 2;;
"3" )
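echo "Failure - $contl2name Controller"; exit 2;;
* )
echo "UNKNOWN - $contl2name Controller reported unexpected severity: $contl2severity"; exit 3;;
esac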
else
echo ""
exit 2
fi
fi
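The script repeats the same state-to-label case statement four times. As a minimal sketch of how that could be folded into one place, the function below maps a disk state code to the label the script prints. The codes and labels are copied from the case statements above; the function name disk_state_name is hypothetical.

```shell
#!/bin/bash
# disk_state_name CODE prints the label the script uses for an array
# disk state code (codes and labels copied from the script above).
disk_state_name() {
    case "$1" in
        0)  echo "Unknown" ;;
        1)  echo "Ready" ;;
        2)  echo "Failed" ;;
        3)  echo "Online" ;;
        4)  echo "Offline" ;;
        6)  echo "Degraded" ;;
        7)  echo "Recovering" ;;
        11) echo "Removed" ;;
        15) echo "Resyncing" ;;
        24) echo "Rebuild" ;;
        25) echo "No Media" ;;
        26) echo "Formatting" ;;
        28) echo "Diagnostics" ;;
        35) echo "Initializing" ;;
        *)  echo "Bad_ERROR_Code" ;;
    esac
}

# Example: a drive that was Online (3) and is now Failed (2).
message="$(disk_state_name 2) Was: $(disk_state_name 3)."
echo "$message"
```

With a helper like this, both controller branches could call the same function for the current and the previous state instead of carrying their own copies of the table.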
Tags: dell, hardware, monitoring, nagios, opensource, plugin
April 4, 2006 at 9:36 am
Great post, very nice plugins! Thanks man, you saved me at least a few hours of searching

I also really like the post on Nagiosgraph… You totally changed my opinion on blogs
April 5, 2006 at 1:14 pm
I’m glad that you found what I did helpful. In both cases (with the Dell plugins and the Nagiosgraph) it has been a long time since I’ve even looked at the code. It just works for us. Hopefully it will work as reliably for you. If you make any modifications please let me know.
May 3, 2006 at 6:20 am
Hi,
I’d like to use check_dell.pl, but unfortunately I can’t download the plugin from SourceForge.
Can you please e-mail the plugin to me?
Thank you very much.
Alex
May 3, 2006 at 8:28 am
Alex,
The plugin is not on SourceForge, but on my site. It is listed above at the very bottom of the post, but I’ll list it here as well: download the code check_win_perc. I’ll also email you the code.
I hope it works for you.
February 20, 2007 at 5:42 pm
I’m having issues running the code above. For some reason, version 4.x of the Dell OpenManage software doesn’t seem to like the SNMP OID .1.3.6.1.4.1.674.10893.1.1.130.1.1.1
Has this been updated to reflect the newer version of dell openmanage?
Thanks!
February 20, 2007 at 5:54 pm
Electro,
I’m successfully running this on new (last 6 months) Dell PowerEdge 1850 w/ OM version 4.5 without any problems. It has been well over a year since I’ve looked at this code, so it’s all a little rusty for me, but it has been fine across my 18 different servers.
Can you get anything off the Dell MIB variables when querying them? Let me ask a dumb question: if you are running Windows on the box, do you have the SNMP service started and configured to allow requests from your management/Nagios station? Do you have the firewall running, and if so, are the right ports open?
Let me know how it goes.
–ken
February 21, 2007 at 3:38 pm
Hi, I would like to use the check_dell.pl script, but unfortunately, I cannot find it. The link above (in the post) points to source forge - which is inaccessible to non-members of the project.
Could you kindly mail it to me, or point me to an alternative resource from where I could get it?
Thanks,
Oscar
October 7, 2007 at 9:19 pm
Hi, to second that.
The scripts are no longer available. Are you able to mirror them or send them to me?
Cheers.
October 11, 2007 at 8:24 am
If you’re looking for some of the scripts that I mentioned in the original article, you may want to check out NagiosExchange. They have tons of stuff, including an entire section on Dell hardware monitoring.
March 25, 2008 at 5:22 pm
Hi Ken,
I just came across your code. I have a Nagios server running on Linux and put your code there. My client is a Dell 2650 running Win2k3 Enterprise, with SNMP enabled, OpenManage Server running, and everything. When I run your code I get:
Error in packet
Reason: (noSuchName) There is no such variable name in this MIB.
Failed object: SNMPv2-SMI::enterprises.674.10893.1.1.130.1.1.5.1
Any ideas?