HOWTO: Modifying Nagios HostGroup Services to find system issues
Let’s assume for a moment that you have a working Nagios configuration. You might be monitoring all of a domain, such as a TEST domain, and using a HOSTGROUP of TST_SERVERS to assign SERVICES to all hosts. By adding a default check to the entire system, you can start looking for anomalies.
Scenario:
You want to confirm that Trend Micro OfficeScan has been successfully deployed to all systems. Equally, you want to ensure that no one has disabled, stopped, or otherwise affected the services.
1) Login to Nagios XI Core Configuration Manager as NagiosAdmin
2) Click on SERVICES on the LEFT. Filter for TST to find all existing TST_* services.
Click CLONE next to “TST_SERVER_SVC_VMTOOLS”.
3) Find the name of the Windows Service you want to monitor – such as “NTRTSCAN” to monitor “OfficeScan NT RealTime Scan”.
4) Modify the service:
Modify the service to include “NTRTSCAN” and modify the DESCRIPTION/DISPLAY and check “ACTIVE”. Click SAVE. Then click APPLY CONFIGURATION.
5) Now, if you search for “Service – Office”:
You can see which hosts don’t have the service running. Those would be the systems you want to target.
HOWTO: Mass configure Nagios for advanced monitoring
As you’re aware, Nagios is a pretty decent and freely available network monitoring tool. Most people aren’t aware of the best way to use it or configure it in bulk. I have an opportunity to add an entire TEST domain to be monitored (but not alerted on), and this seems like a great time to do up some documentation.
We’re going to make the following assumptions:
- These are all Windows Server hosts of some sort – 2003/2008/2008R2/2012/2012R2 – maybe 2000, who knows.
- These can all run the Nagios NSclient++ – which can be distributed via GPO based MSI installation and INI file creation if needed.
- The systems are generally fairly standardized, with the following standards:
- Disks are C: OS and E: for Data, with D: for Optical
- SQL Servers have an H: for System DB, I: for User DB, and L: for Logs
- IIS Servers exist, and have INETPUB on E:\
To accommodate this we’re going to make use of the HOSTS, HOSTGROUPS, SERVICES, and SERVICEGROUPS in Nagios. When we do this, we’ll see the following:
- Services are created – C:, E:, RAM, CPU, etc.
- Services are added to ServiceGroups
- ServiceGroups would be like SERVER_BASE, SERVER_SQL, and SERVER_IIS
- Hosts are added for each host
- Hosts are added to HostGroups
- 3 SQL servers might be added to the SERVER_SQL host group
- A HostGroup such as SERVER_SQL would be assigned the base Service Group SERVER_BASE as well as SERVER_SQL
- We’re going to preface these with “TST” to specify these settings are for the TST domain. We could use a common one across all environments, but then we wouldn’t be able to modify the TST ones in advance to validate changes prior to promoting into Production.
1) Login to NagiosXI as your normal user
2) Click on CONFIGURE on the top, then CORE CONFIG MANAGER on the left, and then login as “nagiosadmin”.
3) On the left, under MONITORING click HOST GROUPS.
Click ADD NEW.
Give your HOSTGROUPNAME and DESCRIPTION. Here we’re going to create HOSTGROUPS for “TST_SERVER”, “TST_SERVER_DC”, “TST_SERVER_SQL”, etc.
4) Click SERVICE GROUPS under MONITORING on the left. Then click ADD NEW.
5) Name the Service Group and give it a description. Click SAVE.
Repeat this for your other service groups – presumably TST_SERVER_SQL and TST_SERVER_IIS as examples.
Click on the APPLY CONFIGURATION button.
6) Next, we’ll add some Services. We’ll make the assumption that all previously existing services are added on a 1:1 basis to hosts, and thus aren’t really what we want to use, other than perhaps as a template. Click on SERVICES under MONITORING on the left. Search for “CPU” to find an existing CPU sensor if one exists. Click on it to edit it:
Here you can see how the service is configured – we’re running the “check_xi_service_nsclient” command, using arguments where $ARG1$ is a password hash for nsclient, CPULOAD is the sensor, and “-l 5,85,95” indicates a check of the 5 minute average, with a warning at 85% and critical at 95%. The most important thing here is that your NSCLIENT hash matches what is on your systems. Click ABORT.
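If you want to try the same arguments from the Nagios server's shell, here is a minimal sketch. It assumes that check_xi_service_nsclient is a thin wrapper around the check_nt plugin, that the plugin lives in /usr/local/nagios/libexec, and that NSClient++ listens on its default port 12489. None of that comes from the screens above, so compare it against your own command definition. The host name FSRVTST1 in the usage note is hypothetical.

```shell
#!/bin/sh
# Sketch: print the check_nt command line that corresponds to the ARG
# values described above. Assumptions: plugin path and port 12489.
build_check_nt() {
    host=$1      # host name or address to query
    secret=$2    # NSClient++ password ($ARG1$)
    variable=$3  # check_nt variable such as CPULOAD ($ARG2$)
    shift 3      # anything left is passed through as $ARG3$
    printf '%s -H %s -p 12489 -s %s -v %s' \
        /usr/local/nagios/libexec/check_nt "$host" "$secret" "$variable"
    [ $# -gt 0 ] && printf ' %s' "$@"
    printf '\n'
}
```

For example, build_check_nt FSRVTST1 yourpassword CPULOAD -l 5,85,95 prints a command you can paste and run by hand to confirm the password and thresholds before saving the service.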
Click ADD NEW:
Configure the CONFIG NAME to specify that it belongs to TST_SERVER and call it _CPU. The Description you likely want to be something that will display in a human readable format. Enter your ARG’s as shown. Click TEST CHECK COMMAND.
Enter the hostname of a system to check as a test. Click OK
Verify you get OUTPUT and click close.
Click on the CHECK SETTINGS tab:
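Change the INITIAL STATE. Be aware that on a service U stands for “UNKNOWN”, not “UP” (the choices are O for OK, W for WARNING, U for UNKNOWN and C for CRITICAL), so choose O if you want Nagios to assume the service is good until it knows otherwise.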
Change the CHECK INTERVAL to 5, RETRY INTERVAL to 1, and MAX CHECK ATTEMPTS to 5. Change the CHECK PERIOD to 24×7.
Click COMMON SETTINGS tab again.
Click MANAGE SERVICEGROUPS:
Click the SERVICE GROUP to add to, and click ADD SELECTED. You’ll see it show up under ASSIGNED. Click CLOSE. Click SAVE.
Enter TST_SERVER in the search box. You should now find your new config. Click COPY and then EDIT the copy to configure additional services using the “check_xi_service_nsclient” command, with the following options:
- CPULOAD / -l 5,85,95
- MEMUSE / -w 80,95
- USEDDISKSPACE / -l C -w 80 -c 90
- SERVICESTATE / -d SHOWALL -l VMTOOLS
- UPTIME
Ensure when copying and modifying that you check the ACTIVE box!
Click MANAGE HOSTGROUPS.
Add your SERVER GROUP and click CLOSE.
You should now have a number of services:
Click APPLY CONFIGURATION.
7) Click on the left under MONITORING and click HOSTS.
Click ADD NEW:
Enter the HOST NAME/DESCRIPTION/ADDRESS/DISPLAY NAME. I like to use UPPER CASE for the HOSTNAME and lower case for the FQDN portion. While you can use shorter names and only use FQDN for the “address” field to find the host, consider a situation where you may have “WSUS.PROD.LOCAL”, “WSUS.TEST.LOCAL”, and “WSUS.DEV.LOCAL” – if you get an alert for “WSUS”, which host is it? Ensure you select a basic CHECK COMMAND, something like “check_xi_service_ping” or “check-host-alive” for a basic ping service. Click ACTIVE. Click on the CHECK SETTINGS tab:
Change the INITIAL STATE to U, CHECK INTERVAL to 5, RETRY INTERVAL to 1, MAX CHECK ATTEMPTS to 1, and change CHECK PERIOD to 24×7. Click COMMON SETTINGS tab.
Click MANAGE HOSTGROUPS
Find the HOST GROUP and click ADD SELECTED. Click CLOSE. Then click SAVE.
8) If you now go back to NAGIOS itself, you can QUICK FIND for “FSRVTST” to find all hosts with this substring:
Here you can see how the two DCs not only have the TST_SERVER services, but also have the TST_SERVER_DC services. Note that the hosts that are NOT members of the TST_SERVER_DC hostgroup do not show the services assigned to that HOST GROUP.
So from here, what do you do?
To add/modify SERVICES to a HOST GROUP:
- Add new SERVICES
- Assign the SERVICES to a HOST GROUP
To add new HOST GROUPS:
- Create any new HOST GROUPS as needed.
- Add new SERVICES
- Assign the SERVICES to a HOST GROUP
To add new HOSTS:
- Add new hosts, in bulk, and add them to the HOST GROUP – and the services and settings are all done.
You can see how now that these templates are created, it would be very simple to create monitoring by policy. Suppose you want to change the Warning/Critical limits from 80/90% on disk space to 85/95 – you now change the service assigned to the HOST GROUP, and you’re finished – whether it is 1 or 100 hosts.
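To give a feel for what “in bulk” can look like outside the web UI, here is a rough sketch that turns a list of names and addresses into plain Nagios host definitions that all join one host group. The template name generic-host and the host names in the usage note are assumptions for illustration, and how (or whether) you import the resulting file into Nagios XI depends on your install.

```shell
#!/bin/sh
# Sketch: read "HOSTNAME ADDRESS" pairs on stdin and print one Nagios
# host definition per pair. The "generic-host" template is an assumption.
make_host_defs() {
    hostgroup=$1
    while read -r name address; do
        [ -n "$name" ] || continue   # skip blank lines
        cat <<EOF
define host {
    use         generic-host
    host_name   $name
    alias       $name
    address     $address
    hostgroups  $hostgroup
}
EOF
    done
}
```

Something like printf 'FSRVTSTSQL1 10.0.0.11\n' | make_host_defs TST_SERVER_SQL would emit one block. Because the services hang off the host group, nothing else needs to be defined per host.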
In order for this all to work though, you MUST HAVE STANDARDS. For example, I noted that servers in this environment have a C: and E: drive, and D: is optical. So we can see one of our servers has a “code 139 out of bounds” on E: drive:
When we check the server, we see very clearly that what should be E: is D:. One could suggest you simply modify Nagios. However, the CORRECT plan of action would be to FIX the DEVIATION. If you do not, other scripts, assumptions, monitoring, tools, etc. will ALSO be incorrect. So if nothing else, you may utilize this method of monitoring to help you locate deficiencies.
HOWTO: Install and distribute Nagios NSCP Agent to all hosts for monitoring with Nagios XI
This HOWTO will cover using a batch file and PSexec to distribute the Nagios NSCP v0.4.1.101 agent to all systems and configure it for use.
The following assumptions are made:
• An existing NSC.INI exists – possibly from a previous site or configuration, configured as desired.
o Configuration of the NSC.INI is not in scope for this document. This document covers distribution and installation of a known working configuration
• You will be using the NSCP v0.4.1.101 MSI packages from:
o http://www.nsclient.org/files/stable//NSCP-0.4.1.101-x64.msi
o http://www.nsclient.org/files/stable//NSCP-0.4.1.101-Win32.msi
• The Windows Firewall will require an exception to allow the NSCP.exe to communicate
• You need to support both 32 and 64 bit environments
• As always, as the BAT is referencing a network shared MSI file, you MUST call PSEXEC with a username and password, or it will not accurately find the MSI package.
===== INSTALL_NAGIOS.BAT =====
@echo off
REM
REM PSEXEC Usage:
REM E:\ADMIN\BIN\PSEXEC.EXE \\SERVER -u DOMAIN\user -h -f -d -c \\FSRVTSTWSUS1\INSTALLS\NSCLIENT\INSTALL_NSCLIENT.BAT
REM
REM
REM Set variables to make updates easier
REM
set INSTALL_SERVER=FSRVTSTWSUS1
set INSTALL_SHARE=INSTALLS
set INSTALL_FOLDER=NSCLIENT
set INSTALL_LOG=INSTALL_NSCLIENT.LOG
REM
REM Check 32/64bit
REM
if %PROCESSOR_ARCHITECTURE% == AMD64 goto 64BIT
goto 32BIT
:64BIT
echo Installing Nagios nscp 64bit on %COMPUTERNAME% at %DATE% %TIME% >> \\%INSTALL_SERVER%\%INSTALL_SHARE%\%INSTALL_FOLDER%\%INSTALL_LOG%
msiexec /i \\%INSTALL_SERVER%\%INSTALL_SHARE%\%INSTALL_FOLDER%\v0.4.1.101\64bit\NSCP-0.4.1.101-x64.msi /qb
goto COMMON
:32BIT
echo Installing Nagios nscp 32bit on %COMPUTERNAME% at %DATE% %TIME% >> \\%INSTALL_SERVER%\%INSTALL_SHARE%\%INSTALL_FOLDER%\%INSTALL_LOG%
msiexec /i \\%INSTALL_SERVER%\%INSTALL_SHARE%\%INSTALL_FOLDER%\v0.4.1.101\32bit\NSCP-0.4.1.101-win32.msi /qb
goto COMMON
:COMMON
REM
REM netsh firewall is deprecated, but referenced just in case
REM netsh firewall add allowedprogram "C:\Program Files\nsclient++\nscp.exe" "Nagios NSCP Agent" enable
REM
REM Open the firewall for NSCP Agent
REM
netsh advfirewall firewall add rule name="Nagios NSCP Agent" dir=in action=allow program="C:\Program Files\nsclient++\nscp.exe" profile=Domain
REM
REM Copy the INI file in
REM
xcopy \\%INSTALL_SERVER%\%INSTALL_SHARE%\%INSTALL_FOLDER%\NSC.ini "C:\Program Files\nsclient++\" /s/e/c/k/i/y
REM
REM Install the service and start it
REM
"C:\Program Files\nsclient++\nscp.exe" service --install
"C:\Program Files\nsclient++\nscp.exe" service --stop
"C:\Program Files\nsclient++\nscp.exe" service --start
goto END
:END
HOWTO: Correct vSphere ESXi "the ramdisk ‘tmp’ is full"
Tonight we came across an odd error on one of the hosts. We use Nagios for monitoring, and only one host was exhibiting an error checking networking on the host:
Can’t call method "network" on an undefined value at ./check_esx3.pl line 865
Checking the host’s EVENTS tab did show the following errors:
The following VMware KB article gives details on the issue:
Following the suggestion from the KB, we see that sure enough, the ramdisk for tmp IS full.
Appears to be the /tmp/scratch/downloads folder in question.
Don’t suppose at 1:25PM the logs were exported by any chance? That’s the only supposition I have at the moment.
I can confirm at the Nagios console that it is having issues with just that host. Note that the Perl method Nagios uses for checking is horrible, as it logs into the host as root to get its stats. I’d fix it to use SOAP or the vSphere APIs, but “Nagios is going away” so it seems like I’d just be wasting effort to invest in Nagios.
After removing the offending file in /tmp/scratch/downloads, Nagios is able to run its checks.
Now – does anyone know how to find out WHY it was full?
HOWTO: Audit NAGIOS for inclusion of all appropriate hosts
From time to time, Nagios (or any monitoring software) will become out of sync with installed systems. The automated method of resolution for this is periodic network scans. However, in many networks that were not designed to accommodate this, scans will generate many false positives (eg: secondary IP’s, logical host names, DNS names without reverse DNS, cluster/heartbeat NIC’s, etc.). The alternative is to compare Nagios against a known list. This is the method I will describe here.
1) Login to Nagios via SSH on SERVERNMS1. Note you must use a non-root account first, and then “su -” to become root after.
2) Change to the /usr/local/nagios/etc/hosts folder (cd /usr/local/nagios/etc/hosts)
[root@servernms1 hosts]# cd /usr/local/nagios/etc/hosts
[root@servernms1 hosts]# pwd
/usr/local/nagios/etc/hosts
3) As Nagios is not configured to either see or share SMB (CIFS) files, we will need to get our list via the console. Perform an “ls -la” to get the full list of files. Then copy the screen text to Excel:
[root@servernms1 hosts]# ls -la
total 1440
drwsrwsr-x 2 apache nagios 20480 Sep 24 12:01 .
drwxrwsr-x 8 apache nagios 4096 Sep 21 08:01 ..
-rw-rw-r-- 1 apache nagios 970 Sep 25 13:29 10.32.0.18.cfg
-rw-rw-r-- 1 apache nagios 1017 Sep 25 13:29 10.8.0.10.cfg
-rw-rw-r-- 1 apache nagios 994 Sep 25 13:29 10.8.0.1.cfg
-rw-rw-r-- 1 apache nagios 1240 Sep 25 13:29 barracuda1.cfg
-rw-rw-r-- 1 apache nagios 1240 Sep 25 13:29 barracuda2.cfg
-rw-rw-r-- 1 apache nagios 1123 Sep 25 13:29 brn-2901-isr.servercorp.ca.cfg
-rw-rw-r-- 1 apache nagios 1161 Sep 25 13:29 brn-2960-01.servercorp.ca.cfg
-rw-rw-r-- 1 apache nagios 1053 Sep 25 13:29 brn-ups-01.servercorp.ca.cfg
-rw-rw-r-- 1 apache nagios 1296 Sep 25 13:29 CAA-ACCESS01.cfg
-rw-rw-r-- 1 apache nagios 1298 Sep 25 13:29 CAA-ACCESS02.cfg
-rw-rw-r-- 1 apache nagios 1298 Sep 25 13:29 CAA-ACCESS03.cfg
-rw-rw-r-- 1 apache nagios 1306 Sep 25 13:29 CAA-ACCESS04.cfg
-rw-rw-r-- 1 apache nagios 1207 Sep 25 13:29 CAA-ACCESS05.cfg
-rw-rw-r-- 1 apache nagios 1270 Sep 25 13:29 CAA-DIST01.cfg
-rw-rw-r-- 1 apache nagios 1315 Sep 25 13:29 CAA-DIST02.cfg
-rw-rw-r-- 1 apache nagios 1228 Sep 25 13:29 CAB-ACCESS-02.cfg
-rw-rw-r-- 1 apache nagios 1291 Sep 25 13:29 CAC-ACCESS01.cfg
NOTE: We have no interest in the first 4 lines – the console prompt, the “total ###” or the “.” or “..” folders.
Once pasted into Excel, it should auto break the text into columns:
4) Highlight the column with the names (I in this example). Click on TEXT TO COLUMNS on the toolbar.
Set the DELIMITER to be “.” (period). This will break apart <servername> and .CFG. Note that many of the devices listed are (incorrectly, in my opinion) listed by IP address, so it will in fact mangle those items.
You will be left with a list that looks like:
5) Delete all columns other than I to leave only the hostnames that Nagios is configured for.
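If you would rather skip the Excel round trip for this half of the list, the same result can be had on the Nagios box itself. This is only a sketch: it assumes one .cfg file per host in the hosts folder, as in the listing above, and it strips just the trailing .cfg, so IP-named files such as 10.8.0.1.cfg come through intact.

```shell
#!/bin/sh
# Sketch: list the host names Nagios is configured for by removing only
# the trailing ".cfg" from each file name in the hosts folder.
nagios_hosts() {
    dir=${1:-/usr/local/nagios/etc/hosts}
    for f in "$dir"/*.cfg; do
        [ -e "$f" ] || continue      # folder had no .cfg files
        basename "$f" .cfg
    done
}
```

Running nagios_hosts > nagios_hosts.lst gives a one-name-per-line file that can be pasted into Column A or compared directly on the command line.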
6) Get a list of servers that should be present. There are many ways to do this, querying AD, querying vCenter, performing a Net View, etc. I personally prefer a “Net View”.
Run “net view | find /I "FSRV" >>SERVERS.LST” to get a list output to a text file.
The output will look similar to:
As you can see, we need to remove both the “\\” as well as the “Description” field. Paste these contents into the XLS file in Column C.
7) Now that the contents are in Excel, we can use Text to Columns again to fix this data:
Highlight Column C: and click TEXT TO COLUMNS.
Choose SPACE as a delimiter as well as OTHER = “\”. Click FINISH.
Delete columns B, C, E-I or more. (Leave only Columns A and D).
8) Now that you have the two columns – A = “In Nagios” and B = “On Network” – we can compare them.
Column C: then needs the formula: =IF(ISERROR(MATCH(B1,A:A, 0)), "No Match", "Match")
This formula basically says “Take the value in B#, and look for it in Column A. If it is found, enter ‘Match’ and if not, enter ‘No Match’.” As you can see, we found some immediate “No Match” entries in our list. Some, in the case of “SERVER1”, are logical references to another host (eg: SERVERSAN1 in this case), and some are in fact simply missing (eg: FSRVCDFAP1).
9) Create a FILTER on ROW 1, and filter COLUMN C for “No Match”
The hosts in COLUMN B are the ones not present in Nagios.
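For anyone who prefers the command line to Excel, the same “No Match” filter can be sketched in a few lines of shell. It assumes you have copied the net view output to the Linux box and have the Nagios names in a text file, one per line. The file names are whatever you choose, and the compare ignores case.

```shell
#!/bin/sh
# Sketch: clean up "net view" output and report the hosts that are on
# the network but have no matching entry in the Nagios list.
clean_net_view() {
    # drop carriage returns, the leading "\\" and everything after the name
    tr -d '\r' | sed -e 's/^\\\\//' -e 's/[[:space:]].*$//'
}

missing_from_nagios() {
    nagios_list=$1
    network_list=$2
    # -v invert match, -x whole line, -i ignore case, -F fixed strings,
    # -f read the patterns (the Nagios names) from a file
    grep -vxiFf "$nagios_list" "$network_list"
}
```

A typical run would be clean_net_view < SERVERS.LST > network.lst followed by missing_from_nagios nagios_hosts.lst network.lst. Logical names that point at another host will still show up and need the same human judgement as in the spreadsheet.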
10) In this case, we are particularly auditing for missing Windows Server Hosts – to be part of the HOSTGROUP “File & Print Servers (Windows) (fpservers)”. (http://servernms1/nagiosxi/includes/components/nagioscore/ui/status.php?hostgroup=fpservers&style=overview)
11) These hosts can then be added to Nagios by means of cloning an existing server in the appropriate HOSTGROUP. Open Nagios XI Configuration Manager, and select HOSTS. Find FSRVVDF1 (a random selection from the above list, but it happens to be the top server).
NOTE: We will be adding FSRVDCFAP1 in this example.
Click on the COPY icon under ACTIONS.
Click on the MODIFY icon next to the new copy.
Modify the HOST NAME and ADDRESS appropriately to reflect the new host SERVERCDFAP1. Click on MANAGE HOSTGROUPS.
Here you can see that because we have cloned the entry, it already is a member of the hostgroup “fpservers”. We could otherwise add/remove hostgroups here. Click CLOSE.
Click MANAGE PARENTS:
This is where you would select a PARENT item. This is useful for setting all devices of a type or location to have a parent device. This parent device can then be modified, set offline, taken out for maintenance, and “all child hosts” can then be set for maintenance as well. Thus, use caution when choosing which original HOST to clone – try to aim for a host that is in the same site. Click CLOSE.
All “*” fields are required. So copy the HOST NAME to the DISPLAY NAME. Ensure you check the box for ACTIVE and click SAVE.
12) You will need to refresh your search, as the new item has changed its name, and your search was for the previous name:
When you do so, you will see the new item as above. If ACTIVE is red and says NO, you failed to check the box. Edit it and do so now.
Check the box next to the HOSTNAME and click APPLY CONFIGURATION.
13) When the configuration applies, you will see:
14) Because this document is not intended to be a “HOWTO: Configure Nagios”, I will not be getting into the finer details of services and SERVICE GROUPS. However if you search for the HOST now, you will see:
You can now add SERVICES as deemed appropriate.
The above steps could just as easily be done with INFRASTRUCTURE list exports, etc.
NOTE: While this is NOT a general NAGIOS HOWTO, it is important to note the following concepts that we simply are not doing well with Nagios today.
SERVICES should be assigned to a SERVICEGROUP.
– EG: CPU, MEMORY, UPTIME, C:, E:, etc should all be SERVICES, and they should all be assigned to a SERVICEGROUP of “WINDOWS SERVERS” as this is our minimum standard
– EG: XENAPP CONNECTED USERS, XENAPP DISCONNECTED USERS, etc, should all be SERVICES and should be assigned to a SERVICEGROUP of “XENAPP SERVERS”.
– EG: MSSQL_CONNECTED_USERS, MSSQL_CPU_BUSY, MSSQL_DATABASE_FREE, MSSQL_IO_BUSY, SQL AGENT SERVICE, SQL SERVICE, etc, should all be SERVICES and be assigned to a SERVICEGROUP of “SQL SERVERS”
HOSTS should be assigned to a HOSTGROUP.
– EG: SERVERXA1 and SERVERXA2 should be assigned to perhaps a few groups – “WINDOWS SERVERS” and “XENAPP SERVERS”
– EG: SERVERDB2, SERVERDB3, SERVERINFDB2, etc should be assigned to groups as well – “WINDOWS SERVERS” (for the bare minimum standard) and “SQL SERVERS” to capture SQL SERVERS services.
With proper use of HOSTGROUPS and SERVICEGROUPS, you can see that adding new servers would be so incredibly easy, almost as to be amazing. The above method, with no services, shows that FOCUS is currently using a method whereby a previous HOST with services is cloned. As there is no grouping or a link to a “template” this means that hosts/services are non-uniform and cannot be updated by updating a central object – which is both time consuming and prone to error. This is something we should fix.
Also, there is a need to standardize on lowercase, or uppercase as Nagios is Linux based and thus, very case sensitive.
HOWTO: Troubleshoot Linux free space issues
As many of us are not Linux administrators (and I’m not admitting to being one myself) it seems prudent to have a quick HOWTO in relation to resolving space issues on a Linux server. Recently we received a Nagios alert that the / (root) partition on the Nagios server itself was full. Here are some steps and tools you can use as basic Linux system administration to locate and resolve these issues.
1) Log in to the box using PuTTY to start an SSH session – using a NON-root account. Root is almost always denied direct login.
2) Once logged in, run “su -” to become root. That’s a single dash character, and the difference between “su” and “su -” is that the “-” indicates “and load all my session variables” such as .profile, etc.
3) First, let’s find out what is USING space – run “df -h” to find “Disk Free”. “-h” means “in Human Readable” form – eg: 10G, 8K, 2T, etc.
[root@servernms1 ~]# df -h
Filesystem Size Used Avail Use% Mounted on
/dev/mapper/VolGroup00-LogVol00
19G 17G 1017M 95% /
/dev/sda1 99M 31M 64M 33% /boot
tmpfs 1.5G 0 1.5G 0% /dev/shm
So we do in fact see that the “mount” of “/” (on disk /dev/mapper/VolGroup00-LogVol00) is in fact 95% full.
4) Next, we want to find out where our space is going. Run a “du / --max-depth=1 -h”. DU is “Disk Usage”, against the “/” folder, “--max-depth=1” means “I don’t care about _details_ about subfolders, just show me a summary of one folder deep”, and again “-h human readable”. Expect this command to take quite a while to run. This is effectively running “TreeSizeFree” on the C: drive.
[root@servernms1 ~]# du / --max-depth=1 -h
8.0K /media
119M /etc
5.3G /usr
0 /misc
4.4G /store
34M /sbin
26M /boot
0 /sys
du: cannot read directory `/proc/10309': No such file or directory
du: cannot read directory `/proc/10310': No such file or directory
0 /proc
8.0K /selinux
20K /mnt
124K /home
0 /net
8.0K /srv
23M /opt
236M /lib
3.4M /tmp
6.1G /var
158M /root
16K /lost+found
64K /dev
114M /bin
17G /
You can expect the “cannot read” errors on a few /proc entries: those belong to processes that exited while du was running, hence “No such file or directory”. Still, what we see here is that there are only 3 folders that are in the GB of size.
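Picking the big folders out of that list by eye works, but sorting helps. A small sketch, assuming GNU du as found on CentOS: sizes are printed in KB so that a plain numeric sort works, and -x keeps du on the one filesystem you are investigating.

```shell
#!/bin/sh
# Sketch: show the N largest first-level folders under a starting point,
# biggest first. The first line is the total for the starting point.
biggest_dirs() {
    start=${1:-/}
    count=${2:-5}
    du -x -k --max-depth=1 "$start" 2>/dev/null | sort -rn | head -n "$count"
}
```

biggest_dirs / 4 would show the total for / followed by the three largest folders, which is where you would drill in next.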
5) Pick one of those folders, change directory to it, and then re-run the same DU command, only specify the current folder “.” vs the root folder “/”. As it is only the current level, you’ll get the detail for the new folder.
[root@servernms1 ~]# cd /var
[root@servernms1 var]# du . --max-depth=1 -h
672K ./db
8.0K ./local
140K ./named
16K ./ftp
24K ./empty
20K ./yp
8.0K ./nis
8.0K ./racoon
36K ./lock
8.0K ./preserve
8.0K ./games
2.3G ./spool
12K ./account
8.0K ./tux
8.0K ./rrdtool
2.3G ./log
8.0K ./opt
144M ./cache
1.4G ./lib
784K ./tmp
26M ./www
388K ./run
6.1G .
Again, only 3 folders in the GB. We know/expect that /var/lib will be large – think C:\WINDOWS\SYSTEM. But /var/spool and /var/log should not be. Likely these are outbound mail files and/or temp/log files.
6) Let’s check on /var/log in the same way:
[root@servernms1 var]# cd /var/log
You have new mail in /var/spool/mail/root
[root@servernms1 log]# du . --max-depth=1 -h
8.0K ./conman
16K ./mail
5.8M ./sa
8.0K ./ppp
19M ./audit
32K ./prelink
16K ./nagios
24K ./cups
8.0K ./vbox
2.1G ./httpd
28K ./news
8.0K ./samba
8.0K ./squid
8.0K ./pm
8.0K ./conman.old
2.3G .
So the first thing we notice is that we’re told there’s new mail in root’s mailbox, while we run the command. That’s odd in and of itself, and likely points to an issue where there is in fact a large amount of outbound mail – perhaps that isn’t getting delivered. Or notices TO root (like an event log) that are not being cleared.
Ignoring that, let’s look at the /var/log/httpd folder:
[root@servernms1 log]# cd /var/log/httpd
[root@servernms1 httpd]# du . --max-depth=1 -h
2.1G .
2.1GB in one folder. Now we just run “ls -lah” to get the list of files:
[root@servernms1 httpd]# ls -lah
total 2.1G
drwx------ 2 root root 4.0K Oct 6 04:04 .
drwxr-xr-x 17 root root 4.0K Oct 10 04:03 ..
-rw-r--r-- 1 root root 422M Oct 10 15:35 access_log
-rw-r--r-- 1 root root 627M Oct 6 04:04 access_log.1
-rw-r--r-- 1 root root 235M Sep 29 04:02 access_log.2
-rw-r--r-- 1 root root 471M Sep 22 14:50 access_log.3
-rw-r--r-- 1 root root 343M Sep 15 04:02 access_log.4
-rw-r--r-- 1 root root 380K Oct 10 15:34 error_log
-rw-r--r-- 1 root root 24M Oct 6 04:04 error_log.1
-rw-r--r-- 1 root root 82K Sep 29 04:02 error_log.2
-rw-r--r-- 1 root root 852K Sep 22 14:50 error_log.3
-rw-r--r-- 1 root root 6.1M Sep 15 04:03 error_log.4
-rw-r--r-- 1 root root 0 Aug 25 04:02 ssl_access_log
-rw-r--r-- 1 root root 39K Aug 22 00:06 ssl_access_log.1
-rw-r--r-- 1 root root 70K Aug 18 03:56 ssl_access_log.2
-rw-r--r-- 1 root root 56K Aug 11 03:54 ssl_access_log.3
-rw-r--r-- 1 root root 70K Aug 4 03:54 ssl_access_log.4
-rw-r--r-- 1 root root 237 Oct 6 04:04 ssl_error_log
-rw-r--r-- 1 root root 237 Sep 29 04:02 ssl_error_log.1
-rw-r--r-- 1 root root 237 Sep 22 14:50 ssl_error_log.2
-rw-r--r-- 1 root root 711 Sep 22 13:43 ssl_error_log.3
-rw-r--r-- 1 root root 237 Sep 8 04:04 ssl_error_log.4
-rw-r--r-- 1 root root 0 Aug 25 04:02 ssl_request_log
-rw-r--r-- 1 root root 48K Aug 22 00:06 ssl_request_log.1
-rw-r--r-- 1 root root 87K Aug 18 03:56 ssl_request_log.2
-rw-r--r-- 1 root root 69K Aug 11 03:54 ssl_request_log.3
-rw-r--r-- 1 root root 87K Aug 4 03:54 ssl_request_log.4
I’m not sure we care about 100’s of MB of historical access_log.# – only the “access_log” is current. Let’s get rid of the extra ones.
[root@servernms1 httpd]# rm -rf access_log.?
You have new mail in /var/spool/mail/root
Again, this crazy “new mail”. Note – it says it is in “/var/spool/mail/root” – we identified “/var/spool” as a potential problem folder….
Still if we do a “du” again after deleting the files:
[root@servernms1 httpd]# du . --max-depth=1 -h
454M .
Down from 2.1GB to 454MB, a reduction of nearly 80%.
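Walking down one folder at a time with du got us here, but you can also ask directly for the big files. A minimal sketch, where the default folder and the 100M threshold are just example values:

```shell
#!/bin/sh
# Sketch: list regular files larger than a threshold under a folder,
# staying on one filesystem. The threshold uses find's -size syntax.
large_files() {
    dir=${1:-/var/log}
    min=${2:-100M}
    find "$dir" -xdev -type f -size +"$min" 2>/dev/null
}
```

large_files /var 100M on this server would have pointed at the rotated access_log files (and root's mailbox) in one pass. Have a look at what each file is before deleting anything.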
7) Now let’s check on /var/spool:
[root@servernms1 spool]# du . --max-depth=1 -h
8.0K ./repackage
8.0K ./lpd
32K ./anacron
2.3G ./mail
16K ./cron
7.4M ./clientmqueue
20K ./at
8.0K ./rwho
16K ./cups
8.0K ./vbox
64K ./news
8.0K ./samba
8.0K ./squid
52K ./mqueue
2.3G .
Shocking – /var/spool/mail is the only GB folder…..
[root@servernms1 spool]# cd /var/spool/mail
[root@servernms1 mail]# du . --max-depth=1 -h
2.3G .
[root@servernms1 mail]# ls -lah
total 2.3G
drwxrwxr-x 2 root mail 4.0K Oct 10 15:35 .
drwxr-xr-x 16 root root 4.0K May 11 2011 ..
-rw-rw---- 1 focusxi mail 0 Jan 30 2011 focusxi
-rw-rw---- 1 nagios mail 230K Apr 18 09:53 nagios
-rw------- 1 root root 2.3G Oct 10 15:35 root
-rw-rw---- 1 rpc mail 0 Jan 8 2010 rpc
-rw-rw---- 1 zuls mail 0 Jun 28 2011 zuls
So root has a 2.3GB Mail file.
8) So let’s try reading and emptying the root mailbox. Run the command “mail”….
[root@servernms1 mail]# mail
/var/spool/mail/root: File too large.
This, I expected. Not much we can do here. We could get into a long and boring HOWTO on how to troubleshoot the file, but unless it is critical, the best thing to do here is simply delete root’s mailbox file and move on. Once it is deleted, it will start growing again, and you can view it to see the mail inbound and deal with the source of the issue directly. Alternatively, you can “cat” and “grep” the file itself, as it is structured text, but expect it to take forever on a 2GB+ file.
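If you do want a quick idea of what is filling the mailbox before you clear it, here is a sketch of the “cat and grep” approach. It relies only on the standard mbox layout, where every message starts with a line beginning “From ”, and the defaults are just the path from this example.

```shell
#!/bin/sh
# Sketch: summarise an mbox file without opening it in a mail client.
# Prints the number of messages, then the most common Subject lines.
mbox_summary() {
    mbox=${1:-/var/spool/mail/root}
    top=${2:-10}
    # each message in an mbox starts with a line beginning "From "
    printf 'messages: %s\n' "$(grep -c '^From ' "$mbox")"
    grep '^Subject: ' "$mbox" | sort | uniq -c | sort -rn | head -n "$top"
}
```

The most frequent subjects usually name the culprit, such as a cron job that mails root on every run. As an alternative to deleting the file, emptying it in place with : > /var/spool/mail/root keeps its owner and permissions.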
9) Check to ensure you now have enough free space, with the “df” command:
[root@servernms1 mail]# df / -h
Filesystem Size Used Avail Use% Mounted on
/dev/mapper/VolGroup00-LogVol00
19G 13G 4.9G 72% /
HOWTO: Monitoring Dell C6100 IPMI with Nagios.
Recently, I’ve been working with Nagios for network monitoring. I have to admit, I came in rather biased, and was frustrated with it. My frustrations will best be covered in another post. In my home lab, however, I decided that I was going to make Nagios sing. This is the first HOWTO I’m doing, although really the first one should have been installing the VM and getting things running. I’ll do that one soon.
While this HOWTO is going to seem very long, once you get used to how to configure the basics, this all has a very nice rhythm to it. Is it better or worse than other monitoring apps? Maybe. But it is what it is – and I don’t think it’s that bad!
GOAL: Monitor IPMI (eg: SuperMicro IPMI, Dell C6100 Series IPMI, Dell IDRAC, IBM RMU, etc) via Nagios.
1) Find your Nagios plugin of choice. I did this by searching the Nagios Plugin Directory for Popularity. This brought me to WFISCHER’s IPMI Sensor Monitoring Plugin (http://exchange.nagios.org/directory/Plugins/Hardware/Server-Hardware/IPMI-Sensor-Monitoring-Plugin/details).
Click on the DOWNLOAD URL, and save the file somewhere – like your DOWNLOADS folder:
Next, unpack the file – it’s a TAR.GZ, with a TAR inside. So use 7Zip or something.
Open the TAR:
Extract this somewhere, such as D:\TEMP2:
2) Open NagiosXI and login.
Click on CONFIGURE on the top and then CORE CONFIG MANAGER on the left.
Click on MONITORING PLUGINS
Click BROWSE, locate the file “check_ipmi_sensor” in the folder above.
Then click UPLOAD PLUGIN.
The plugin is now showing as installed.
In the table in the bottom half of the window, confirm the file is present:
Again click CONFIGURE, and CORE CONFIG MANAGER. Then click APPLY SETTINGS so the uploaded file is now part of the configuration.
There we go, Nagios Core now knows about the config file.
3) Before we go any further, let’s take a look at the README file that came with the package:
Aha! Requirements!
The Nagios VM we downloaded is CentOS 6 based. So use PuTTY and SSH to the host – 10.0.0.150 in my case, and login as “root”, default password “nagiosxi” (you really should change this)
Let’s get FreeIPMI installed. Run “yum install freeipmi”:
In my case, I already have it installed. If it were not installed, it would say that it found the package and ask “Do you wish to install: Y/N” and you would answer yes.
Next, let’s get Perl IPC::Run installed. Run “yum install perl-IPC-Run”:
Same applies here.
NOTE: You may wish to do a general “yum update” and let it update all currently installed packages. That’s up to you, YMMV and if you break it, you bought it.
So now we have our pre-requisites installed.
4) Let’s test the plugin from the command line. Run “cd /usr/local/nagios/libexec”:
Okay, so the plugin IS in the plugins folder! Good.
Now run “./check_ipmi_sensor”:
Guess we’ll need to feed it some parameters. On my C6100, IPMI user is “root” and default password of “root”. (yeah, you should change that too). The priv level is USER or ROOT or something else, but USER is sufficient for read. You may want to create an IPMIuser account vs ROOT, choice is yours. My 4 C6100 nodes IPMI IP addresses are 10.0.0.241-244.
So run “./check_ipmi_sensor -H 10.0.0.241 -U root -P root -L user”
Look at that. It’s practically magic.
Now, we know that the pre-requisites are working and that the check command works from the Linux command line. So if it doesn’t work from here – it’s a Nagios problem!
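With four nodes to check, it is handy to run the same test against all of them in one go before touching the Nagios side. A sketch, assuming the plugin follows the usual Nagios exit codes (0 OK, 1 WARNING, 2 CRITICAL, anything else treated as UNKNOWN) and reusing the same flags as the command above:

```shell
#!/bin/sh
# Sketch: run the plugin against several IPMI addresses and print a
# one-line state for each, based on the plugin's exit code.
check_all_ipmi() {
    plugin=$1; user=$2; pass=$3
    shift 3                         # remaining arguments are the hosts
    for host in "$@"; do
        "$plugin" -H "$host" -U "$user" -P "$pass" -L user >/dev/null 2>&1
        case $? in
            0) state=OK ;;
            1) state=WARNING ;;
            2) state=CRITICAL ;;
            *) state=UNKNOWN ;;
        esac
        printf '%s %s\n' "$host" "$state"
    done
}
```

From the libexec folder that would be check_all_ipmi ./check_ipmi_sensor root root 10.0.0.241 10.0.0.242 10.0.0.243 10.0.0.244. Any node that does not come back OK is worth running by hand to see the full plugin output.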
5) Let’s start by creating a HOSTGROUP. HOSTGROUPS are used to group hosts together (like that?) so that you can manage them by group vs individually. The nice thing about this is say you decide to add a sensor – do you want to add it to 50 devices or 1 host group? I thought so.
Click on CONFIGURE, CORE CONFIG MANAGER. On the left under MONITORING, click HOST GROUP:
Here you can see the default host groups. We’re going to click ADD NEW.
We’re just going to give it a HOSTGROUP NAME and a DESCRIPTION. Note that on the left, we could MANAGE HOSTS and MANAGE HOSTGROUPS – but because we’re starting here, we have none of either. But Nagios is chicken-egg. We could add 40 hosts, then add a hostgroup, then when creating the hostgroup, add the 40 hosts to the hostgroup. Make sure that ACTIVE box is checked. Click SAVE.
And as it says, click APPLY CONFIGURATION to make the changes take effect.
Alright, now let’s go get some hosts!
6) Let’s configure us some Hosts and Services.
Click on CONFIGURE, CORE CONFIG MANAGER. On the left under MONITORING, click HOSTS:
Here you can see I’ve already configured two of the hosts. I’m going to configure the 3rd to show how this looks.
Click ADD NEW.
Enter a HOSTNAME (logical, not actual), ADDRESS (I’m using the IP address, as I realized I haven’t set up DNS names for the IPs yet, my bad), and DISPLAY NAME (probably best to use the same as HOSTNAME – whatever standard makes you happy).
Ensure that ACTIVE on the right is checked. Now, if you’re familiar with Nagios at all (mostly just a little), you’ll think “But….. what about the CHECK COMMAND? We need a check command!”. No, we don’t. Remember, we’re going to add all the services we want to monitor to the HOST GROUP!
Click on the CHECK SETTINGS tab:
Ensure that CHECK INTERVAL is set to something such as 5 minutes, RETRY INTERVAL (used after it fails the first check) to something like 1 minute, and MAX CHECK ATTEMPTS to 3-5 – whatever keeps you happy. If these are left empty, then later on you’ll get an error.
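To see how these settings interact, here’s a small worked sketch. It assumes the usual Nagios behaviour: after the first failed check, the host is rechecked every RETRY INTERVAL until MAX CHECK ATTEMPTS is reached, and only then is the problem treated as confirmed (a “hard” state). The function name is made up for illustration, and intervals are in minutes.

```python
def minutes_until_hard_state(retry_interval, max_check_attempts):
    """Minutes from the first failed check until the problem is confirmed.

    The first failure counts as attempt 1; each further attempt is made
    retry_interval minutes after the previous one.
    """
    if max_check_attempts < 1:
        raise ValueError("max_check_attempts must be at least 1")
    return retry_interval * (max_check_attempts - 1)


# Values suggested above: retry every 1 minute, 3 to 5 attempts.
for attempts in (3, 5):
    print(attempts, "attempts ->", minutes_until_hard_state(1, attempts), "minutes")
```

Keep in mind that the outage itself can go unnoticed for up to one CHECK INTERVAL (5 minutes here) before that first failed check happens.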
Click SAVE.
You’ll see that the DATABASE ENTRY was successfully updated. But the SYNC STATUS is SYNC MISSED. We need to APPLY CONFIGURATION – but let’s not do that just yet. Click on the icon to configure the host again.
This time, let’s click on MANAGE HOSTGROUPS.
On the left, under HOSTGROUPS, find the previously created HOSTGROUP “server-hardware” and click ADD SELECTED. Then click CLOSE. Then click SAVE.
We’ve now added the HOST to the HOSTGROUP. We’re not going to configure anything individually on the HOST, we’re going to do it all by HOSTGROUPS.
Here you can see the SYNC MISSED for all 3 hosts, as I’ve added them all to the HOSTGROUP behind the scenes.
Click APPLY CONFIGURATION.
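Behind the GUI, all of this ends up as ordinary Nagios object definitions. The sketch below renders roughly what the hostgroup and one member host might look like. Treat it as an illustration only: `render_object` is a made-up helper, the host name, address and alias text are assumptions (only “server-hardware” and the 10.0.0.241-244 range come from this walkthrough), and the files XI actually writes will carry more directives.

```python
def render_object(kind, directives):
    """Render one Nagios object definition as text."""
    width = max(len(name) for name in directives)
    lines = ["define %s {" % kind]
    for name, value in directives.items():
        lines.append("    %s  %s" % (name.ljust(width), value))
    lines.append("}")
    return "\n".join(lines)


hostgroup_cfg = render_object("hostgroup", {
    "hostgroup_name": "server-hardware",
    "alias": "Server hardware (IPMI)",  # assumed to hold the DESCRIPTION; text is made up
})

host_cfg = render_object("host", {
    "host_name": "NW-ESXI03-IDRAC",   # hypothetical third host
    "address": "10.0.0.243",          # assumed from the 10.0.0.241-244 range
    "hostgroups": "server-hardware",  # membership set via MANAGE HOSTGROUPS
    "check_interval": 5,
    "retry_interval": 1,
    "max_check_attempts": 3,
})

print(hostgroup_cfg)
print(host_cfg)
```

Note there is no check command anywhere on the host; everything we monitor will hang off the hostgroup.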
7) Next, in Nagios, click on HOME -> QUICK FIND and enter a substring of “NW-ESX”:
You’ll see a suggestion list pop up, but just click GO:
So what you see here on the first two is that I set them up previously WITH a check command on the host for PING. Ignore this. But what you see is that the two new ones I’ve added show PENDING. And they’ll never get beyond PENDING, as there is no check.
8) Click on ADMIN -> CORE CONFIG MANAGER -> COMMANDS -> COMMANDS:
Here I HAVE already configured the command, but let’s click ADD NEW to simulate what it would look like.
<snip>
So here we want to:
Enter the COMMAND NAME. This is the same command you ran at the command line – “check_ipmi_sensor”. Note that sometimes this might have an extension, such as “check_ipmi_sensor.sh” or “check_ipmi_sensor.pl”, etc. Ours does NOT.
On the COMMAND LINE enter “$USER1$/check_ipmi_sensor” – this is always going to be the case. $USER1$ is the plugin folder. The same rule about watching for a file extension applies.
The other parameters should look familiar based on the command line. -U -P -L relate to USER/PASS/LEVEL. Click SAVE.
Click APPLY CONFIGURATION.
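If the $USER1$ and $ARGn$ business feels abstract, here’s a rough sketch of the substitution Nagios performs when it runs a check. The command line shown is my assumed reconstruction (the real one is in the snipped screenshot), `expand_command` is a made-up helper, and real Nagios handles many more macros than this.

```python
import re


def expand_command(command_line, check_command, macros):
    """Expand $ARGn$ and other $NAME$ macros in a command definition.

    check_command uses the Nagios form "name!arg1!arg2!...".
    """
    name, *args = check_command.split("!")
    values = dict(macros)
    for position, arg in enumerate(args, start=1):
        values["ARG%d" % position] = arg

    def substitute(match):
        # Unknown macros expand to an empty string in this sketch.
        return values.get(match.group(1), "")

    return name, re.sub(r"\$([A-Z0-9_]+)\$", substitute, command_line)


# Assumed command definition; the argument order matches the command-line test.
command_line = "$USER1$/check_ipmi_sensor -H $HOSTADDRESS$ -U $ARG1$ -P $ARG2$ -L $ARG3$"
macros = {
    "USER1": "/usr/local/nagios/libexec",  # the plugin folder used earlier
    "HOSTADDRESS": "10.0.0.241",
}
name, expanded = expand_command(command_line, "check_ipmi_sensor!root!root!user", macros)
print(expanded)
```

The expanded string is the same command we ran by hand, which is the whole point: the command definition is just a template for it.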
9) Click on CONFIGURE -> CORE CONFIG MANAGER -> MONITORING -> SERVICES:
No services are defined. So let’s click ADD NEW.
Enter the CONFIG NAME and DESCRIPTION. I don’t know that either of these really matters, but I’ve chosen to name them the same as the command. Enter a DISPLAY NAME; this is what you’ll see in the HOSTS/SERVICES list.
Change the CHECK COMMAND to “check_ipmi_sensor” from the list and check ACTIVE. You’ll note the COMMAND VIEW shows the same details we entered in the previous COMMAND configuration. I made a mistake and used ARG2/ARG3/ARG4 thinking HOSTNAME was ARG1, but it doesn’t matter, as long as the variables you put into the ARGs match their place in the command line.
Click TEST CHECK COMMAND:
Enter an IP address of a sample host, and click OK
Looks like what we got at the command line. Nice. Click CLOSE.
Now click MANAGE HOSTGROUPS:
Click the “server-hardware” hostgroup, click ADD SELECTED and click CLOSE.
Click on the CHECK SETTINGS tab:
Same as for hosts, ensure CHECK INTERVAL, RETRY INTERVAL and MAX CHECK ATTEMPTS are filled in. Click SAVE.
Can you guess what we do now? Click APPLY CONFIGURATION.
10) If you go back to the Nagios window (I keep a NAGIOS and a NAGIOS ADMIN tab open), and click HOME -> QUICK FIND, enter “NW-ESX” and click GO:
You see all 4 of our hosts suddenly have a service! And they’re all pending. Given a little bit of time, they’ll start to check:
Click on the NW-ESXI01-IDRAC CHECK_IPMI_SENSOR service that shows IPMI STATUS: OK
Well that’s boring. I was hoping for more detail. Maybe click PERFORMANCE GRAPHS:
(I had to change the zoom level to get more detail on the screen).
Oh would you look at that. So our one sensor is multi-channeled. We get all our sensors in one polling. It also creates a chart for each of them. That’s pretty handy, so we can now trend our fan/temp/etc.
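The reason one check can feed many charts is the plugin output format: the human-readable status, then a “|”, then a list of label=value performance data items. Here’s a minimal sketch of pulling those apart. The sample line is made up for illustration (your sensor names and values will differ), and `parse_plugin_output` is a hypothetical helper that ignores the optional warn/crit/min/max fields.

```python
import re

# One perfdata item: 'label'=value[unit], optionally followed by ;warn;crit;min;max
PERF_ITEM = re.compile(r"(?:'([^']+)'|([^\s=']+))=(-?\d+(?:\.\d+)?)([^;\s]*)")


def parse_plugin_output(line):
    """Split plugin output into (status text, {label: (value, unit)})."""
    text, _, perfdata = line.partition("|")
    metrics = {}
    for quoted, bare, value, unit in PERF_ITEM.findall(perfdata):
        metrics[quoted or bare] = (float(value), unit)
    return text.strip(), metrics


# Made-up sample in the standard "text | perfdata" layout.
sample = "IPMI Status: OK | 'Ambient Temp'=21.00 'FAN 1'=4200.00 'FAN 2'=4320.00"
status, metrics = parse_plugin_output(sample)
print(status)
for label, (value, unit) in metrics.items():
    print(label, value, unit)
```

Each label becomes its own data series, which is why you get a chart per sensor from a single poll.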
So what we have done so far is:
· Upload a new plugin.
· Install plugin dependencies
· Test the plugin at the command line to verify it works outside of Nagios
· Create a hostgroup
· Create hosts and add them to a hostgroup
· Create a command from the plugin
· Create a service tied to the command
· Add the service to a hostgroup – which automatically adds it to all hosts in the hostgroup.
· Verify that the service shows all of a host’s sensors, not just one, and logs the historical detail for each.
To further demonstrate how hostgroups and services work, let’s add another service – just a basic PING service.
11) Click on CONFIGURE -> CORE CONFIG MANAGER -> MONITORING -> SERVICES:
Click ADD NEW
Change the CHECK COMMAND to “check_xi_host_ping”, which is pre-defined. Check ACTIVE. Note that the command wants ARG1-ARG4. These are the thresholds for “-w” (warning level) and “-c” (critical level); each level takes a pair of values, typically a round-trip time in ms and a packet loss percentage. Let’s say that 3,5 and 10,20 indicate those levels. Enter the CONFIG NAME, DESCRIPTION (which I again make match the CHECK COMMAND) and then a DISPLAY NAME.
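As a rough illustration of how a warning pair and a critical pair turn into a state, here’s a sketch. It assumes the four arguments are (round-trip ms, packet loss %) pairs and that a threshold counts as tripped when the measured value meets or exceeds it; `classify_ping` is a made-up name, not anything from the plugin itself.

```python
def classify_ping(rta_ms, loss_pct, warn=(3, 5), crit=(10, 20)):
    """Return a Nagios state for a ping result.

    warn and crit are (round-trip ms, packet loss %) pairs, i.e.
    ARG1,ARG2 and ARG3,ARG4 in this walkthrough.
    """
    if rta_ms >= crit[0] or loss_pct >= crit[1]:
        return "CRITICAL"
    if rta_ms >= warn[0] or loss_pct >= warn[1]:
        return "WARNING"
    return "OK"


# (round-trip ms, packet loss %) samples, made up for illustration.
for rta, loss in [(0.4, 0), (4.2, 0), (1.0, 25)]:
    print(rta, loss, classify_ping(rta, loss))
```

With thresholds this tight, anything beyond a quiet LAN will start warning, so loosen them if your hosts sit across a WAN link.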
Click TEST CHECK COMMAND:
Click OK:
Looks good here. Click CLOSE.
DON’T CLICK SAVE YET! If you do, and you haven’t modified the CHECK SETTINGS tab, the APPLY CONFIGURATION will bitch :) Click the CHECK SETTINGS tab:
Again, make sure that CHECK INTERVAL=5, RETRY INTERVAL=1 and MAX CHECK ATTEMPTS=3. Note that INITIAL STATE can be set to W(arning), C(ritical), O(K) or U(nknown). Might want to set that to O.
Click back on COMMON SETTINGS.
Click on MANAGE HOSTGROUPS.
You could click on the “*” and click ADD SELECTED. It’s reasonable to assume we want a PING sensor on EVERY HostGroup, yes? If you instead selected only the 3 hostgroups listed here and later added a 4th, net-new hostgroup, it would not have this PING sensor. For now, though, let’s just add it to our SERVER-HARDWARE hostgroup: select it and click ADD SELECTED. Click CLOSE. NOW click SAVE :)
Let’s click APPLY CONFIGURATION.
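To make the difference between picking named hostgroups and picking “*” concrete, here’s a small sketch of how services fan out to hosts. It is a simplified model, not Nagios’s actual resolution code; `services_for_host`, the second host name and the “network-gear” hostgroup are all made up for the example.

```python
def services_for_host(host, hostgroup_members, service_assignments):
    """List the services a host inherits through its hostgroups.

    hostgroup_members: {hostgroup: [hosts]}
    service_assignments: {service: [hostgroups]}, where "*" means every hostgroup.
    """
    my_groups = {g for g, hosts in hostgroup_members.items() if host in hosts}
    inherited = []
    for service, groups in service_assignments.items():
        applies_to_all = "*" in groups
        if my_groups and (applies_to_all or my_groups & set(groups)):
            inherited.append(service)
    return sorted(inherited)


hostgroup_members = {"server-hardware": ["NW-ESXI01-IDRAC", "NW-ESXI02-IDRAC"]}
service_assignments = {
    "check_ipmi_sensor": ["server-hardware"],  # tied to one named hostgroup
    "Ping": ["*"],                             # tied to every hostgroup
}
print(services_for_host("NW-ESXI01-IDRAC", hostgroup_members, service_assignments))

# A hostgroup added later picks up the "*" service but not the named one.
hostgroup_members["network-gear"] = ["NEW-SWITCH01"]  # hypothetical
print(services_for_host("NEW-SWITCH01", hostgroup_members, service_assignments))
```

That second result is the argument for “*” on truly universal checks like PING, and for named hostgroups on anything hardware-specific like the IPMI sensor.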
12) By now you’ll be familiar with: If you go back to the Nagios window (I keep a NAGIOS and a NAGIOS ADMIN tab open), and click HOME -> QUICK FIND, enter “NW-ESX” and click GO:
Look at that. All the hosts in the hostgroup are now checking Ping as well :)
And moments later show all okay.
Here you can see how the SERVICE DESCRIPTION “check_xi_host_ping” works. If we go back and change that to just “Ping”:
And then click SAVE, and APPLY, then come back to the HOSTS view:
Ta-da!
I’m going to go through all the same steps, without displaying them, and add an HTTPS sensor, as the IPMI cards are web manageable. We want to know if the WebUI on them should happen to die.
And look at that.
So as you can see, HOSTGROUPS and SERVICE/SERVICEGROUPS are key to making NAGIOS really sing. I have NOT touched on ALERTS, CONTACTS, ALERT PERIODS, etc. For now, let’s worry about if we can get Nagios *monitoring* what we want.