Nagios On-Demand Macros and cluster service

This post will address two points in Nagios :

  1. Cluster service
  2. On-Demand Macros

This short guide explains how to monitor clusters of services.
Imagine a host running several services (service1, service2 .. serviceN), some of which monitor the same application in different ways.
Example:

service appli : raises an alert if appli is not running
service appli perf : raises an alert if the performance of appli is bad (> critical threshold)
service appli load : raises an alert if the memory or CPU load of appli is bad (> critical threshold)

The problem is that when the load is bad, performance is bad too, so two alerts are raised for a single problem.
We want to raise only one alert when one or more of these services are in a critical state. Let’s create a cluster to do that!

Plugin check_cluster

We have to use the plugin check_cluster (http://nagiosplugins.org/man/check_cluster) :

Usage: check_cluster (-s | -h) -d val1[,val2,...,valn] [-l label]
[-w threshold] [-c threshold] [-v] [--help]

Options:
 --extra-opts=[section][@file]
    Read additionnal options from ini file
 -s, --service
    Check service cluster status
 -h, --host
    Check host cluster status
 -l, --label=STRING
    Optional prepended text output (i.e. "Host cluster")
 -w, --warning=THRESHOLD
    Specifies the range of hosts or services in cluster that must be in a
    non-OK state in order to return a WARNING status level
 -c, --critical=THRESHOLD
    Specifies the range of hosts or services in cluster that must be in a
    non-OK state in order to return a CRITICAL status level
 -d, --data=LIST
    The status codes of the hosts or services in the cluster, separated by
    commas
 -v, --verbose
    Show details for command-line debugging (Nagios may truncate output)

Examples:

# Will alert critical if more than 1 service is in a non-OK state.

$ /usr/lib64/nagios/plugins/check_cluster -s -l "my service cluster" -c 1 -d 0,0,0,0
CLUSTER OK: my service cluster: 4 ok, 0 warning, 0 unknown, 0 critical

# Will alert critical if more than 1 service is in a non-OK state.

$ /usr/lib64/nagios/plugins/check_cluster -s -l "my service cluster" -c 1 -d 0,1,0,0
CLUSTER OK: my service cluster: 3 ok, 1 warning, 0 unknown, 0 critical

# "@1" is the range @0:1, so this alerts critical if 0 or 1 services are in a non-OK state (inside {0 .. 1}). To alert on 1 or more non-OK services, use -c @1: instead.

$ /usr/lib64/nagios/plugins/check_cluster -s -l "my service cluster" -c @1 -d 0,1,0,0
CLUSTER CRITICAL: my service cluster: 3 ok, 1 warning, 0 unknown, 0 critical
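
The first two examples can be reproduced with a small model. This is a minimal sketch, not the plugin's implementation: it assumes a plain integer critical threshold N (meaning "more than N non-OK services") and does not handle ranges such as @1. The function name summarize_cluster is hypothetical.

```python
from collections import Counter

# Nagios state IDs as used in the -d list.
STATE_NAMES = {0: "ok", 1: "warning", 2: "critical", 3: "unknown"}

def summarize_cluster(label, data, critical_max):
    """Model a service cluster check with a plain integer -c threshold.

    data is the comma-separated list of state IDs passed to -d.
    critical_max is a plain integer N: the result is CRITICAL only when
    more than N services are in a non-OK state.
    """
    counts = Counter(STATE_NAMES[int(code)] for code in data.split(","))
    non_ok = counts["warning"] + counts["unknown"] + counts["critical"]
    status = "CRITICAL" if non_ok > critical_max else "OK"
    return (f"CLUSTER {status}: {label}: {counts['ok']} ok, "
            f"{counts['warning']} warning, {counts['unknown']} unknown, "
            f"{counts['critical']} critical")

print(summarize_cluster("my service cluster", "0,0,0,0", 1))
print(summarize_cluster("my service cluster", "0,1,0,0", 1))
```

Both calls report CLUSTER OK, because at most one service is non-OK and the threshold only trips above 1.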

Threshold format

See http://nagiosplug.sourceforge.net/developer-guidelines.html#THRESHOLDFORMAT for the threshold format and examples:

This is the generalised format for ranges:
[@]start:end

Notes:

  1. start ≤ end
  2. start and “:” are not required if start=0
  3. if range is of format “start:” and end is not specified, assume end is infinity
  4. to specify negative infinity, use “~”
  5. alert is raised if metric is outside start and end range (inclusive of endpoints)
  6. if range starts with “@”, then alert if inside this range (inclusive of endpoints)

Note: Not all plugins are coded to expect ranges in this format yet. There will be some work in providing multiple metrics.

Table 3. Example ranges

Range definition    Generate an alert if x…
10                  < 0 or > 10 (outside the range of {0 .. 10})
10:                 < 10 (outside {10 .. ∞})
~:10                > 10 (outside the range of {-∞ .. 10})
10:20               < 10 or > 20 (outside the range of {10 .. 20})
@10:20              ≥ 10 and ≤ 20 (inside the range of {10 .. 20})
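
The range rules are easy to get wrong, so here is a minimal sketch of them in Python. It follows the six notes listed earlier; the function names parse_range and alerts are hypothetical, and this is not the code the plugins themselves use.

```python
import math

def parse_range(spec):
    """Parse a '[@]start:end' range into (inside, start, end)."""
    inside = spec.startswith("@")
    if inside:
        spec = spec[1:]
    if ":" in spec:
        start_text, end_text = spec.split(":", 1)
    else:
        # start and ':' omitted, so start is 0
        start_text, end_text = "0", spec
    if start_text == "~":
        start = -math.inf                      # '~' is negative infinity
    else:
        start = float(start_text) if start_text else 0.0
    end = float(end_text) if end_text else math.inf   # 'start:' has no upper bound
    if start > end:
        raise ValueError(f"start must be <= end in range {spec!r}")
    return inside, start, end

def alerts(spec, value):
    """Return True if value should raise an alert for the given range."""
    inside, start, end = parse_range(spec)
    in_range = start <= value <= end           # endpoints are inclusive
    return in_range if inside else not in_range

for spec, value in [("10", 11), ("10:", 9), ("~:10", 11), ("10:20", 15), ("@10:20", 15)]:
    print(spec, value, alerts(spec, value))
```

With this model, alerts("@1", 1) and alerts("@1", 0) are both True while alerts("@1", 2) is False, which is why -c @1 reports CRITICAL for a single non-OK service in the check_cluster example.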

Monitoring Service Clusters

Let’s say you have three DNS servers that provide redundant services on your network. First off, you need to be monitoring each of these DNS servers separately before you can monitor them as a cluster. We’ll assume that you already have three separate services (all called “DNS Service”) associated with your DNS hosts (called “host1”, “host2” and “host3”).

In order to monitor the services as a cluster, you’ll need to create a new “cluster” service. However, before you do that, make sure you have a service cluster check command configured. Let’s assume that you have a command called check_service_cluster defined as follows:

define command {
        command_name    check_service_cluster
        command_line    /usr/lib64/nagios/plugins/check_cluster --service -l $ARG1$ -w $ARG2$ -c $ARG3$ -d $ARG4$
}

Now you’ll need to create the “cluster” service and use the check_service_cluster command you just created as the cluster’s check command. We need to pass the states of all services in the cluster as $ARG4$. This is where on-demand macros come in.

Nagios on-demand macros

If you would like to reference values for another host or service in a command (for which the command is not being run), you can use what are called “on-demand” macros. On-demand macros look like normal macros, except for the fact that they contain an identifier for the host or service from which they should get their value. Here’s the basic format for on-demand macros:

  • $HOSTMACRO:host_name$
  • $SERVICEMACRO:host_name:service_description$

Note that the macro name is separated from the host or service identifier by a colon (:). For on-demand service macros, the service identifier consists of both a host name and a service description – these are separated by a colon (:) as well.

Examples of on-demand host and service macros follow:

$HOSTDOWNTIME:myhost$
$SERVICESTATEID:novellserver:DS Database$

Let’s use the on-demand macros for our cluster.

The example below will generate a CRITICAL alert if 2 or more services in the cluster are in a non-OK state, and a WARNING alert if only 1 of the services is in a non-OK state. If all the individual service members of the cluster are OK, the cluster check will return an OK state as well.

define service {
        ...
        check_command   check_service_cluster!"DNS Cluster"!0!1!$SERVICESTATEID:host1:DNS Service$,$SERVICESTATEID:host2:DNS Service$,$SERVICESTATEID:host3:DNS Service$
        ...
}

It is important to notice that we are passing a comma-delimited list of on-demand service state macros to the $ARG4$ macro in the cluster check command. That’s important! Nagios will fill those on-demand macros in with the current service state IDs (numerical values, rather than text strings) of the individual members of the cluster.
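
The macro list does not have to be typed by hand. Here is a minimal sketch that builds the check_command value from a list of host names; the function names are hypothetical, and the host and service names are the ones from the DNS example.

```python
def service_state_macros(hosts, service_description):
    """Build the comma-delimited on-demand macro list passed as $ARG4$."""
    return ",".join(
        f"$SERVICESTATEID:{host}:{service_description}$" for host in hosts
    )

def cluster_check_command(label, warning, critical, hosts, service_description):
    """Assemble the check_command value for check_service_cluster."""
    data = service_state_macros(hosts, service_description)
    return f'check_service_cluster!"{label}"!{warning}!{critical}!{data}'

print(cluster_check_command(
    "DNS Cluster", 0, 1, ["host1", "host2", "host3"], "DNS Service"
))
```

The generated line still grows with every host added to the cluster.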

But imagine you have thousands of hosts and services to add to your cluster: filling in $ARG4$ becomes tricky.

Can we use something like $SERVICESTATEID:$HOSTNAME$:service name$? Unfortunately, that doesn’t work.

So we will have to use a trick:

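Instead, we will use an on-demand group macro, $SERVICESTATEID:servicegroup_name:delimiter$. It returns the state ID of every service in the service group, joined by the delimiter given in the last field. With a comma as the delimiter, $SERVICESTATEID:servicegroup_name:,$ returns a list such as 0,1,0,2, which is the format check_cluster expects. Any other delimiter (a service name, for example) would have to be turned into commas before the list reaches check_cluster.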
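
In the Nagios documentation, the last field of an on-demand group macro is the delimiter placed between the values of the group members. The sketch below is a simplified model of that expansion, not Nagios code; the function name and the state list are hypothetical.

```python
def expand_group_macro(member_states, delimiter):
    """Model the expansion of $SERVICESTATEID:servicegroup_name:delimiter$.

    member_states holds the state IDs of the services in the group
    (hypothetical sample data here).
    """
    return delimiter.join(str(state) for state in member_states)

states = [0, 1, 0, 2]
print(expand_group_macro(states, ","))           # 0,1,0,2
print(expand_group_macro(states, "my_service"))  # 0my_service1my_service0my_service2
```

Only the first form can be handed to check_cluster -d as it is.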

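We then wrap check_cluster in a small plugin script that receives the label, the critical threshold and the data list as arguments. The script below passes the list through unchanged, so the list must already be comma-separated to meet the check_cluster format: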

Create a new plugin called /usr/lib64/nagios/plugins/check_servicecluster

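#!/bin/bash
# $1 = label, $2 = critical threshold, $3 = comma-separated list of state IDs
# The arguments are quoted so that a label containing spaces stays a single word.
/usr/lib64/nagios/plugins/check_cluster -s -l "$1" -c "$2" -d "$3"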

Now define a Nagios command :

define command {
        command_name                    check_servicecluster
        # $ARG1$ = the critical threshold
        # $ARG2$ = the data list
        command_line                    /usr/lib64/nagios/plugins/check_servicecluster "Services Cluster description" $ARG1$ $ARG2$
}
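
To see what the wrapper script receives, it helps to trace how the arguments of a check_command end up in the command line. The sketch below is a simplified model: it splits on "!" without handling escaped exclamation marks, leaves all other macros untouched, and assumes that a missing argument expands to an empty string. The function name expand_args and the sample data list 0,1,0,2 are hypothetical.

```python
import re

def expand_args(check_command, command_line):
    """Split check_command on '!' and fill the $ARGn$ macros in command_line."""
    name, *args = check_command.split("!")

    def fill(match):
        index = int(match.group(1)) - 1        # $ARG1$ is the first argument
        return args[index] if 0 <= index < len(args) else ""

    return name, re.sub(r"\$ARG(\d+)\$", fill, command_line)

name, line = expand_args(
    "check_servicecluster!1!0,1,0,2",
    '/usr/lib64/nagios/plugins/check_servicecluster '
    '"Services Cluster description" $ARG1$ $ARG2$',
)
print(name)
print(line)
```

The wrapper is therefore called with the label as its first argument, the critical threshold 1 as its second and the data list as its third.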

Define a Nagios service group for the services you want to monitor and declare it in the service definition :

define servicegroup{
        servicegroup_name               my_servicegroup
        alias                           my_servicegroup
}
define service{
...
        hostgroup_name                  my_hostgroup       ; a hostgroup with several hosts
        service_description             my_service         ; the service name
        servicegroups                   my_servicegroup    ; declare this service part of my_servicegroup group
...
}

And now define the service cluster to monitor all my_service services :

define service {
         use                             active-service
         hostgroup_name                  my_hostgroup
         normal_check_interval           1
         service_description             my_servicecluster
         servicegroups                   my_servicegroup
         # declare the check command with critical threshold = 1 (CRITICAL when more than 1 service is non-OK)
         # the on-demand group macro returns the states of all services in my_servicegroup, separated by commas:
         check_command                   check_servicecluster!1!$SERVICESTATEID:my_servicegroup:,$
         contact_groups                  my_contactgroup
         register                        1
}

So now, even with hundreds of hosts in the service group, all their services are included in the cluster with a single on-demand macro!

bmailhe