Scalable Monitoring
Distributed Monitoring
Introduction
The op5 Monitor backend can easily be configured as a distributed monitoring solution. The distributed model looks like this.

In the distributed monitoring solution
•all configuration is done at the Master
•all new configuration is distributed to the pollers
•each poller is responsible for its own host group (Site).
•the Master has all the status information
Before we start
There are a few things you need to take care of before you can start setting up a distributed monitoring solution. You need to make sure
•you have at least two op5 Monitor servers of the same architecture up and running.
•op5 Monitor >=5.2 is installed and running on both machines.
•you have opened the following TCP ports for communication between the servers:
-15551, op5 Monitor backend communication port
-22, ssh (for the configuration sync).
•both servers can be resolved in DNS.
•the host group that the poller will be responsible for is added to the master configuration, and that at least one host is added to that host group.
The configuration
Setting up the new distributed monitoring solution
This distributed configuration will have one master and one poller:
•master01
•poller01
The poller will be monitoring the host group gbg.
During the setup we will use the command:
mon
The mon command is used to make life a bit easier when it comes to setting up a distributed monitoring solution. To get more detailed information about the mon command, execute it like this:
mon --help
To setup a distributed monitoring solution with one poller
1 Log in to the master over ssh, as root.
2 Add the new poller to the configuration with the following command:
mon node add poller01 type=poller hostgroup=gbg
3 Create and exchange ssh keys between the master and the poller, as the root user:
mon sshkey push --all
mon sshkey fetch --all
4 Add master01 as master at poller01:
mon node ctrl --type=poller -- mon node add master01 type=master
5 Set up the configuration sync:
dir=/opt/monitor/etc/oconf
conf=/opt/monitor/etc/nagios.cfg
mon node ctrl -- sed -i /^cfg_file=/d $conf
mon node ctrl -- sed -i /^log_file=/acfg_dir=$dir $conf
mon node ctrl -- mkdir -m 775 $dir
mon node ctrl -- chown monitor:apache $dir
6 Verify that you have an empty configuration on poller01:
mon node ctrl -- mon oconf hash
This will give you a hash looking like this (the “da39” hash, i.e. the sha1 hash of an empty configuration):
da39a3ee5e6b4b0d3255bfef95601890afd80709
7 Now push the configuration to the poller:
mon oconf push
8 Restart and push the logs from master01 to poller01:
mon restart; sleep 3; mon log push
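When the setup is done, you may want to verify that the poller is connected and handling its checks. A quick sanity check, run from the master, is the node status command described later in this chapter:
mon node status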
Adding a new poller
In this instruction we will add a new poller to our distributed solution. Here we have the following hosts:
•master01
•poller01
•poller02 (This is the new one.)
To add a new poller
1 Log in to the master over ssh, as root.
2 Add the new poller to the configuration with the following command:
mon node add poller02 type=poller hostgroup=gbg
3 Create and add ssh keys for the root user:
mon sshkey push poller02
mon sshkey fetch poller02
4 Add master01 as master at poller02:
mon node ctrl poller02 -- mon node add master01 type=master
5 Set up the configuration sync:
dir=/opt/monitor/etc/oconf
conf=/opt/monitor/etc/nagios.cfg
mon node ctrl poller02 -- sed -i /^cfg_file=/d $conf
mon node ctrl poller02 -- sed -i /^log_file=/acfg_dir=$dir $conf
mon node ctrl poller02 -- mkdir -m 775 $dir
mon node ctrl poller02 -- chown monitor:apache $dir
6 Verify that you have an empty configuration on poller02:
mon node ctrl poller02 -- mon oconf hash
This will give you a hash looking like this (the “da39” hash, i.e. the sha1 hash of an empty configuration):
da39a3ee5e6b4b0d3255bfef95601890afd80709
7 Now push the configuration to the poller:
mon oconf push
8 Restart and push the logs from master01 to poller02:
mon restart; sleep 3; mon log push
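To verify that poller02 was added to the configuration, you can, for example, list the configured pollers from the master:
mon node list --type=poller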
Adding a new host group to a poller
You might want to add another host group to a poller. To do that you need to edit the merlin.conf file; there is currently no mon command for this.
To add new host group to a poller
1 Open up and edit /opt/monitor/op5/merlin/merlin.conf.
2 Add a new host group in the hostgroup line like this:
hostgroup = gbg,sth,citrix_servers
Remember not to put any spaces between the host group names and the commas. A complete example of the poller section is shown after this procedure.
3 Restart monitor on the poller
mon restart
4 Send over the new configuration to the poller
mon oconf push
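In context, the poller section in merlin.conf could then look something like the sketch below. The address and port values are placeholders for your actual poller settings.
poller poller01 {
    address = <ip>
    port = <port>
    hostgroup = gbg,sth,citrix_servers
}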
Removing a poller
In this instruction we will remove a poller called:
poller01
The poller will be removed from the master configuration and all distributed configuration on the poller will also be removed.
To remove a poller
1 Log in to the master over ssh, as root.
2 Deactivate and remove all distributed setup on the poller host.
mon node ctrl poller01 -- mon node remove master01
3 Restart monitor on the poller.
mon node ctrl poller01 -- mon restart
4 Remove the poller from the master configuration.
mon node remove poller01
5 Restart monitor on the master.
mon restart
Master takeover
If a poller goes down, the default configuration is for the master to take over all the checks from the poller. For this to work, all hosts monitored from the poller must also be monitorable from the master.
If the master server should not take over the checks from the poller, this can be set in the merlin configuration file.
To stop the master from taking over, edit the file /opt/monitor/op5/merlin/merlin.conf.
Add the following line to the section for the poller that the master should not take over:
takeover = no
Note that this is done per poller.
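For example, a poller section with takeover disabled could look something like this (the address, port and hostgroup values are placeholders):
poller poller01 {
    address = <ip>
    port = <port>
    hostgroup = gbg
    takeover = no
}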
File synchronization
To synchronize files from the master server to the poller, add a sync section to the file /opt/monitor/op5/merlin/merlin.conf.
In the example below we will synchronize the htpasswd.users file to the poller “poller01”:
poller poller01 {
    address = <ip>
    port = <port>
    contact_group = <contactgroup>
    sync {
        /opt/monitor/etc/htpasswd.users /opt/monitor/etc/htpasswd.users
    }
}
Note that this is done per poller.
One way connections
If a poller is behind some kind of firewall or on a NAT address, it might not be possible for the master server to connect to it.
To tell the master not to connect to the poller and instead let the poller open the session, we need to add an option to the file /opt/monitor/op5/merlin/merlin.conf.
Under the section for the poller that the master should not try to connect to, add the following:
connect = no
Example
In the example below we have a master “master01” that cannot connect to “poller01”, but “poller01” is allowed to connect to “master01”.
poller poller01 {
    address = <ip>
    port = <port>
    contact_group = <contactgroup>
    connect = no
}
It is also possible to set this option on the poller instead; in that case the master will always initiate the session.
Recovery
After a poller has been unavailable to the master (for example because of a network outage), the report data will be synced from the poller to the master.
The report data on the poller will overwrite the data on the master system.
More information
For more information and a more complex example, please take a look at the howto in the git repository of the Merlin open source project.
Load balanced monitoring
Introduction
The op5 Monitor backend can easily be used as a load balanced monitoring solution. The load balanced model looks like this.

The load balanced solution
•has two or more peers sharing the same task (the hosts to monitor)
•allows configuration at any of the peers
•makes sure that all new configuration is distributed to the peers
•divides the load automatically between the peers
•makes sure that when one peer goes down, the other(s) take over its job.
Before we start
There are a few things you need to take care of before you can start setting up a load balanced monitoring solution. You need to make sure
•you have at least two op5 Monitor servers of the same architecture up and running.
•op5 Monitor >=5.2 is installed and running on both machines.
•you have opened the following TCP ports for communication between the servers:
-15551, op5 Monitor backend communication port
-22, ssh (for the configuration sync).
•both servers can be resolved in DNS or via the hosts file (/etc/hosts).
The configuration
Setting up the load balanced solution
This load balanced configuration will have two so-called peers:
•peer01
•peer02
During the setup we will use the command:
mon
The mon command is used to make life a bit easier when it comes to setting up a load balanced solution. To get more detailed information about the mon command, execute it like this:
mon --help
To setup a load balanced monitoring solution
1 Log in to one of the systems over ssh, as root.
2 Add the second peer to the configuration with the following command:
mon node add peer02 type=peer
3 Create and exchange ssh keys to and from the second peer, as the root user:
mon sshkey push --all
mon sshkey fetch --all
4 Add peer01 as a peer at peer02
mon node ctrl peer02 -- mon node add peer01 type=peer
5 Perform the initial configuration sync:
mon oconf push
6 Restart and push the logs from peer01 to peer02:
mon restart; sleep 3; mon log push
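Since peers share the same object configuration, the configuration hash should now normally be identical on both peers. A quick way to compare them from peer01:
mon oconf hash
mon node ctrl peer02 -- mon oconf hash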
Adding a new peer
In this instruction we will have the following hosts:
•peer01
•peer02
•peer03 (This is the new one.)
To add a new peer
1 Log in to peer01 as root over ssh.
2 Add the new peer to the configuration on peer01:
mon node add peer03 type=peer
3 Get all ssh keys in place
mon sshkey push --all
mon sshkey fetch --all
4 Add the peers to each other:
mon node ctrl peer02 -- mon node add peer03 type=peer
mon node ctrl peer03 -- mon node add peer02 type=peer
mon node ctrl peer03 -- mon node add peer01 type=peer
5 Manually push the op5 Monitor object configuration to the new peer:
mon oconf push
6 Restart monitor on peer01 and send the configuration to all peers again.
mon restart ; sleep 3 ; mon oconf push
Removing a peer
In this instruction we will remove a peer called:
peer02
The peer will be removed from all other peers configurations.
To remove a peer
1 Log in to peer01 as root over ssh.
2 Remove all peer configuration from peer02
mon node ctrl peer02 -- mon node remove peer01
mon node ctrl peer02 -- mon node remove peer03
3 Restart monitor on peer02
mon node ctrl peer02 -- mon restart
4 Remove peer02 from the rest of the peers, in this case peer03
mon node ctrl --type=peer -- mon node remove peer02
5 Restart the rest of the peers, in this case only peer03
mon node ctrl --type=peer -- mon restart
6 Remove peer02 from the host you are working from.
mon node remove peer02
7 Restart monitor on the host you are working from.
mon node ctrl -- mon restart
File synchronization
To synchronize files between servers, add a sync section to the file /opt/monitor/op5/merlin/merlin.conf.
In the example below we will synchronize the htpasswd.users file to the peer “peer01”:
peer peer01 {
    address = <ip>
    port = <port>
    sync {
        /opt/monitor/etc/htpasswd.users /opt/monitor/etc/htpasswd.users
    }
}
Note that this is done per peer.
More information
For more information and a more complex example, please take a look at the howto in the git repository of the Merlin open source project.
Merlin
About
Merlin is the backend engine for a load balanced and/or distributed setup of op5 Monitor.
Merlin, or Module for Effortless Redundancy and Load balancing In Nagios, allows the op5 Monitor processes to exchange information directly, as an alternative to the standard Nagios way of using NSCA.
Merlin functions as a backend for Ninja by adding support for storing the status information in a database, fault tolerance and load balancing. This means that Merlin is now responsible for providing status data and acts as a backend for the Ninja GUI.
Merlin components
merlin-mod
merlin-mod is responsible for jacking into the NEBCALLBACK_* calls and sending them to a socket.
If the socket is not available the events are written to a backlog and sent when the socket is available again.
merlind
The Merlin daemon listens to the socket that merlin-mod writes to and sends all received events either to a database of your choice (using libdbi) or to another Merlin daemon.
If the daemon is unsuccessful in this, it writes the events to a backlog and sends the data later.
merlin database
This is a database that includes Nagios object status and status changes. It also contains comments, scheduled downtime etc.
Illustration
This picture illustrates the components described above.
The mon command
About
The mon command is a very powerful command that comes with Merlin.
It is this command that is used to set up a distributed or a load balanced environment.
This command can also be used to control the other op5 Monitor servers.
| The mon command is very powerful. Handle with care! It has the power to both create and destroy your whole op5 installation. |
The commands
To use the mon command just type
# mon
The command should be used with one category and one sub-category. Only start, stop and restart categories can be used without any sub-category.
Start
# mon start
This will start the op5 monitor process on the node that you run the command from.
Stop
# mon stop
This will stop the op5 monitor process on the node you run the command from.
Restart
#mon restart
This will restart the op5 monitor process on the node you run the command from.
Ascii
Ninja
# mon ascii ninja
This will display the ninja logo in ascii art.
Merlin
# mon ascii merlin
This will display the merlin logo in ascii art.
Check
Spool
# mon check spool [--maxage=<seconds>] [--warning=X] [--critical=X] <path> [--delete]
Checks a certain spool directory for files (and files only) that are older than 'maxage'. It's intended to prevent buildup of checkresult files and unprocessed performance-data files in the various spool directories used by op5 Monitor.
--delete | Causes too old files to be removed. |
--maxage | Is given in seconds and defaults to 300 (5 minutes). |
<path> | May be 'perfdata' or 'checks', in which case directory names will be taken from op5 defaults |
--warning and --critical | Have no effect if '--delete' is given and will otherwise specify threshold values. |
| Only one directory at a time may be checked. |
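A usage example, checking the perfdata spool for files older than ten minutes (the threshold values are only illustrative):
# mon check spool --maxage=600 --warning=5 --critical=10 perfdata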
Cores
# mon check cores --warning=X --critical=X [--dir=]
Checks for memory dumps resulting from segmentation violations in core parts of op5 Monitor. Detected core-files are moved to /tmp/mon-cores in order to keep working directories clean.
--warning | Default is 0 |
--critical | Default is 1 (any corefile results in a critical alert) |
--dir | Lets you specify more paths to search for corefiles. This option can be given multiple times. |
--delete | Deletes corefiles not coming from 'merlind' or 'monitor'. |
Distribution
#mon check distribution [--no-perfdata]
Checks to make sure distribution works ok.
| Note that it's not expected to work properly the first couple of minutes after a new machine has been brought online or taken offline |
Exectime
# mon check exectime [host|service] --warning=<min,max,avg> --critical=<min,max,avg>
Checks execution time of active checks.
[host|service] | Select host or service execution time. |
--warning | Set the warning threshold for min,max and average execution time, in seconds |
--critical | Set the critical threshold for min,max and average execution time, in seconds |
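A usage example, checking service execution times with illustrative thresholds for min, max and average execution time (in seconds):
# mon check exectime service --warning=1,30,5 --critical=5,60,10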
Latency
# mon check latency [host|service] --warning=<min,max,avg> --critical=<min,max,avg>
Checks latency time of active checks.
[host|service] | Select host or service latency time. |
--warning | Set the warning threshold for min,max and average latency, in seconds |
--critical | Set the critical threshold for min,max and average latency, in seconds |
Orphans
#mon check orphans
Checks for checks that haven't been run in too long a time.
db
cahash
Calculates a hash of all entries in the contact_access table. This is really only useful for debugging purposes. The check does not block execution of other scripts or checks.
Fixindexes
Fixes indexes on merlin tables containing historical data.
| Don't run this tool unless you're asked to by op5 support staff or told to do so by a message during an rpm or yum upgrade. |
ecmd
Search
# mon ecmd search <regex>
Prints 'templates' for all available commands matching <regex>.
The search is case insensitive.
Submit
# mon ecmd submit [options] command <parameters>
Submits a command to the monitoring engine using the supplied values.
Available options:
--pipe-path=</path/to/nagios.cmd>
Example:
An example command to add a new service comment for the service PING on the host foo would look something like this:
# mon ecmd submit add_svc_comment service='foo;PING' persistent=1 author='John Doe' comment='the comment'
Note how services are written. You can also use positional arguments, in which case the arguments have to be in the correct order for the command's syntactic template. The above example would then look like this:
# mon ecmd submit add_svc_comment 'foo;PING' 1 'John Doe' 'the comment'
Log
Fetch
# mon log fetch [--incremental=<timestamp>]
Fetches logfiles from remote nodes and stashes them in a local path, making them available for the 'sortmerge' command.
Import
# mon log import [--fetch]
This command runs the external log import helper.
If --fetch is specified, logs are first fetched from remote systems and sorted using the merge sort algorithm provided by the sortmerge command.
Purge
#mon log purge
Remove log files that are no longer in use.
| Currently only deletes stale RRD files. |
Push
#mon log push
Pushes logfiles to the configured remote nodes. This is the counterpart of the 'fetch' command and is used, for example, when setting up a new poller or peer.
Show
#mon log show
Runs the showlog helper program. Arguments passed to this command will get sent to the showlog helper.
For further help about the show category use:
#mon log show --help
Sortmerge
#mon log sortmerge [--since=<timestamp>]
Runs a mergesort algorithm on logfiles from multiple systems to create a single unified logfile suitable for importing into the reports database.
Node
Add
# mon node add <name> type=[peer|poller|master] [var1=value] [varN=value]
Adds a node with the designated type and variables.
Ctrl
#mon node ctrl <name1> <name2> [--self] [--all|--type=<peer|poller|master>] -- <command>
Execute <command> on the remote node(s) named. --all means run it on all configured nodes, as does making the first argument '--'.
--type=<types> means to run the command on all configured nodes of the given type(s).
The first argument that is not understood marks the start of the command, but always using double dashes is recommended. Use single-quotes to execute commands with shell variables, output redirection or scriptlets, like so:
# mon node ctrl -- '(for x in 1 2 3; do echo $x; done) > /tmp/foo'
# mon node ctrl -- cat /tmp/foo
List
#mon node list [--type=poller,peer,master]
Lists all nodes of the (optionally) specified type
Remove
#mon node remove <name1> [name2] [nameN]
Removes one or more nodes from the merlin configuration.
Show
#mon node show [--type=poller,peer,master]
Display all variables for all nodes, or for one node in a fashion suitable for being used as eval $(mon node show nodename) from shell scripts and scriptlets.
Status
#mon node status
Show status of all nodes configured in the running Merlin daemon.
Red text points to problem areas, such as high latency or the node being inactive, not handling any checks, or not sending regular enough program_status updates.
oconf
Changed
#mon oconf changed
Print last modification time of all object configuration files
Fetch
#mon oconf fetch
Fetches the configuration from a master; this is executed on a poller system. Useful when the poller can talk to the master but not vice versa.
Files
#mon oconf files
Print the configuration files in alphabetical order
Hash
#mon oconf hash
Print sha1 hash of running configuration
HGlist
#mon oconf hglist
Print a sorted list of all configured hostgroups
Nodesplit
#mon oconf nodesplit
Same as 'split', but uses merlin's configuration to split the object configuration into files suitable for poller consumption.
Pull
#mon oconf pull
(documentation missing)
Push
#mon oconf push
Splits the configuration based on merlin's peer and poller configuration and sends the object configuration to all peers and pollers, restarting those that receive a configuration update. SSH keys need to be set up for this to be usable without admin supervision.
This command uses 'nodesplit' as its backend.
Split
#mon oconf split <outfile:hostgroup1,hostgroup2,hostgroupN>
Write config for hostgroup1,hostgroup2 and hostgroupN into outfile.
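A hypothetical example, writing the object configuration for the host groups gbg and sth to the file /tmp/pollers.cfg:
# mon oconf split /tmp/pollers.cfg:gbg,sth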
SSHKey
Fetch
#mon sshkey fetch
Fetches all the SSH keys from peers and pollers.
Push
#mon sshkey push
Pushes the local SSH keys to all peers and pollers.
Sysconf
Ramdisk
# mon sysconf ramdisk
A ramdisk can be enabled for storing the spools for performance data and checkresults. By storing these spools on a ramdisk, the disk I/O is lowered significantly.
To enable the ramdisk:
# mon sysconf ramdisk
Test
| All commands in this category can potentially overwrite configuration, enable or disable monitoring and generate notifications. Do NOT use these commands in a production environment. |
Dist
#mon test dist [options]
Tests various aspects of event forwarding with any number of hosts, services, peers and pollers, generating config and creating databases for the various instances and checking event distribution among them.
For a complete list of options, run
#mon test --help
Pasv
# mon test pasv [options]
Submits passive checkresults to the nagios.cmd pipe and verifies that the data gets written to the database correctly and in a timely manner.
For a complete list of options, run
#mon test --help
| This command will disable active checks on your system and have other side-effects as well. |
VRRP
About
VRRP can be used in this setup to provide one DNS name and one IP address that are primarily linked to one of the master servers. If the primary master for some reason becomes unavailable, VRRP will automatically detect this and direct you to the secondary master.
Setup
To enable VRRP on your master servers, follow the steps below.
In this example we have two masters that we want to use VRRP with.
The VRRP IP will be 192.168.1.3 and we will bind that IP to the network interface eth0.
The IP address and interface have to be changed to match your network configuration.
| If you already use VRRP in your network, make sure that you use the correct virtual_router_id. |
Edit /etc/keepalived/keepalived.conf
On the “primary” master
vrrp_instance VI_1 {
    state MASTER
    interface eth0
    virtual_router_id 51
    priority 200
    advert_int 1
    virtual_ipaddress {
        192.168.1.3 dev eth0
    }
}
On the “secondary” master
vrrp_instance VI_1 {
    state BACKUP
    interface eth0
    virtual_router_id 51
    priority 100
    advert_int 1
    virtual_ipaddress {
        192.168.1.3 dev eth0
    }
}
Activate VRRP
To activate VRRP, run the following command:
# chkconfig keepalived on
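Note that chkconfig only enables keepalived at boot. If the service is not already running, you may also need to start it manually, for example:
# service keepalived start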