What has me a little concerned is that if someone went into the web
interface on the main server and, say, scheduled downtime or disabled
notifications, the backup server would never know about it. In the event
of a failure, people could find themselves getting alerts for a host that
should have been in scheduled downtime (as it was on the main server).
While I realize I would not want to capture and retransmit *all*
external commands to the backup host, if I could somehow get at them I
could filter what gets sent to the backup host (e.g. "ignore most
commands, but pass a few like downtime or host notifications").
Is there any mechanism that allows me to do this? As I understand it
the global host and service events really only capture check results --
they're not going to fire if someone schedules downtime.
I have the same dilemma. I don't think Nagios was designed for multiple
web interface servers and anything you do to try to fix it may be too
hackish (as you're about to see). I wonder how the Nagios-based commercial
apps handle this.
I have written a set of scripts that will scour the Nagios log files and
relay selected commands to the backup. The setup is complicated and
I'm starting to think it is the wrong way to go but I'll present it for
On the main server there is a cron job that runs every two minutes:
*/2 * * * * nagios /usr/lib64/nagios/plugins/relay_to_secondary
The script basically does this:
[ -f "/var/log/nagios/rw/stop_relay" ] && exit 0
The file /var/log/nagios/rw/last_relayed keeps track of the location
in the log file(s) of the last line checked for relaying. The
"--update-start" option updates /var/log/nagios/rw/last_relayed with
the last line checked this run.
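
To make the bookkeeping concrete, here is a rough sketch of one way such a
bookmark file could work. It is not the actual script: it assumes a single
log file and stores a byte offset, whereas the real setup tracks a position
across several log files. The function name and parameters are made up for
illustration.

```python
import os

def read_new_lines(log_path, state_path, update_start=False):
    """Return complete lines added to log_path since the saved offset.

    state_path plays the role of last_relayed; storing a byte offset in
    it is an assumption of this sketch.
    """
    try:
        with open(state_path) as state:
            offset = int(state.read().strip() or 0)
    except (FileNotFoundError, ValueError):
        offset = 0

    # If the log is now shorter than the bookmark (for example after
    # rotation), start again from the top of the new file.
    if offset > os.path.getsize(log_path):
        offset = 0

    with open(log_path, "rb") as log:
        log.seek(offset)
        data = log.read()

    # Only consume whole lines; a partially written last line is left
    # for the next run.
    end = data.rfind(b"\n") + 1
    lines = data[:end].decode("utf-8", "replace").splitlines()

    if update_start:
        with open(state_path, "w") as state:
            state.write(str(offset + end))
    return lines
```

Leaving update_start off gives a dry run: the same lines come back on the
next call because the bookmark has not moved.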
The "relay_commands" script is attached at the end of this message. It
parses the log files for (a configurable set of) commands to relay to
the backup, then connects to the backup over ssh (with passwordless keys)
and writes these commands to the Nagios external command pipe.
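
For anyone who wants the gist without reading the attachment, the filtering
step might look roughly like the sketch below. It is an illustration, not
the attached script. It assumes external commands are being logged
(log_external_commands=1 in nagios.cfg) so that they show up as
"EXTERNAL COMMAND:" lines; the whitelist contents and the function name are
made up for the example.

```python
import re

# Hypothetical whitelist; list whichever commands you want mirrored.
RELAY_COMMANDS = {
    "SCHEDULE_HOST_DOWNTIME",
    "SCHEDULE_SVC_DOWNTIME",
    "DISABLE_HOST_NOTIFICATIONS",
    "ENABLE_HOST_NOTIFICATIONS",
}

# Matches log lines such as:
#   [1278612345] EXTERNAL COMMAND: SCHEDULE_HOST_DOWNTIME;web1;...
LOG_LINE = re.compile(r"^\[(\d+)\] EXTERNAL COMMAND: ([A-Z0-9_]+)(;.*)?$")

def select_commands(log_lines, allowed=RELAY_COMMANDS):
    """Return whitelisted commands in external command file format.

    Each result looks like "[timestamp] COMMAND;arg1;arg2", which is
    the line format the external command pipe expects.
    """
    selected = []
    for line in log_lines:
        match = LOG_LINE.match(line.rstrip("\n"))
        if match is None:
            continue  # not an external command (alerts, check results...)
        timestamp, name, args = match.groups()
        if name in allowed:
            selected.append("[%s] %s%s" % (timestamp, name, args or ""))
    return selected
```

Each returned string, followed by a newline, is what would be sent over ssh
and written to the command pipe on the backup.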
Under normal operation everything is OK. However, if the main or backup
server goes down, there is additional work besides enabling/disabling
notifications that needs to be done in the event handler.
When the backup goes down you don't want to relay commands, so the event
handler on the main will create the /var/log/nagios/rw/stop_relay file.
When the backup comes back you want to start relaying commands again, so
the event handler on the main will delete /var/log/nagios/rw/stop_relay.
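
The stop-file part of the event handler boils down to something like the
sketch below. The function name is hypothetical, and reacting only to HARD
states is an assumption of this sketch, not something the setup above
requires.

```python
import os

STOP_FILE = "/var/log/nagios/rw/stop_relay"  # path used by the cron job

def handle_backup_state(state, state_type, stop_file=STOP_FILE):
    """Create or remove the stop file based on the backup's check state."""
    if state_type != "HARD":
        return  # assumption: wait for a confirmed (hard) state change
    if state in ("OK", "UP"):
        # Backup is reachable again: allow relaying to resume.
        try:
            os.remove(stop_file)
        except FileNotFoundError:
            pass  # already removed, nothing to do
    else:
        # Backup is down: create the file so the cron job exits early.
        open(stop_file, "a").close()
```

An event handler command would pass in the state and state type, for
example via the $SERVICESTATE$ and $SERVICESTATETYPE$ macros (or
$HOSTSTATE$ and $HOSTSTATETYPE$ for a host check).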
When the main goes down, the event handler on the backup
gets the last line in the log file and writes it to a file <foo>.
When the main comes back, the event handler on the backup relays
the commands back to the main using the file <foo> as the starting point.
But wait, there's more! I would like to relay acks/comments and their
deletion as well. However, the "delete comment" command takes an ID
number. If your comments are not exactly synchronized then the wrong
one will be deleted.
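
A toy model shows how the IDs drift apart. This is not Nagios code; it only
assumes that each server hands out comment IDs from its own local counter,
and the class and comment texts are made up.

```python
class ToyCommentStore:
    """Toy stand-in for one server's comment list (not Nagios code)."""

    def __init__(self):
        self.next_id = 1
        self.comments = {}

    def add(self, text):
        comment_id = self.next_id
        self.comments[comment_id] = text
        self.next_id += 1
        return comment_id

    def delete(self, comment_id):
        # Returns the text that was removed, or None if the ID is unknown.
        return self.comments.pop(comment_id, None)


main, backup = ToyCommentStore(), ToyCommentStore()

# A comment that exists only on the backup puts the counters out of step.
backup.add("note entered directly on the backup")

# The same ack is added on the main and relayed to the backup.
main_id = main.add("ack: disk full on web1")
backup.add("ack: disk full on web1")

# Someone deletes the ack on the main, and the delete is relayed using
# the main's ID.
removed_on_main = main.delete(main_id)
removed_on_backup = backup.delete(main_id)

print(removed_on_main)
print(removed_on_backup)
```

The main removes the ack, but on the backup the same ID belongs to the
unrelated note, so the note is removed and the ack stays.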
Mark Wagner <***@u.washington.edu>
System Administrator, UW Medicine IT Services