[Nagios-users] Strange problem with plugins timing out

Discussion:

Sean Carolan

2009-01-29 20:13:21 UTC

We have two Nagios servers each for monitoring different networks.
The production network has over 1200 service checks and the average
host check time is around 4 seconds:

Host Check Execution Time: 4.03 / 4.15 / 4.039 sec

The UAT network has only 120 checks. For some reason, starting
yesterday we have seen a huge spike in the average Host Check
Execution Time:

Host Check Execution Time: 4.03 / 24.09 / 16.236 sec

This is causing all sorts of false alarms. I tried to log onto the
server and run some checks from the command line and indeed, the
check_ping plugin runs really, really slow. The odd thing is that if
I just do a standard "ping hostname" it's nice and fast. We have not
changed or updated anything on this Nagios server, nor are we seeing
any kind of elevated CPU usage.

Has anyone else experienced anything like this? I'm not sure where to
look to start troubleshooting the problem.

Sean Carolan

2009-01-30 13:42:57 UTC

Post by Sean Carolan
Has anyone else experienced anything like this? I'm not sure where to
look to start troubleshooting the problem.

I was able to alleviate the service check alarms by increasing my
check_by_ssh plugin timeout to 30 seconds. However, I'm still getting
timeouts on ping tests, in spite of the -t 60 flag on my check_ping
command.

Additional Info:

CRITICAL - Plugin timed out after 10 seconds

Here's how my check_ping command is configured:

command_line $USER1$/check_ping -H $HOSTADDRESS$ -w 3000.0,80% -c 5000.0,100% -p 5 -t 60

Can anyone help me? Why is it still timing out after ten seconds,
even when I have explicitly set the timeout at 60 seconds??

Sean Carolan

2009-01-30 14:59:18 UTC

Post by Sean Carolan
command_line $USER1$/check_ping -H $HOSTADDRESS$ -w 3000.0,80% -c
5000.0,100% -p 5 -t 60
Can anyone help me? Why is it still timing out after ten seconds,
even when I have explicitly set the timeout at 60 seconds??

The FAQ page seems to indicate that the default timeout for most
plugins is 10 seconds. Why are my plugins still timing out even
though I have "-t 60"?

<quote>
First you need to identify where the timeout is occurring. Most
plugins time out after 10 seconds of not being able to contact a
service (FTP, HTTP, etc). If the plugins are timing out after a short
period of time, increase the timeout value for the plugin using the
appropriate command line argument for that plugin. This may be done in
either the command definition, or in individual service definitions.

In addition to plugins having timeouts, Nagios enforces its own
timeout value on all service checks that run. By default, this is set
to 30 seconds. If the plugin executes for more than 30 seconds, Nagios
will automatically kill it off and return a critical error for that
service. If you see entries in the log file that say a service check
timed out, this may be your problem. You can adjust the maximum
timeout value for service checks by using the service_check_timeout
directive in the main configuration file.

As a side note, there are also directives in the main config file for
setting the maximum timeout for host checks, notifications, event
handlers, and the ocsp command.
</quote>
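
To make the layering in that FAQ excerpt concrete, here is a minimal Python sketch of how one might reason about which timeout fires first and which layer produced a given message. The function names are hypothetical. The plugin wording is taken from the error quoted in this thread; the "Check Timed Out" wording for the Nagios side is an assumption and should be verified against your own logs.

```python
def first_timeout(plugin_timeout, nagios_timeout, wrapper_timeout=None):
    """Return (layer, seconds) for whichever timeout would fire first.

    plugin_timeout  -- the plugin's own -t value (FAQ: usually 10 s by default)
    nagios_timeout  -- e.g. service_check_timeout from the main config file
    wrapper_timeout -- optional timeout of a wrapper such as check_by_ssh
    """
    layers = [("plugin", plugin_timeout), ("nagios", nagios_timeout)]
    if wrapper_timeout is not None:
        layers.append(("wrapper", wrapper_timeout))
    # On a tie, min() keeps the earliest entry in the list.
    return min(layers, key=lambda item: item[1])


def timeout_source(output):
    """Guess which layer wrote a timeout message.

    ASSUMPTION: plugins report "Plugin timed out" (as in this thread) and
    Nagios itself reports something containing "Check Timed Out".
    """
    text = output.lower()
    if "plugin timed out" in text:
        return "plugin"
    if "check timed out" in text:
        return "nagios"
    return "unknown"


if __name__ == "__main__":
    print(first_timeout(plugin_timeout=60, nagios_timeout=30))
    print(timeout_source("CRITICAL - Plugin timed out after 10 seconds"))
```

Under those assumptions, a "Plugin timed out after 10 seconds" message would point at the plugin's own timer rather than at Nagios, which would suggest that the -t 60 is not reaching the check_ping that actually runs.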

Marc Powell

2009-01-30 18:44:37 UTC

Post by Sean Carolan

The FAQ page seems to indicate that the default timeout for most
plugins is 10 seconds. Why are my plugins still timing out even
though I have "-t 60"?

A few possible reasons --
- you're not editing the right command (i.e. there are multiple
check_ping definitions)
- your master nagios timeout is set to 10 seconds (unlikely given
your ssh check success)
- you're not restarting nagios after changing it
- you have multiple nagios daemons running, one or more with the old
config.
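
The first possibility in the list above (more than one check_ping definition) can be checked mechanically. The following is a rough Python sketch, not a full Nagios config parser: it assumes the usual "define command{ ... }" object syntax with one directive per line, and the helper names are made up for illustration.

```python
import re
from collections import defaultdict

DEFINE_COMMAND = re.compile(r"define\s+command\s*\{")


def command_definitions(text):
    """Yield (command_name, command_line) for each 'define command' block."""
    inside = False
    fields = {}
    for raw in text.splitlines():
        line = raw.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and comment lines
        if not inside:
            if DEFINE_COMMAND.match(line):
                inside, fields = True, {}
            continue
        if line.startswith("}"):
            inside = False
            if "command_name" in fields:
                yield fields["command_name"], fields.get("command_line", "")
            continue
        parts = line.split(None, 1)  # directive name, then the rest of the line
        if len(parts) == 2:
            fields[parts[0]] = parts[1]


def duplicate_commands(sources):
    """sources maps a file name to its text; return commands defined more than once."""
    seen = defaultdict(list)
    for file_name, text in sources.items():
        for name, command_line in command_definitions(text):
            seen[name].append((file_name, command_line))
    return {name: defs for name, defs in seen.items() if len(defs) > 1}
```

Feeding it the text of every object config file that nagios.cfg pulls in would list each command name that appears more than once, together with the file and command_line of each copy, so it is easy to see which copy lacks the -t 60.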

--
Marc

Thomas Guyot-Sionnest

2009-01-31 05:22:36 UTC

Post by Sean Carolan

The FAQ page seems to indicate that the default timeout for most
plugins is 10 seconds. Why are my plugins still timing out even
though I have "-t 60"?

Regarding your ping check, it may be trying to perform reverse DNS
lookups, and when those time out the check takes a long time. Try fixing
the DNS if you can, or use check_icmp instead (the plugin must be setuid
root).
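
One way to test the reverse DNS theory is to time a reverse lookup of the monitored address from the Nagios server. This is a small Python sketch under the assumption that the system resolver behaves the same for Python as it does for the plugin; the function name is hypothetical and the resolver argument exists only so the lookup can be swapped out.

```python
import socket
import time


def timed_reverse_lookup(address, resolver=socket.gethostbyaddr):
    """Return (seconds, hostname or None) for a reverse DNS lookup of address."""
    start = time.monotonic()
    try:
        hostname = resolver(address)[0]
    except OSError:
        # socket.herror and socket.gaierror are both subclasses of OSError.
        hostname = None
    return time.monotonic() - start, hostname


if __name__ == "__main__":
    # Replace with the address of a host whose ping check is slow.
    seconds, name = timed_reverse_lookup("127.0.0.1")
    print(f"{seconds:.2f}s -> {name}")
```

If the lookup takes several seconds or fails only after a long wait, slow or broken reverse DNS is a plausible cause of the slow checks.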

Besides the plugin timeouts, you also have the Nagios check timeouts
(nagios.cfg) and, when running plugins through NRPE or SSH, the
check_nrpe or check_by_ssh timeout.

Hope this helps,

--
Thomas