Discussion:
[Nagios-users] Distributed monitoring: central collector doesn't seem to be able to run active checks
C. Bensend
2013-08-22 11:51:07 UTC
Hey folks,

I'm continuing to iron out the wrinkles with 3.5.1 and distributed
monitoring. I'm using mod_gearman to submit and receive events from
two distributed pollers.

Every now and again, I'll get something similar to this in the log on the
centralized collecting machine:

CRITICAL: Return code of 127 is out of bounds. Make sure the plugin
youre trying to run actually exists. (worker: collector.domain.org)

To me, that suggests that the collector system didn't get a result
for a host or service in a timely manner from one of the polling
systems, and so it attempted to run an active check itself. However,
it doesn't seem to be able to, and I don't know why.

The collector has the same value for $USER1$, and it has the same
set of plugins installed on it:

On the collector:

grep USER1 etc/resource.cfg
$USER1$=/usr/local/nagios/libexec

On the two pollers:

$USER1$=/usr/local/nagios/libexec
$USER1$=/usr/local/nagios/libexec

The plugins are installed in identical locations on all three systems;
that's enforced via Puppet. The 'nagios' user can find and run them on
the collector:

/usr/local/nagios/libexec/check_nrpe -H 127.0.0.1
NRPE v2.13

Now, because this is a distributed setup, the collector system is
not configured to run active checks:

grep ^execute etc/nagios.cfg
execute_service_checks=0
execute_host_checks=0

... but *obviously* it's trying to. Is it failing because it's
configured to not run them? If that's the case, the error message is
not accurate and should be corrected. If that's *not* the case, why
can't my collector server run an active check when it believes it needs
to?

I use NConf to generate my configurations, if that matters. There are
a *lot* of hosts/services and quite a few configuration files, so I'm not
going to paste a slew of information here. If I'm missing pertinent
information, please let me know exactly what you want to see and I'll
get it.

I'd really appreciate a clue-by-four. Thanks, folks! :)

Benny
--
"No matter how tempted I am with the prospect of unlimited power, I
will not consume any energy field bigger than my head."
-- #22 on Peter Anspach's Evil Overlord list
C. Bensend
2013-08-28 11:48:09 UTC
No one has an idea about this? And no, Andreas, I can't move to
4.0 yet. ;)

Thanks!

Benny
Justin Pryzby
2013-08-28 11:54:03 UTC
Do you get many of those error messages in the logs at once, or just
one at a time?

Only one thought: what are the permissions on the file that defines
your $USERn$ macros? Nagios on my systems calls setuid() to a non-root
user after startup, and if it gets a SIGHUP to reload its config but
can't read the file defining $USER*$, it will act strangely.

Justin
C. Bensend
2013-08-28 12:08:52 UTC
Post by Justin Pryzby
Do you get many of those error messages in the logs at once, or just
one at a time?
Only one thought: what are the permissions on the file that defines
your $USERn$ macros? Nagios on my systems calls setuid() to a non-root
user after startup, and if it gets a SIGHUP to reload its config but
can't read the file defining $USER*$, it will act strangely.
Just one at a time, seemingly randomly. A host here, a service there,
several times a day. They always recover almost immediately, but I
don't understand why my centralized collector seems to have this issue.

Nagios runs as the nagios user, which can read the resource.cfg file
fine:

ls -ld . ; ls -l nagios-hostname.cfg resource.cfg
drwxrwx--- 6 root nagios 4096 Aug 27 16:02 .
-rw-r--r-- 1 root root 47606 Jul 1 11:18 nagios-hostname.cfg
-rw-r----- 1 root nagios 2400 Mar 19 11:25 resource.cfg
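
The listing above can be checked mechanically. The following is a minimal sketch, not part of the original post: it applies the usual owner/group/other rule to the three mode strings shown, assuming the nagios user belongs to the nagios group (the thread does not state this). The helper name `allowed` is hypothetical, and special bits (setuid, sticky) and root's override are deliberately ignored.

```python
from typing import Set

# Offsets of the owner, group, and other permission triplets in an
# `ls -l` mode string such as "-rw-r-----".
CLASS_OFFSET = {"owner": 1, "group": 4, "other": 7}
PERM_OFFSET = {"r": 0, "w": 1, "x": 2}


def allowed(mode: str, owner: str, group: str,
            user: str, user_groups: Set[str], perm: str) -> bool:
    """Return True if `user` holds `perm` ('r', 'w' or 'x') on the entry.

    Exactly one class applies: owner first, then group, then other.
    Hypothetical helper; ignores root and special permission bits.
    """
    if user == owner:
        cls = "owner"
    elif group in user_groups:
        cls = "group"
    else:
        cls = "other"
    return mode[CLASS_OFFSET[cls] + PERM_OFFSET[perm]] == perm


# Assumption: the nagios user is a member of the nagios group.
NAGIOS_GROUPS = {"nagios"}

# Mode, owner, group, and name copied from the listing in the post.
listing = [
    ("drwxrwx---", "root", "nagios", "."),
    ("-rw-r--r--", "root", "root", "nagios-hostname.cfg"),
    ("-rw-r-----", "root", "nagios", "resource.cfg"),
]

for mode, owner, group, name in listing:
    # A directory must be searchable (x); a file must be readable (r).
    perm = "x" if mode.startswith("d") else "r"
    ok = allowed(mode, owner, group, "nagios", NAGIOS_GROUPS, perm)
    verb = "enter" if perm == "x" else "read"
    print(f"{name}: nagios {'can' if ok else 'cannot'} {verb}")
```

Under that group-membership assumption, all three entries are accessible to the nagios user, which matches the poster's statement that resource.cfg is readable.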

Thanks!
Sven Nierlein
2013-08-28 12:26:24 UTC
Post by C. Bensend
CRITICAL: Return code of 127 is out of bounds. Make sure the plugin
youre trying to run actually exists. (worker: collector.domain.org)
Hi,

If this is the collector host, why does it have a mod-gearman worker
installed? If Nagios had run the check itself, there would be no mention
of the worker in the error. So it seems there is a worker running on your
collector host which grabs some checks but isn't able to execute them.
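
The "(worker: ...)" suffix is the clue here. As a rough sketch, the snippet below pulls that suffix out of a check's output so results can be grouped by the worker that produced them. It assumes the suffix looks exactly like the log line quoted in this thread; the function name `worker_of` and the regular expression are illustrative, not anything mod-gearman provides.

```python
import re
from typing import Optional

# Matches a trailing "(worker: <hostname>)" as seen in the quoted log line.
# Assumption: the suffix is always the last thing in the plugin output.
WORKER_SUFFIX = re.compile(r"\(worker:\s*(?P<host>[^)\s]+)\)\s*$")


def worker_of(plugin_output: str) -> Optional[str]:
    """Return the worker hostname named in the output, or None if absent."""
    match = WORKER_SUFFIX.search(plugin_output)
    return match.group("host") if match else None


# The message from the original post, joined onto one line.
sample = (
    "CRITICAL: Return code of 127 is out of bounds. Make sure the plugin "
    "youre trying to run actually exists. (worker: collector.domain.org)"
)

print(worker_of(sample))
print(worker_of("NRPE v2.13"))
```

Output without a worker suffix yields `None`, which under Sven's reasoning would point at a check that did not pass through a worker.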

Regards,
Sven
--
Sven Nierlein ***@consol.de
ConSol* GmbH http://www.consol.de
Franziskanerstrasse 38 Tel.:089/45841-439
81669 Muenchen Fax.:089/45841-111
C. Bensend
2013-08-28 12:43:28 UTC
Post by Sven Nierlein
Post by C. Bensend
CRITICAL: Return code of 127 is out of bounds. Make sure the plugin
youre trying to run actually exists. (worker: collector.domain.org)
Hi,
If this is the collector host, why does it have a mod-gearman worker
installed? If Nagios had run the check itself, there would be no mention
of the worker in the error. So it seems there is a worker running on your
collector host which grabs some checks but isn't able to execute them.
Oh ho! I have multiple *gearman* processes running:

ps axuwwwwww | grep gearman
gearmand 5662 0.7 0.1 404672 2496 ? Ssl Aug17 118:29 /usr/sbin/gearmand -d -l /var/log/gearmand/gearmand.log
nagios 5712 0.0 0.0 38024 640 ? Ss Aug17 1:03 /usr/bin/mod_gearman_worker -d --config=/etc/mod_gearman/mod_gearman_worker.conf --pidfile=/var/mod_gearman/mod_gearman_worker.pid
nagios 25919 0.0 0.1 137492 3016 ? S 07:38 0:00 /usr/bin/mod_gearman_worker -d --config=/etc/mod_gearman/mod_gearman_worker.conf --pidfile=/var/mod_gearman/mod_gearman_worker.pid

.. etc ..

Are you saying I just need gearmand running on the collector? I'm
quite new to gearman, so I might have misunderstood which parts are
necessary where. I can easily shut down the mod_gearman_worker
service; I just need to understand the consequences.

I assumed that this was a Nagios error - perhaps I just have my
gearman setup configured wrong.

Benny
Sven Nierlein
2013-08-28 13:07:37 UTC
Post by C. Bensend
Are you saying I just need gearmand running on the collector?
Well, I assumed so. You are the only one who can really tell.
You will need a worker on each host that should run checks. If your
collector should not run any checks, then no worker is necessary.

See http://labs.consol.de/nagios/mod-gearman/#_common_scenarios for a list
of common setups.

Sven
C. Bensend
2013-08-28 13:34:24 UTC
Post by Sven Nierlein
Post by C. Bensend
Are you saying I just need gearmand running on the collector?
Well, I assumed so. You are the only one who can really tell.
You will need a worker on each host that should run checks. If your
collector should not run any checks, then no worker is necessary.
See http://labs.consol.de/nagios/mod-gearman/#_common_scenarios for a list
of common setups.
OK, yes, I grok that. I guess I would want the collector to be *able*
to run checks, if it doesn't get timely information from the pollers.
I'm assuming that's why it's even trying in the first place - it
doesn't see a result in a timely manner, so it thinks it should run
one.

Which circles back to my original question - why can't it run the
check? Why isn't it finding what it needs to find? The workers
are running as the nagios user, and I don't see anything that appears
pertinent in the mod_gearman_worker.conf file... What am I missing?
Neither the gearmand.log nor the mod_gearman_worker.log files seem
to have any complaints (but I haven't bumped up the debug on them yet).
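
For context on the number itself: 127 is the exit status a POSIX shell returns when it cannot find the command it was asked to run, while Nagios plugins are expected to return 0 through 3. The sketch below, which is not from the thread, runs a command line through /bin/sh and shows both cases. The path `/nonexistent/libexec/check_nrpe` is a made-up example, and the helper names are hypothetical; this says nothing about how the worker actually launches checks.

```python
import subprocess

# Valid Nagios plugin return codes: OK, WARNING, CRITICAL, UNKNOWN.
VALID_PLUGIN_CODES = range(0, 4)


def run_check(command_line: str) -> int:
    """Run a command line through /bin/sh and return its exit status.

    Assumes a POSIX system, where shell=True means /bin/sh -c.
    """
    completed = subprocess.run(
        command_line,
        shell=True,
        stdout=subprocess.DEVNULL,
        stderr=subprocess.DEVNULL,
    )
    return completed.returncode


def out_of_bounds(return_code: int) -> bool:
    """True if the status falls outside the 0..3 plugin range."""
    return return_code not in VALID_PLUGIN_CODES


# A plugin path that does not exist (made-up path for illustration).
missing = run_check("/nonexistent/libexec/check_nrpe -H 127.0.0.1")
print(missing, out_of_bounds(missing))

# A command that exits with a valid plugin status (2 = CRITICAL).
critical = run_check("exit 2")
print(critical, out_of_bounds(critical))
```

Running the exact command line of a failing check this way, as the same user and with the same environment the worker uses, is one way to see whether the shell can locate the plugin at all.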

Thanks so much for your help!

Benny