Discussion:
[Nagios-users] High CPU utilization at random times
Dan Wilson
2005-10-05 02:41:53 UTC
I've been looking into a problem for quite some time now and have come
up stumped. Every time I think I know what the problem is I turn out to
be wrong.

Sorry, this is LONG but has lots of detail, hopefully all the detail you
guys need to make a diagnosis or point me in the right direction :-)

PROBLEM:
Randomly, and for no good reason, the CPU usage on this machine will go
up to anywhere from .7 to 1.5!?!?!?!?!?

HARDWARE:
PIII 677
384MB ram
Software RAID 1 with IDE(all partitions except swap, yes, I boot from it
too... I already took crap for booting from software raid, but it works
fine, really)
extra drive for swap and nightly "snapshots" of /usr/local/ and /etc and
a few other things.

SOFTWARE:
Mandrake linux 10.1(last updates 45 days ago)
Nagios 1.2 (no perl interpreter, with perl cache)
Plugins 1.3.1
Optional/custom plugins...
check_icmp instead of check_ping
custom check_ink script/plugin - this plugin is written in perl and uses
the netsnmp module for perl. This isn't the problem either, stopped all
service checks that used it for a few hours, the problem was still
there.... FYI: This script checks supply levels in network printers, I
could have used the check_snmp plugin for this but that was too messy (I
tried!). This way the output is cleaner (ex. Levels OK - C-34% Y-75%
M-12% K-90%) and there is only one check per printer instead of one for
each supply :-) [my programming skills suck, really, they do. You have
to specify the type of printer, which has to be put in the script so it
can correctly read the supplies... I should have written it to
"explore" the printer to see what kind of supplies it had and what could
be checked so it would in theory work with any printer... but it works
the way it is, and I couldn't figure out how to get everything to
work... I'm learning and will some day get it to work the way I want????]
check_smart - checks HDD SMART values... not the trouble either, it was
added recently after a HDD went bad and the box crashed 2 nights in a
row(the extra drive was bad and failed during the "snapshot")

The following were the latest stable versions as of about Feb-2005
Apache
MRTG
NetSNMP
PERL
PHP
MySQL


THINGS I HAVE DONE/LOOKED AT TO TRY AND FIX THIS ISSUE:

Recompiled the kernel... no change, went back to the standard kernel.

Restarted like a MS machine... uptime makes no difference, plenty of
memory available (150+MB) all the time

Nagios - stopped the service, no issue; start the service and let it run
a while, and the problem appears... I recompiled (twice), adjusted a few
options, no luck with the issue though Nagios ran a tiny bit faster, maybe
1-2%, not worth the wait to recompile IMHO

MRTG - checking interface on 2 routers, it is using RRD and the
MRTG-RRD.CGI fast cgi script so the load from this every 5 minutes isn't
even worth mentioning. Tried removing access from users to stop
MRTG-RRD.CGI from generating graphs on demand. I even tried stopping
MRTG and lost 4 hours of data but still had the problem.

Apache - stopped the service, problem still continues.

PERL - recompiled and removed a few options that the documentation said
could cause trouble, no change. Even ran Nagios without any perl
scripts/plugins, problem still there.

PHP - nothing is using this at the moment... was only installed for
testing a Nagios config utility with a web interface...

MySQL - not being used, makes no difference if it is running or not.

I only run X while downloading updates, otherwise it stays off and I
just SSH in.


MORE INFO:

At first I only noticed it when I would SSH in and look at the load
because it took 15+ seconds to log in. I thought it was SSH, so I started
having Nagios check the CPU load; I can look from time to time and catch
it up nice and high.

It is NOT logs being rotated, excessive swapping, bad hardware (second
machine it's happened on), too many people accessing the box, too many
services/hosts down.(I'm checking about 90 hosts and 180+ services,
after I delete the retention data and start Nagios fresh everything is
checked and fine in 2 minutes or less.).

It's not to the point where the box is unusable, it clears up in a
minute or two(always, every time, and that makes it hard to track down).

It is NOT (at least not that I can tell) Nagios making excessive retries
on problems; it happens when there are no problems and I have the max
retries set to 3 for all but a few things. Timeouts are 10 seconds or
less on all but one check. I'm not using obsessive checks, processing
perf data or anything like that.

When I first installed nagios 2 years ago I tinkered with getting it to
respond faster, I set the time period to 15 seconds(default is 60?) so I
could get a few things running every 15 or 30 seconds... works great and
with little increased overhead.... I just have to remember that 1
minute is now an interval of 4 and not 1... ;-) Nagios responds like a champ now,
forced checks don't take a minute or longer... 20 seconds at the
longest. I HATE WAITING! LOL
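
[Editor's sketch] The arithmetic behind "1 minute is now 4" can be written out as below. It assumes the "time period" mentioned above is the interval_length directive in nagios.cfg (the post does not name it), and the function names are made up for illustration.

```python
def check_interval_seconds(units: int, interval_length: int = 60) -> int:
    """Wall-clock seconds between checks for a given number of time units.

    interval_length is assumed to be the number of seconds in one Nagios
    "time unit"; 60 is assumed to be the default.
    """
    return units * interval_length


def units_for_seconds(seconds: int, interval_length: int = 60) -> int:
    """Number of time units needed to wait at least `seconds` (rounded up)."""
    return -(-seconds // interval_length)  # ceiling division on integers


if __name__ == "__main__":
    # With 15-second units, one unit is 15 s and a real minute takes 4 units.
    print(check_interval_seconds(1, interval_length=15))
    print(units_for_seconds(60, interval_length=15))
```

Every interval in the object definitions is multiplied by the same unit length, so shortening the unit also shortens any interval that was not adjusted to match.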




Any ideas? Or should I just live with it until I upgrade to 2.0? I'll
be moving to faster hardware then anyway, dual PIII 700 with 2GB ram and
hardware RAID1... It's not much but it is better :-)
Chris Wilson
2005-10-05 07:23:26 UTC
Hi Dan,
Post by Dan Wilson
Nagios - stopped the service, no issue; start the service and let it run
a while, and the problem appears... I recompiled (twice), adjusted a few
options, no luck with the issue though Nagios ran a tiny bit faster, maybe
1-2%, not worth the wait to recompile IMHO
Just to check I understood correctly, stopping Nagios is the ONLY thing
that you've found so far that makes the problem go away?

Try removing services from your Nagios configuration in batches, to
narrow down which one(s) are causing the problem.
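
[Editor's sketch] One way to organise the batch removal is a simple halving search, sketched below. It assumes a single culprit and a reliable yes/no test per batch; with an intermittent spike like this one, each test means running Nagios with only that batch enabled for long enough to be reasonably sure. The names narrow_down and problem_occurs are hypothetical.

```python
from typing import Callable, List, Sequence


def narrow_down(services: Sequence[str],
                problem_occurs: Callable[[Sequence[str]], bool]) -> List[str]:
    """Halve the list of suspect services until one remains.

    problem_occurs(batch) must return True if the load spike shows up
    while only `batch` is enabled. Assumes exactly one culprit: if the
    first half is clean, the second half is blamed without retesting.
    """
    suspects = list(services)
    while len(suspects) > 1:
        half = len(suspects) // 2
        first, second = suspects[:half], suspects[half:]
        suspects = first if problem_occurs(first) else second
    return suspects
```

For roughly 180 services this needs about 8 rounds instead of 180 individual tests. If several services contribute together, the single-culprit assumption fails and the batches have to be tested more carefully.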

Cheers, Chris.
--
(aidworld) chris wilson | chief engineer (***@aidworld.org)
Scot Jenkins
2005-10-05 13:40:17 UTC
Post by Dan Wilson
I've been looking into a problem for quite some time now and have come
up stumped. Every time I think I know what the problem is I turn out to
be wrong.
Sorry, this is LONG but has lots of detail, hopefully all the detail you
guys need to make a diagnosis or point me in the right direction :-)
Randomly, and for no good reason, the CPU usage on this machine will go
up to anywhere from .7 to 1.5!?!?!?!?!?
A load average of 1.5 is not really all that high, especially if it's
not sustained for any lengthy period of time. I've seen a heavily
loaded Linux mail server running with a load average of about 30, and
a FreeBSD system with a load average of over 100 (Apache went nuts
spawning CGI scripts). Also keep in mind the load average is reported
for the last 1, 5, and 15 minutes; see uptime(1) for details. In
which field are you seeing the 1.5 load average?
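
[Editor's sketch] To see which of the three fields carries the spike, the values can be read directly. The snippet below parses text in the format of Linux's /proc/loadavg; the class and function names are illustrative only.

```python
from typing import NamedTuple


class LoadAverage(NamedTuple):
    one_min: float
    five_min: float
    fifteen_min: float


def parse_loadavg(text: str) -> LoadAverage:
    """Parse the first three fields of /proc/loadavg-style content.

    The remaining fields (runnable/total tasks, last PID) are ignored.
    """
    one, five, fifteen = (float(x) for x in text.split()[:3])
    return LoadAverage(one, five, fifteen)
```

On a Linux box, parse_loadavg(open("/proc/loadavg").read()) gives the current values. A high one_min with a low fifteen_min points to a short burst rather than sustained load.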

Are you tracking disk IO somewhere? sar and iostat (part of the
sysstat package) are good tools for this task. You might want to
track and compare disk and CPU to see if they're related. Since
you're running software RAID it could be that disk IO is causing the
CPU spike.

Check the Nagios "trends" CGI output and compare that with the output
from other tools: vmstat, top (real-time), sar, iostat (real-time and
historical) to get a feel for what is normal for your system.
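
[Editor's sketch] If sysstat is not installed, a rough version of the same comparison can be made from two samples of the aggregate "cpu" line in /proc/stat. This assumes a kernel that reports an iowait column (2.6 and later); the field list and function names are chosen here for illustration.

```python
from typing import Dict, Tuple

# Column order of the aggregate "cpu" line; missing columns are treated as 0.
FIELDS = ["user", "nice", "system", "idle", "iowait", "irq", "softirq"]


def parse_cpu_line(line: str) -> Dict[str, int]:
    """Turn a line such as 'cpu 100 0 50 800 50 0 0' into named counters."""
    parts = line.split()
    if not parts or parts[0] != "cpu":
        raise ValueError("expected the aggregate 'cpu' line")
    values = [int(v) for v in parts[1:1 + len(FIELDS)]]
    values += [0] * (len(FIELDS) - len(values))
    return dict(zip(FIELDS, values))


def busy_and_iowait_share(before: Dict[str, int],
                          after: Dict[str, int]) -> Tuple[float, float]:
    """Fraction of elapsed CPU time spent busy and spent waiting on I/O."""
    delta = {k: after[k] - before[k] for k in FIELDS}
    total = sum(delta.values())
    if total == 0:
        return 0.0, 0.0
    busy = total - delta["idle"] - delta["iowait"]
    return busy / total, delta["iowait"] / total
```

Taking a sample every few seconds during a spike and seeing the iowait share dominate would support the disk I/O theory; a high busy share would point at a process burning CPU instead.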

Scot
Guy B. Purcell
2005-10-05 20:19:14 UTC
Post by Dan Wilson
I've been looking into a problem for quite some time now and have
come up stumped. Every time I think I know what the problem is I
turn out to be wrong.
Randomly, and for no good reason, the CPU usage on this machine
will go up to anywhere from .7 to 1.5!?!?!?!?!?
I'm assuming these are load average numbers, not CPU utilization
percentages or something else. (This problem seems a tad off-topic
for this list, since it really doesn't seem to be related to Nagios
other than that Nagios is reporting seemingly unusual load. Have you
asked a Linux UG for suggestions?)
Post by Dan Wilson
PIII 677
384MB ram
Software RAID 1 with IDE(all partitions except swap, yes, I boot
from it too... I already took crap for booting from software raid,
but it works fine, really)
extra drive for swap and nightly "snapshots" of /usr/local/ and
/etc and a few other things.
I don't see a problem at all (at least it wouldn't be on a Solaris
box; not sure what the load avg. numbers under Mandrake mean): on a
box that's doing software RAID & running the Nagios server, you
should expect to see some load, on average; and I wouldn't worry
about loads up to twice the number of CPUs in the box for brief
periods (again, at least not running Solaris, where "load average"
means the number of processes in the run queue--including those on
CPU, as well as those hanging out waiting on some I/O to complete).
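
[Editor's sketch] The rule of thumb above (brief loads up to twice the number of CPUs are nothing to worry about) can be expressed as a small check. The factor of 2 comes from the paragraph above; the function name is made up.

```python
def brief_load_is_ok(load: float, cpus: int, factor: float = 2.0) -> bool:
    """True if a short-lived load average is within factor * number of CPUs."""
    if cpus < 1:
        raise ValueError("cpus must be at least 1")
    return load <= factor * cpus
```

On the single-CPU box described in this thread, a brief peak of 1.5 passes this check; it says nothing about sustained load.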

However, if this box truly is doing nothing and you still see high
loads--especially for prolonged periods--perhaps there is a problem.
Try shutting down Nagios & any other daemons you don't need (eg.
sendmail) for a while and checking the 'sar' logs for load bumps. If
you still see load when there shouldn't be any, you may have been
hacked (although by someone not very competent if s/he allowed load
from their hidden activities to show).

-Guy
Andreas Ericsson
2005-10-06 04:24:57 UTC
Post by Dan Wilson
I've been looking into a problem for quite some time now and have come
up stumped. Every time I think I know what the problem is I turn out to
be wrong.
Sorry, this is LONG but has lots of detail, hopefully all the detail you
guys need to make a diagnosis or point me in the right direction :-)
Randomly, and for no good reason, the CPU usage on this machine will go
up to anywhere from .7 to 1.5!?!?!?!?!?
PIII 677
384MB ram
Software RAID 1 with IDE(all partitions except swap, yes, I boot from it
too... I already took crap for booting from software raid, but it works
fine, really)
extra drive for swap and nightly "snapshots" of /usr/local/ and /etc and
a few other things.
Mandrake linux 10.1(last updates 45 days ago)
Nagios 1.2 (no perl interpreter, with perl cache)
I don't think you could have the perl cache without the perl interpreter...
Post by Dan Wilson
Plugins 1.3.1
Optional/custom plugins...
check_icmp instead of check_ping
Early incarnations of check_icmp could end up in an infinite loop if it
timed out and entered the finish() function. This of course ups the load
no end, until Nagios kills it off with SIGKILL. Try upgrading it from
the package at http://oss.op5.se/nagios/op5plugins-2005-09-27.tar.gz

AFAIK, this bug was only ever present in a version of check_icmp which
specifically wasn't intended for production use, but was tested by a
number of friendly helpers (all mentioned in check_icmp.c).
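
[Editor's sketch] The pattern described above (a plugin spins until the scheduler's timeout kills it) can be reproduced in miniature. This is not Nagios's own code, only an illustration using Python's subprocess module; run_with_timeout is a hypothetical helper.

```python
import subprocess
import sys
from typing import List, Optional, Tuple


def run_with_timeout(cmd: List[str],
                     timeout_s: float) -> Tuple[Optional[int], str]:
    """Run a command and kill it if it exceeds timeout_s seconds.

    Returns (exit_code, output); exit_code is None if the command was killed.
    """
    try:
        done = subprocess.run(cmd, capture_output=True, text=True,
                              timeout=timeout_s)
    except subprocess.TimeoutExpired:
        # subprocess.run kills the child and waits for it before re-raising.
        return None, "timed out and was killed"
    return done.returncode, done.stdout.strip()


if __name__ == "__main__":
    # A stand-in for a plugin stuck in a busy loop: it uses a full CPU
    # until the timeout expires.
    print(run_with_timeout([sys.executable, "-c", "while True: pass"], 1.0))
```

While the stand-in runs it keeps one process permanently runnable, which is the kind of thing that pushes the load average towards 1 for roughly the length of the timeout.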
Post by Dan Wilson
custom check_ink script/plugin - this plugin is written in perl and uses
the netsnmp module for perl. This isn't the problem either, stopped all
service checks that used it for a few hours, the problem was still
there.... FYI: This script checks supply levels in network printers, I
could have used the check_snmp plugin for this but that was too messy (I
tried!). This way the output is cleaner (ex. Levels OK - C-34% Y-75%
M-12% K-90%) and there is only one check per printer instead of one for
each supply :-) [my programming skills suck, really, they do. You have
to specify the type of printer, which has to be put in the script so it
can correctly read the supplies... I should have written it to
"explore" the printer to see what kind of supplies it had and what could
be checked so it would in theory work with any printer... but it works
the way it is, and I couldn't figure out how to get everything to
work... I'm learning and will some day get it to work the way I want????]
check_smart - checks HDD SMART values... not the trouble either, it was
added recently after a HDD went bad and the box crashed 2 nights in a
row(the extra drive was bad and failed during the "snapshot")
The following were the latest stable versions as of about Feb-2005
Apache
MRTG
NetSNMP
PERL
PHP
MySQL
Recompiled the kernel... no change, went back to the standard kernel.
Restarted like a MS machine... uptime makes no difference, plenty of
memory available (150+MB) all the time
This seems to indicate an infinite loop problem in some small piece of
software then. Believe me, it can eat load *fast*.
Post by Dan Wilson
Nagios - stopped the service, no issue; start the service and let it run
a while, and the problem appears... I recompiled (twice), adjusted a few
options, no luck with the issue though Nagios ran a tiny bit faster, maybe
1-2%, not worth the wait to recompile IMHO
Did you happen to notice if this coincided with a host going down or in
some other way not being able to respond to ping? The host check (or
ping service check) output would be something along the lines of "Plugin
timed out" if it was down to check_icmp.
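
[Editor's sketch] To check for that coincidence, the Nagios log can be scanned for alerts whose output mentions a timeout, and the timestamps compared with the times of the load spikes. The sketch assumes log lines of the form "[epoch] SERVICE ALERT: ..." or "[epoch] HOST ALERT: ..."; the exact plugin output text is an assumption, so the search string is a parameter.

```python
import re
from typing import Iterable, List, Tuple

# Assumed log line shape: "[<unix time>] SERVICE ALERT: <details>"
ALERT = re.compile(r"^\[(\d+)\] (?:SERVICE|HOST) ALERT: (.*)$")


def timed_out_alerts(lines: Iterable[str],
                     needle: str = "timed out") -> List[Tuple[int, str]]:
    """Return (timestamp, details) for alert lines mentioning a timeout."""
    hits = []
    for line in lines:
        match = ALERT.match(line.strip())
        if match and needle.lower() in match.group(2).lower():
            hits.append((int(match.group(1)), match.group(2)))
    return hits
```

Feeding it the lines of nagios.log (path depends on the installation) gives a list of candidate moments; if they line up with the spikes, check_icmp becomes the prime suspect.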
--
Andreas Ericsson ***@op5.se
OP5 AB www.op5.se
Tel: +46 8-230225 Fax: +46 8-230231