[Nagios-users] check_cpu

Discussion:

[Nagios-users] check_cpu_usage plugin?

ffwqe efwa

2007-03-29 20:49:16 UTC

Simple question: is there a plugin or an easy way to check average processor
usage over 1 or 5 minutes on Linux machines? There's a check_load plugin, but
I can't find anything that would easily function as a check_cpu plugin.

Andy Shellam

2007-03-29 20:57:41 UTC

That's exactly what the load average is - it shows how busy the
processor is/has been over the last 1 minute, 5 minutes and 15 minutes.

http://en.wikipedia.org/wiki/Load_(computing)

Andy.

_______________________________________________
Nagios-users mailing list
https://lists.sourceforge.net/lists/listinfo/nagios-users
::: Please include Nagios version, plugin version (-v) and OS when reporting any issue.
::: Messages without supporting info will risk being sent to /dev/null

ffwqe efwa

2007-03-29 21:10:27 UTC

Andy,

I was under the impression that load was more indicative of processes waiting
for CPU cycles and of general system performance than of actual CPU usage. For
example, processes waiting heavily on disk I/O could cause high load without
actually consuming CPU, amongst other things. Am I incorrect?

If I can rely on check_load to return .50 for the 5 minute average when 50% of
the CPU is used for five minutes straight, this should work for my needs. My
gut feeling tells me it never quite works this way. Additionally, can't a
momentary spike in load (greater than 1.0, or 100% CPU) cause a number above
1/100% to be averaged in and throw these measurements off?

Although it would be nice to monitor CPUs individually in a multi-CPU system
for hung processes, etc., I can live without this if check_load works nicely.

Thanks!


Thomas Stocking

2007-03-29 22:43:00 UTC

#!/usr/bin/perl -w
#
# $Id: check_cpu,v 1.7 2004/06/22 16:51:05 hmann Exp $
#
# check_cpu checks CPU and returns sar stats
#
# Copyright (c) 2000-2004 Harper Mann. All rights reserved
# (***@comcast.net)
#
# No warranty of any kind is granted or implied.
#
# Change Log
#----------------
# 15-Feb-2004 - Harper Mann
# Initial revision
# 19-Mar-2004 Harper Mann
# Fixed to use /proc instead of sar
# 23-Mar-2004 Harper Mann
# Back to sar so numbers match. Beta release
# 4-Jun-2004 - Harper Mann
# Changed idle value to be a < test
# 21-Jun-2004 - Harper Mann
# Added switch for CPU
#
use strict;

my @sar_vals;
my @lines;
my @res;

my $PROCSTAT = "/proc/stat";

my $cpu;
my $userwarn = 101;
my $usercrit = 101;
my $nicewarn = 101;
my $nicecrit = 101;
my $syswarn = 101;
my $syscrit = 101;
my $idlewarn = 0;
my $idlecrit = 0;

my $debug = 0;
my $perf = 0;

my $SAR = "/usr/local/groundwork/bin/sar";

use Getopt::Long;
use vars qw($opt_V $opt_c $opt_s $opt_n $opt_u $opt_i $opt_D $opt_p $opt_h);
use vars qw($PROGNAME);
use lib "/usr/local/groundwork/nagios/libexec" ;
use utils qw($TIMEOUT %ERRORS &print_revision &support &usage);

sub print_help ();
sub print_usage ();

$PROGNAME = "check_cpu";

Getopt::Long::Configure('bundling');
my $status = GetOptions
( "V" => \$opt_V, "Version" => \$opt_V,
"c=s" => \$opt_c, "cpu=s" => \$opt_c,
"u=s" => \$opt_u, "user=s" => \$opt_u,
"n=s" => \$opt_n, "nice=s" => \$opt_n,
"s=s" => \$opt_s, "system=s" => \$opt_s,
"i=s" => \$opt_i, "idle=s" => \$opt_i,
"D" => \$opt_D, "debug" => \$opt_D,
"p" => \$opt_p, "performance" => \$opt_p,
"h" => \$opt_h, "help" => \$opt_h );

if ($status == 0) { print_usage() ; exit $ERRORS{'UNKNOWN'}; }

# Debug switch
if ($opt_D) { $debug = 1 }

# Cpu switch
if ($opt_c) { $cpu = $opt_c } else { $cpu = 0 }
print "CPU: $cpu\n" if $debug;

# Performance switch
if ($opt_p) { $perf = 1; }

# Version
if ($opt_V) {
print_revision($PROGNAME,'$Revision: 1.7 $');
exit $ERRORS{'OK'};
}

if ($opt_h) {print_help(); exit $ERRORS{'UNKNOWN'}}

# Options checking
# Percent CPU system utilization
if ($opt_s) {
($syswarn, $syscrit) = split /:/, $opt_s;

($syswarn && $syscrit) || usage ("missing value -s <warn:crit>\n");

($syswarn =~ /^\d{1,3}$/ && $syswarn > 0 && $syswarn <= 100) &&
($syscrit =~ /^\d{1,3}$/ && $syscrit > 0 && $syscrit <= 100) ||
usage("Invalid value: -s <warn:crit> (system percent): $opt_s\n");

($syscrit > $syswarn) ||
usage("system critical (-s $opt_s <warn:crit>) must be > warning\n");
}

# Percent CPU nice utilization
if ($opt_n) {
($nicewarn, $nicecrit) = split /:/, $opt_n;

($nicewarn && $nicecrit) || usage ("missing value -n <warn:crit>\n");

($nicewarn =~ /^\d{1,3}$/ && $nicewarn > 0 && $nicewarn <= 100) &&
($nicecrit =~ /^\d{1,3}$/ && $nicecrit > 0 && $nicecrit <= 100) ||
usage("Invalid value: -n <warn:crit> (nice percent): $opt_n\n");

($nicecrit > $nicewarn) ||
usage("nice critical (-n $opt_n <warn:crit>) must be > warning\n");
}

# Percent CPU user utilization
if ($opt_u) {
($userwarn, $usercrit) = split /:/, $opt_u;

($userwarn && $usercrit) || usage ("missing value -u <warn:crit>\n");

($userwarn =~ /^\d{1,3}$/ && $userwarn > 0 && $userwarn <= 100) &&
($usercrit =~ /^\d{1,3}$/ && $usercrit > 0 && $usercrit <= 100) ||
usage("Invalid value: -u <warn:crit> (user percent): $opt_u\n");

($usercrit > $userwarn) ||
usage("user critical (-u $opt_u <warn:crit>) must be > warning\n");
}

# Percent CPU idle utilization
if ($opt_i) {
($idlewarn, $idlecrit) = split /:/, $opt_i;

($idlewarn && $idlecrit) || usage ("missing value -i <warn:crit>\n");

($idlewarn =~ /^\d{1,3}$/ && $idlewarn > 0 && $idlewarn <= 100) &&
($idlecrit =~ /^\d{1,3}$/ && $idlecrit > 0 && $idlecrit <= 100) ||
usage("Invalid value: -i <warn:crit> (idle percent): $opt_i\n");

($idlecrit < $idlewarn) ||
usage("idle critical (-i $opt_i <warn:crit>) must be < warning\n");
}

# Read CPU statistics from sar. "ALL" requests the aggregate values across
# all CPUs; otherwise only the selected CPU is reported.
#

my ($lbl, $user, $nice, $sys, $idle);
if ($cpu eq "ALL" ) {
(@res = qx/$SAR 1/) || die "No output from sar: $!";
} else {
(@res = qx/$SAR 1 -U $cpu/) || die "No output from sar: $!";
}
foreach (@res) {
chomp;
($lbl,$cpu,$user,$nice,$sys,$idle) = split(/\s+/);
if (/average/i) { last } # sar labels the summary line "Average:"
}

# Do the value checks
my $out = "";
$out=$out."(cpu: $cpu) ";

$out=$out."user: $user";
($user > $usercrit) ? ($out=$out."(Critical) ") :
($user > $userwarn) ? ($out=$out."(Warning) ") : ($out=$out."(OK) ");

$out=$out."nice: $nice";
($nice > $nicecrit) ? ($out=$out."(Critical) ") :
($nice > $nicewarn) ? ($out=$out."(Warning) ") : ($out=$out."(OK) ");

$out=$out."sys: $sys";
($sys > $syscrit) ? ($out=$out."(Critical) ") :
($sys > $syswarn) ? ($out=$out."(Warning) ") : ($out=$out."(OK) ");

$out=$out."idle: $idle";
($idle < $idlecrit) ? ($out=$out."(Critical) ") :
($idle < $idlewarn) ? ($out=$out."(Warning) ") : ($out=$out."(OK) ");

print "$out";

print " |cpu: $cpu user: $user nice: $nice sys: $sys idle: $idle" if $perf;
print "\n"; # always end the plugin output with a newline

# Plugin output
# $worst == $ERRORS{'OK'} ? print "CPU OK @goodlist" : print "@badlist";

# Performance?

if ($out =~ /Critical/) { exit (2) }
if ($out =~ /Warning/) { exit (1) }
exit (0); #OK

# Usage sub
sub print_usage () {
print "Usage: $PROGNAME
[-c, --cpu <cpu number>]
[-u, --user <warn:crit> percent]
[-n, --nice <warn:crit> percent]
[-s, --system <warn:crit> percent]
[-i, --idle <warn:crit> percent] (NOTE: idle less than x)
[-p] (output Nagios performance data)
[-D] (debug) [-h] (help) [-V] (Version)\n";
}

# Help sub
sub print_help () {
print_revision($PROGNAME,'$Revision: 1.7 $');

# Perl device CPU check plugin for Nagios

print_usage();
print "
-c, --cpu
CPU Number (default is 0, ALL for all)
-u, --user
Percent CPU user
-n, --nice
Percent CPU nice
-s, --system=STRING
Percent CPU system
-i, --idle
If less than Percent CPU idle
-p, --performance
Report Nagios performance data after the output string
-h, --help
Print help
-V, --Version
Print version of plugin
";

}

Thomas Guyot-Sionnest

2007-03-30 03:04:16 UTC

Actually, the exact meaning is the average number of processes on the CPU
run queue. For example, if you have two processes running constantly (e.g.
number-crunching applications), your load average will always be 2 (or
higher if there are other programs running).

Since the run queue does not have an upper limit like the CPU usage
percentage, the meaning is slightly different. For example, on a quad
processor system that runs many processes, the load can climb pretty high
without the CPU being fully utilized, because at some times you might
have 200 processes on the run queue and at other times none, so the
average will be high.

The average CPU usage, on the other hand, can't be higher than 100%.
This means that when you approach the upper limit under normal load, you
know you're hitting the limit of your server, so this might be a much
more precise indication that you need a faster server.
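
To make the difference concrete, here is a small Python sketch with made-up
numbers. The sample values and the helper names mean_run_queue and
mean_cpu_usage are only illustrative assumptions, not output from a real
system. It uses a plain arithmetic mean, whereas the kernel's load average is
an exponentially damped average, but the point about the missing upper limit
is the same.

```python
def mean_run_queue(samples):
    """Average number of runnable processes over the sampling window."""
    return sum(samples) / len(samples)


def mean_cpu_usage(samples, ncpus):
    """Average CPU usage in percent.

    Each sample can keep at most ncpus processors busy, so a single sample
    never contributes more than 100%.
    """
    busy = [min(queued, ncpus) / ncpus * 100.0 for queued in samples]
    return sum(busy) / len(busy)


# Hypothetical quad-processor box: one burst of 200 runnable processes,
# then nothing on the run queue for the next nine samples.
samples = [200] + [0] * 9

print(mean_run_queue(samples))     # 20.0
print(mean_cpu_usage(samples, 4))  # 10.0
```

The averaged run queue says 20 while the processors were busy only a tenth of
the time, which is why the two numbers cannot be read the same way.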

About the check question, I don't know of any standalone check besides the
INTEGER SNMP OIDs:

UCD-SNMP-MIB::ssCpuUser.0 (.1.3.6.1.4.1.2021.11.9.0)
UCD-SNMP-MIB::ssCpuSystem.0 (.1.3.6.1.4.1.2021.11.10.0)
UCD-SNMP-MIB::ssCpuIdle.0 (.1.3.6.1.4.1.2021.11.11.0)

However, these are not very precise. To monitor the CPU usage better, I
suggest graphing it (using MRTG, Cacti and the like) from the raw values.
These are counters that have to be polled; the average usage is then
calculated based on the time between polls. You can optionally write a
check script that extracts the 1-minute, 5-minute, etc. average from the
RRD file using rrdtool. Cacti has a plugin for that if you don't know how
to do it, though you won't have 1-minute polling (Cacti has a 1-minute
patch, but it requires many more modifications to work properly).

Raw OIDs that can be polled (there may be more or fewer depending on the
system):

UCD-SNMP-MIB::ssCpuRawUser.0 (.1.3.6.1.4.1.2021.11.50.0)
UCD-SNMP-MIB::ssCpuRawNice.0 (.1.3.6.1.4.1.2021.11.51.0)
UCD-SNMP-MIB::ssCpuRawSystem.0 (.1.3.6.1.4.1.2021.11.52.0)
UCD-SNMP-MIB::ssCpuRawIdle.0 (.1.3.6.1.4.1.2021.11.53.0)
UCD-SNMP-MIB::ssCpuRawWait.0 (.1.3.6.1.4.1.2021.11.54.0)
UCD-SNMP-MIB::ssCpuRawKernel.0 (.1.3.6.1.4.1.2021.11.55.0)
UCD-SNMP-MIB::ssCpuRawInterrupt.0 (.1.3.6.1.4.1.2021.11.56.0)
UCD-SNMP-MIB::ssCpuRawSoftIRQ.0 (.1.3.6.1.4.1.2021.11.61.0)
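
As a rough sketch of that calculation: poll the raw counters twice and divide
each counter's increase by the total increase. The Python below assumes the
counters are 32-bit and wrap at 2**32, and that the values have already been
fetched (for example with snmpget). The function names and the sample numbers
are made up for illustration. It only uses user, nice, system and idle,
because on some agents the system counter may already include wait and
kernel time, so summing every raw OID could count time twice.

```python
COUNTER32_MODULUS = 2 ** 32  # assumption: the raw counters are 32-bit and wrap


def counter_delta(old, new, modulus=COUNTER32_MODULUS):
    """Increase of a wrapping counter between two polls."""
    return (new - old) % modulus


def cpu_percentages(previous, current):
    """Percent of CPU time spent in each state between two polls.

    previous and current map a state name (e.g. "user") to its raw counter.
    """
    deltas = {name: counter_delta(previous[name], current[name])
              for name in current}
    total = sum(deltas.values())
    if total == 0:
        raise ValueError("counters did not advance between polls")
    return {name: 100.0 * delta / total for name, delta in deltas.items()}


# Made-up samples of ssCpuRawUser, ssCpuRawNice, ssCpuRawSystem and
# ssCpuRawIdle taken one polling interval apart.
first = {"user": 1000, "nice": 0, "system": 500, "idle": 8500}
second = {"user": 1300, "nice": 0, "system": 600, "idle": 9100}

print(cpu_percentages(first, second))
# {'user': 30.0, 'nice': 0.0, 'system': 10.0, 'idle': 60.0}
```

A check script could then compare these percentages against warning and
critical thresholds in the usual way.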

Thomas


Aaron M. Segura

2007-03-29 20:59:51 UTC

SNMP-based and otherwise...

http://www.nagiosexchange.org/Search_Projects.43.0.html?tx_netnagext_pi1%5Bphrase%5D=cpu&tx_netnagext_pi1%5Bsubmit%5D=search&tx_netnagext_pi1%5Bsearch%5D=1

Hugo van der Kooij

2007-03-30 06:33:11 UTC

Something like the CPU usage seen on my server here?
http://arwen.waakhond.net/#Sensors

I read the values with SNMP and feed them to rrdtool, which does the
averaging for me. Now all you need to do is read the right value from the
RRD file.
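
A minimal sketch of that last step in Python, assuming the RRD has a data
source holding CPU usage in percent and that the text comes from something
like "rrdtool fetch cpu.rrd AVERAGE --start -300". The file name, the sample
text and the helper average_from_fetch are made up for illustration.

```python
import math


def average_from_fetch(text, column=0):
    """Average one data-source column of rrdtool fetch output.

    Rows whose value is unknown (printed as nan) are skipped.
    """
    values = []
    for line in text.splitlines():
        timestamp, sep, rest = line.partition(":")
        if not sep or not timestamp.strip().isdigit():
            continue  # header line with data-source names, or a blank line
        value = float(rest.split()[column])
        if not math.isnan(value):
            values.append(value)
    if not values:
        return None  # nothing but unknown values in the window
    return sum(values) / len(values)


# Made-up sample in the layout rrdtool fetch prints: names, blank line, rows.
sample = """\
                       cpu_user

1175236200: 1.0000000000e+01
1175236500: 3.0000000000e+01
1175236800: nan
"""

print(average_from_fetch(sample))  # 20.0
```

A check script would run the fetch, pass the output to a function like this
and compare the result with its thresholds.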

Hugo.

--
***@vanderkooij.org http://hugo.vanderkooij.org/
This message is using 100% recycled electrons.

Some men see computers as they are and say "Windows"
I use computers with Linux and say "Why Windows?"
(Thanks JFK, for the insight.)