Discussion:
[Nagios-users] Question about host checks
Simone Felici
2008-11-25 09:04:18 UTC
Hi to all!

I've been using Nagios for a long time and this is the first time I've asked a question... which means that until today everything has gone
perfectly, great product!
A week ago I installed 'ndoutils-1.4b7' to log into a MySQL server.
After some days I noticed that the table 'nagios_hostchecks' is growing very fast; it's 3 times bigger than
'nagios_servicechecks' even though I have 688 hosts and 1346 services, so more services than hosts.
Analyzing the content of the 'nagios_hostchecks' table, it seems Nagios is checking the host status VERY OFTEN.
Also, looking at the Nagios queue and the "Last Check Time" of my hosts, the date/time is updating every few seconds.
Shouldn't Nagios check the host state only on demand, when needed? It happens on all my hosts.
Here is an example of one host configuration:


host...

define host {
host_name <cut_my_host>
alias Windows 2000 Server
address <cut_my_ip>
use host_alarm_A
parents <cut_my_sw_parent>
contact_groups noc-group,tech24h-group
}


...template...

define host {
name host_alarm_A
process_perf_data 0
retain_status_information 1
flap_detection_enabled 0
retain_nonstatus_information 1
active_checks_enabled 1
passive_checks_enabled 0
check_period 24hx7
obsess_over_host 0
check_freshness 0
check_command check-host-alive
max_check_attempts 2
event_handler_enabled 0
notifications_enabled 1
notification_interval 3600
notification_period 24hx7
notification_options d,u,r
register 0
}

...check

define command {
command_name check-host-alive
command_line /usr/local/nagios/libexec/check_ping -H $HOSTADDRESS$ -w 3000.0,80% -c 5000.0,100% -p 1
}


Any suggestions?

Thanks!

Simon
--
Simone Felici E-Mail: ***@alpikom.it
Divisione Tecnica Tel: 0461 030 111
Alpikom S.p.A. Fax: 0461 030 112
v.Fersina, 23 - 38100 Trento URL: http://www.alpikom.it
Simone Felici
2008-11-26 07:54:33 UTC
Please help, does no one have an idea?
By the way, my Nagios version is 3.0.3.
So why are all my hosts checked roughly every 4 seconds? :(
Thanks!

simon
Marc Powell
2008-11-26 14:44:06 UTC
Post by Simone Felici
Please help, does no one have an idea?
By the way, my Nagios version is 3.0.3.
So why are all my hosts checked roughly every 4 seconds? :(
Thanks!
The information provided so far indicates that hosts will only be
checked on demand, as you expect. That means, probably, either the
information is incorrect or they're being checked on demand. Since the
normal interval for actions in nagios is measured in minutes, I'd lean
toward the on-demand side of things.

Can you post the host definition from status.dat?
Can you post relevant log entries for the host and any services on
that host near the host checks? You may need to increase your logging
options in nagios.cfg.
Is the 4 second number in any way significant to your installation? Is
your time_interval less than 60?
Debug mode is available to you to figure out what's going on. This is
almost certainly going to be your best source for resolution.

--
Marc
Simone Felici
2008-11-27 08:36:10 UTC
Good morning.
By time_interval do you mean interval_length?
I've set it to "1". This way all check intervals are in seconds, because certain services need a retry interval of
30 seconds.

Here is some additional info:

########################################
# NAGIOS STATUS FILE
#
# THIS FILE IS AUTOMATICALLY GENERATED
# BY NAGIOS. DO NOT MODIFY THIS FILE!
########################################

<<cut>>

hoststatus {
host_name=<<MY-WINDOWS-EXAMPLE-SERVER>>
modified_attributes=3
check_command=check-host-alive
check_period=24hx7
notification_period=24hx7
check_interval=5.000000
retry_interval=1.000000
event_handler=
has_been_checked=1
should_be_scheduled=1
check_execution_time=0.016
check_latency=12.595
check_type=0
current_state=0
last_hard_state=0
last_event_id=116739
current_event_id=116740
current_problem_id=0
last_problem_id=51304
plugin_output=PING OK - Packet loss = 0%, RTA = 0.65 ms
long_plugin_output=
performance_data=
last_check=1227773285
next_check=1227773291
check_options=0
current_attempt=1
max_attempts=2
current_event_id=116740
last_event_id=116739
state_type=1
last_state_change=1227714561
last_hard_state_change=1227714561
last_time_up=1227773286
last_time_down=1227714537
last_time_unreachable=1215613491
last_notification=0
next_notification=0
no_more_notifications=0
current_notification_number=0
current_notification_id=46804
notifications_enabled=1
problem_has_been_acknowledged=0
acknowledgement_type=0
active_checks_enabled=1
passive_checks_enabled=0
event_handler_enabled=0
flap_detection_enabled=0
failure_prediction_enabled=1
process_performance_data=0
obsess_over_host=0
last_update=1227773300
is_flapping=0
percent_state_change=0.00
scheduled_downtime_depth=0
}
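
For reference, the relevant numbers can be pulled out of a hoststatus block with a short script. This is only a sketch: the function names and the shortened sample are illustrative, not part of Nagios, and it assumes the simple key=value layout shown in the block above.

```python
def parse_hoststatus(block):
    """Parse the key=value lines of one hoststatus { ... } block into a dict."""
    fields = {}
    for raw in block.splitlines():
        line = raw.strip()
        # Skip the opening/closing brace lines and anything without '='.
        if "=" not in line:
            continue
        # Split on the first '=' only, so values may contain '=' themselves.
        key, _, value = line.partition("=")
        fields[key] = value
    return fields


def seconds_until_next_check(fields):
    """Gap between last_check and next_check (both Unix timestamps)."""
    return int(fields["next_check"]) - int(fields["last_check"])


# Shortened copy of the block above (illustrative sample).
SAMPLE = """hoststatus {
check_interval=5.000000
retry_interval=1.000000
last_check=1227773285
next_check=1227773291
}"""

fields = parse_hoststatus(SAMPLE)
print(float(fields["check_interval"]), seconds_until_next_check(fields))
```

With the values above, the next check is scheduled only a few seconds after the last one, which matches a check_interval of 5 units of one second each.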

<<cut>>

Then I enabled debugging (level 24); here is the result, pasting only the parts where the example host appears.
I've written "(..CUT..)" to skip MB of unimportant lines referring to other services/hosts (which of course have the same
problem).


(..CUT..)
[1227774166.211254] [008.0] [pid=25062] ** Timed Event ** Type: 12, Run Time: Thu Nov 27 09:22:38 2008
[1227774166.211266] [008.0] [pid=25062] ** Host Check Event ==> Host: '<<MY-WINDOWS-EXAMPLE-SERVER>>', Options: 0, Latency: 8.211000 sec
[1227774166.211283] [016.0] [pid=25062] Attempting to run scheduled check of host '<<MY-WINDOWS-EXAMPLE-SERVER>>': check options=0, latency=8.211000
[1227774166.211296] [016.0] [pid=25062] ** Running async check of host '<<MY-WINDOWS-EXAMPLE-SERVER>>'...
[1227774166.211325] [016.0] [pid=25062] Checking host '<<MY-WINDOWS-EXAMPLE-SERVER>>'...
[1227774166.211434] [016.1] [pid=25062] Check result output will be written to '/tmp/checkNmYAPl' (fd=7)
[1227774166.229697] [008.1] [pid=25062] ** Event Check Loop
[1227774166.229755] [008.1] [pid=25062] Next High Priority Event Time: Thu Nov 27 09:22:47 2008
[1227774166.229771] [008.1] [pid=25062] Next Low Priority Event Time: Thu Nov 27 09:22:38 2008
[1227774166.229781] [008.1] [pid=25062] Current/Max Service Checks: 0/80
[1227774166.229794] [008.1] [pid=25062] Running event...
[1227774166.229808] [008.0] [pid=25062] ** Timed Event ** Type: 12, Run Time: Thu Nov 27 09:22:38 2008
(..CUT..)
[1227774186.288210] [016.1] [pid=6021] Checking host '<<MY-WINDOWS-EXAMPLE-SERVER>>' for flapping...
(..CUT..)
[1227774186.433737] [016.1] [pid=6021] Checking service 'DISK-SPACE' on host '<<MY-WINDOWS-EXAMPLE-SERVER>>' for flapping...
[1227774186.433749] [016.1] [pid=6021] Service is not flapping (0.00% state change).
[1227774186.433894] [016.1] [pid=6021] Checking service 'IMAP' on host '<<MY-WINDOWS-EXAMPLE-SERVER>>' for flapping...
[1227774186.433905] [016.1] [pid=6021] Service is not flapping (0.00% state change).
[1227774186.434036] [016.1] [pid=6021] Checking service 'MEMORY' on host '<<MY-WINDOWS-EXAMPLE-SERVER>>' for flapping...
[1227774186.434047] [016.1] [pid=6021] Service is not flapping (0.00% state change).
[1227774186.434179] [016.1] [pid=6021] Checking service 'PING' on host '<<MY-WINDOWS-EXAMPLE-SERVER>>' for flapping...
[1227774186.434190] [016.1] [pid=6021] Service is not flapping (0.00% state change).
[1227774186.434323] [016.1] [pid=6021] Checking service 'POP3' on host '<<MY-WINDOWS-EXAMPLE-SERVER>>' for flapping...
[1227774186.434334] [016.1] [pid=6021] Service is not flapping (0.00% state change).
[1227774186.434464] [016.1] [pid=6021] Checking service 'SMTP' on host '<<MY-WINDOWS-EXAMPLE-SERVER>>' for flapping...
[1227774186.434475] [016.1] [pid=6021] Service is not flapping (0.00% state change).
(..CUT..)
[1227774190.773319] [016.1] [pid=6021] Handling check result for host '<<MY-WINDOWS-EXAMPLE-SERVER>>'...
[1227774190.773330] [016.1] [pid=6021] ** Handling async check result for host '<<MY-WINDOWS-EXAMPLE-SERVER>>'...
[1227774190.773345] [016.1] [pid=6021] HOST: <<MY-WINDOWS-EXAMPLE-SERVER>>, ATTEMPT=1/2, CHECK TYPE=ACTIVE, STATE TYPE=HARD, OLD STATE=0, NEW STATE=0
[1227774190.773372] [016.1] [pid=6021] Host was UP.
[1227774190.773382] [016.1] [pid=6021] Host is still UP.
[1227774190.773392] [016.1] [pid=6021] Pre-handle_host_state() Host: <<MY-WINDOWS-EXAMPLE-SERVER>>, Attempt=1/2, Type=HARD, Final State=0
[1227774190.773414] [016.1] [pid=6021] Post-handle_host_state() Host: <<MY-WINDOWS-EXAMPLE-SERVER>>, Attempt=1/2, Type=HARD, Final State=0
[1227774190.773430] [016.1] [pid=6021] Checking host '<<MY-WINDOWS-EXAMPLE-SERVER>>' for flapping...
[1227774190.773452] [016.1] [pid=6021] Rescheduling next check of host at Thu Nov 27 09:23:15 2008
[1227774190.773481] [016.0] [pid=6021] Scheduling a non-forced, active check of host '<<MY-WINDOWS-EXAMPLE-SERVER>>' @ Thu Nov 27 09:23:15 2008
[1227774190.773500] [016.1] [pid=6021] ** Async check result for host '<<MY-WINDOWS-EXAMPLE-SERVER>>' handled: new state=0
[1227774190.773524] [016.1] [pid=6021] Deleted check result file '/usr/local/nagios/var/spool/checkresults/ctdW9xA'
(..CUT..)
[1227774197.619431] [008.0] [pid=6021] ** Timed Event ** Type: 12, Run Time: Thu Nov 27 09:23:08 2008
[1227774197.619443] [008.0] [pid=6021] ** Host Check Event ==> Host: '<<MY-WINDOWS-EXAMPLE-SERVER>>', Options: 0, Latency: 9.619000 sec
[1227774197.619460] [016.0] [pid=6021] Attempting to run scheduled check of host '<<MY-WINDOWS-EXAMPLE-SERVER>>': check options=0, latency=9.619000
[1227774197.619473] [016.0] [pid=6021] ** Running async check of host '<<MY-WINDOWS-EXAMPLE-SERVER>>'...
[1227774197.619504] [016.0] [pid=6021] Checking host '<<MY-WINDOWS-EXAMPLE-SERVER>>'...
[1227774197.619601] [016.1] [pid=6021] Check result output will be written to '/tmp/checkICrgKX' (fd=7)
[1227774197.636440] [008.1] [pid=6021] ** Event Check Loop
[1227774197.636524] [008.1] [pid=6021] Next High Priority Event Time: Thu Nov 27 09:23:18 2008
[1227774197.636544] [008.1] [pid=6021] Next Low Priority Event Time: Thu Nov 27 09:23:08 2008
[1227774197.636558] [008.1] [pid=6021] Current/Max Service Checks: 0/80
[1227774197.636573] [008.1] [pid=6021] Running event...
(..CUT..)
[1227774198.041822] [016.1] [pid=6021] Handling check result for host '<<MY-WINDOWS-EXAMPLE-SERVER>>'...
[1227774198.041833] [016.1] [pid=6021] ** Handling async check result for host '<<MY-WINDOWS-EXAMPLE-SERVER>>'...
[1227774198.041851] [016.1] [pid=6021] HOST: <<MY-WINDOWS-EXAMPLE-SERVER>>, ATTEMPT=1/2, CHECK TYPE=ACTIVE, STATE TYPE=HARD, OLD STATE=0, NEW STATE=0
[1227774198.041863] [016.1] [pid=6021] Host was UP.
[1227774198.041872] [016.1] [pid=6021] Host is still UP.
[1227774198.041881] [016.1] [pid=6021] Pre-handle_host_state() Host: <<MY-WINDOWS-EXAMPLE-SERVER>>, Attempt=1/2, Type=HARD, Final State=0
[1227774198.041893] [016.1] [pid=6021] Post-handle_host_state() Host: <<MY-WINDOWS-EXAMPLE-SERVER>>, Attempt=1/2, Type=HARD, Final State=0
[1227774198.041904] [016.1] [pid=6021] Checking host '<<MY-WINDOWS-EXAMPLE-SERVER>>' for flapping...
[1227774198.041933] [016.1] [pid=6021] Rescheduling next check of host at Thu Nov 27 09:23:23 2008
[1227774198.041955] [016.0] [pid=6021] Scheduling a non-forced, active check of host '<<MY-WINDOWS-EXAMPLE-SERVER>>' @ Thu Nov 27 09:23:23 2008
[1227774198.042001] [016.1] [pid=6021] ** Async check result for host '<<MY-WINDOWS-EXAMPLE-SERVER>>' handled: new state=0
[1227774198.042025] [016.1] [pid=6021] Deleted check result file '/usr/local/nagios/var/spool/checkresults/cb
(..CUT..)
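
To see how far ahead each check is rescheduled, the "Rescheduling next check" lines can be compared with their own log timestamps. The following is a rough sketch, not a Nagios tool: the function name is made up, and it assumes the server's local time is UTC+1, since the log prints local times without a zone.

```python
import re
from datetime import datetime, timedelta, timezone

# Matches e.g. "[1227774190.773452] ... Rescheduling next check of host at Thu Nov 27 09:23:15 2008"
RESCHEDULE = re.compile(r"^\[(\d+)\.\d+\] .*Rescheduling next check of host at (.+)$")


def reschedule_delay(line, utc_offset_hours=1):
    """Seconds between the log entry and the rescheduled check time.

    utc_offset_hours is an assumption about the server's time zone.
    Returns None for lines that are not reschedule entries.
    """
    match = RESCHEDULE.match(line.strip())
    if match is None:
        return None
    logged = datetime.fromtimestamp(int(match.group(1)), tz=timezone.utc)
    local_zone = timezone(timedelta(hours=utc_offset_hours))
    target = datetime.strptime(match.group(2), "%a %b %d %H:%M:%S %Y")
    target = target.replace(tzinfo=local_zone)
    return (target - logged).total_seconds()


LINES = [
    "[1227774190.773452] [016.1] [pid=6021] Rescheduling next check of host at Thu Nov 27 09:23:15 2008",
    "[1227774198.041933] [016.1] [pid=6021] Rescheduling next check of host at Thu Nov 27 09:23:23 2008",
]
for entry in LINES:
    print(reschedule_delay(entry))
```

Under that time zone assumption, both entries above are rescheduled 5 seconds after the result was handled.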

Is that enough?
Any help?

Thanks!!!

Simon
Marc Powell
2008-11-27 15:53:38 UTC
Post by Simone Felici
Good morning.
By time_interval do you mean interval_length?
I did, sorry about that.
Post by Simone Felici
I've set it to "1". In this way I've set all checks in seconds,
because I need for certain services a retry interval of 30seconds.
Good to know.
Post by Simone Felici
########################################
# NAGIOS STATUS FILE
#
# THIS FILE IS AUTOMATICALLY GENERATED
# BY NAGIOS. DO NOT MODIFY THIS FILE!
########################################
<<cut>>
hoststatus {
host_name=<<MY-WINDOWS-EXAMPLE-SERVER>>
modified_attributes=3
check_command=check-host-alive
check_period=24hx7
notification_period=24hx7
check_interval=5.000000
Nagios is configured to check this host every 5 seconds (5 x
interval_length). If you meant this to be 5 minutes, the value should
be 300.
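
The arithmetic can be sketched in a couple of lines. The helper name is hypothetical; the only facts assumed are that the effective period is check_interval multiplied by interval_length, and that interval_length defaults to 60 seconds when it is not changed in nagios.cfg.

```python
def effective_interval_seconds(check_interval, interval_length=60):
    """Time between scheduled checks, in seconds.

    check_interval is expressed in units of interval_length seconds.
    """
    return check_interval * interval_length


# With the default interval_length of 60, "5" means five minutes.
print(effective_interval_seconds(5))
# With interval_length=1, as in this thread, "5" means five seconds.
print(effective_interval_seconds(5, interval_length=1))
# To get five minutes with interval_length=1, use 300.
print(effective_interval_seconds(300, interval_length=1))
```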

-
Marc
Simone Felici
2008-11-27 17:00:07 UTC
Thanks Marc, but let me understand... Should Nagios perform host checks only if needed,
i.e. when a service goes critical or changes state?
Or are both true?

Simon
Marc Powell
2008-11-27 22:33:43 UTC
Post by Simone Felici
Thanks Marc, but let me understand... Should Nagios perform host checks only if needed,
i.e. when a service goes critical or changes state?
Or are both true?
In your configuration, nagios will do both active and on-demand
checks. Nagios will normally do on-demand checks just by specifying
'active_checks_enabled 1'. When you also specified 'check_interval 5',
you told nagios that it should perform regularly timed checks of the
host as well. If you don't want regular checks, leave out the
check_interval directive entirely.

I'm not sure where that's coming from since it wasn't in your original
posting of the host definition or template. Was it ever set that way
or currently set like that and left out of your original posting?

--
Marc
Simone Felici
2008-11-28 14:19:57 UTC
In my first post I pasted the configuration from hosts.cfg and hosts_templates.cfg,
i.e. the original configuration.
In the second post I pasted the content of the status file. NOW I have NO IDEA where the check_interval directive is taken from.
Do you have any idea?
Should I set it to "x" seconds to be sure the active check will be done exactly every "x" seconds (for example 300)?
...except, of course, for on-demand checks.
I ask because I have no idea where Nagios takes the check_interval directive from in the configuration. Maybe setting it to "0" does
the trick? :)

Thanks

simon
Simone Felici
2008-12-03 08:48:52 UTC
Good morning Marc,

I found the solution.
It seems check_interval CANNOT be omitted from the .cfg file, or it will be set automatically.
I have no idea where it takes the default "5 secs" value from.
Setting it to "0" keeps active checks enabled while preventing regular checks from being scheduled.
On-demand checks are done without problems.
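
To find which hosts would silently pick up a default, a small script can flag host definitions that never set check_interval, either directly or through their template. This is a simplified sketch, not a full Nagios config parser: the function names and sample config are invented for illustration, and it assumes each host uses at most one template via "use".

```python
import re

HOST_BLOCK = re.compile(r"define\s+host\s*\{(.*?)\}", re.DOTALL)


def parse_host_blocks(text):
    """Return one dict of directives per 'define host { ... }' block."""
    blocks = []
    for match in HOST_BLOCK.finditer(text):
        directives = {}
        for raw in match.group(1).splitlines():
            line = raw.split(";", 1)[0].strip()  # drop ';' comments
            if not line or line.startswith("#"):
                continue
            parts = line.split(None, 1)
            directives[parts[0]] = parts[1].strip() if len(parts) > 1 else ""
        blocks.append(directives)
    return blocks


def resolve(directives, templates, depth=0):
    """Merge in directives inherited through a single 'use' template."""
    merged = {}
    parent = templates.get(directives.get("use", ""))
    if parent is not None and depth < 10:  # guard against 'use' loops
        merged.update(resolve(parent, templates, depth + 1))
    merged.update(directives)
    return merged


def hosts_without_check_interval(text):
    """Names of registered hosts with no explicit check_interval anywhere."""
    blocks = parse_host_blocks(text)
    templates = {b["name"]: b for b in blocks if "name" in b}
    missing = []
    for block in blocks:
        if block.get("register") == "0":  # skip templates themselves
            continue
        if "check_interval" not in resolve(block, templates):
            missing.append(block.get("host_name", "?"))
    return missing


CONFIG = """
define host {
    name          host_alarm_A
    check_command check-host-alive
    register      0
}
define host {
    host_name     example-host
    use           host_alarm_A
}
"""
FIXED = CONFIG.replace("register      0", "check_interval 0\n    register      0")

print(hosts_without_check_interval(CONFIG))
print(hosts_without_check_interval(FIXED))
```

The first call reports the example host, the second reports nothing once the template sets check_interval to 0.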

So, problem solved!

Thanks anyway for pointing me to the right place!

Simon
