Discussion:
[Nagios-users] Patches for inclusion to Nagios 4
Ton Voon
2013-08-06 17:16:14 UTC
Permalink
Hi!

We've published a list of patches for Nagios 4: http://www.opsview.com/whats-new/blog/opsview-patches-nagios-4

We'd be happy if you could review if these are acceptable for future inclusion or if anyone else finds them useful.

Ton
Daniel Wittenberg
2013-08-07 02:44:49 UTC
Permalink
Looks like some good additions. Since I'm in the middle of doing some in-depth performance tuning/comparison between 3.2.3 and 4.0 I might to throw a couple of these on and see how things change.

Thanks!
Dan
Post by Ton Voon
Hi!
We've published a list of patches for Nagios 4: http://www.opsview.com/whats-new/blog/opsview-patches-nagios-4
We'd be happy if you could review if these are acceptable for future inclusion or if anyone else finds them useful.
Ton
------------------------------------------------------------------------------
Get 100% visibility into Java/.NET code with AppDynamics Lite!
It's a free troubleshooting tool designed for production.
Get down to code-level detail for bottlenecks, with <2% overhead.
Download for free and get started troubleshooting in minutes.
http://pubads.g.doubleclick.net/gampad/clk?id=48897031&iu=/4140/ostg.clktrk
_______________________________________________
Nagios-devel mailing list
https://lists.sourceforge.net/lists/listinfo/nagios-devel
Andreas Ericsson
2013-08-07 07:51:37 UTC
Permalink
Post by Ton Voon
Hi!
http://www.opsview.com/whats-new/blog/opsview-patches-nagios-4
We'd be happy if you could review if these are acceptable for future
inclusion or if anyone else finds them useful.
I'd like to get patches with commit messages and proper author and
signed-off-by info. Since we're using git for Nagios now, it'd go
a long way in making sure everyone gets credit for the work they've
done.

The patches also need to apply cleanly to the latest master.

You may want to clone the official Nagios repo, apply your patches on
top of it and then send me a pull request for github or some such.

git clone git://git.code.sf.net/p/nagios/nagios nagios-core

should get you the very latest. If you apply your patches on top of
'master' and make sure to always do "git pull --rebase" when you want
to get the latest and greatest you'll quickly see which patches either
have been applied or which no longer *can* be applied. Then you can
create a separate repository on github or some such and push the changes
there.


On a first inspection though;
* Don't comment out code. Bringing back dead code is what the VCS is
for, and keeping it around is just plain dumb. If it shouldn't be in
there, just remove it. In the same vein; Don't add commented-out code.

* Avoid C++ comments. I know they're supported in C99, which I'm rooting
for as the least version supported, but it's against the current style.

* Don't mix spaces and tabs for indentation, unless it's continuation-
indentation of a multi-line statement or condition. Stick to the style
in the file you're editing, and *look* at the patch before you send it
somewhere.

* Avoid comments saying things like "Opsview specific foobar" if you
want to have the patches included. If you *don't* want those patches
included, don't send them to me or point me to where they are. It takes
up a lot of time to remove crap like that, and I have no interest in
cleaning up after anyone else. I'm messy enough as it is on my own.

* Don't augment objects (such as hosts, services, commands) with new
variables. Doing so means Nagios 4.1, and I can't take patches like
that until Nagios 4 is at least released as stable. All objects have
an 'id' field which means you can look up any extra data you want in
O(1) time, provided you just initialize an array of size
num_objects.$desired_object_type_in_plural before we enter the event
loop.

* Make patches the most scalable you can. For the "check_time_period"
thing, you'd be far better off adding code to detect timeperiod changes,
notice which timeperiods are used to change commands and make a one-off
swap for all the affected commands as the desired timeperiod comes
either in or out of effect.

* Don't build on deprecated technology, such as external files for
commands and/or check results.
--
Andreas Ericsson ***@op5.se
OP5 AB www.op5.se
Tel: +46 8-230225 Fax: +46 8-230231

Considering the successes of the wars on alcohol, poverty, drugs and
terror, I think we should give some serious thought to declaring war
on peace.
Andreas Ericsson
2013-08-08 15:52:22 UTC
Permalink
Post by Andreas Ericsson
Post by Ton Voon
Hi!
http://www.opsview.com/whats-new/blog/opsview-patches-nagios-4
We'd be happy if you could review if these are acceptable for
future inclusion or if anyone else finds them useful.
I'd like to get patches with commit messages and proper author and
signed-off-by info. Since we're using git for Nagios now, it'd go a
long way in making sure everyone gets credit for the work they've
done.
The patches also need to apply cleanly to the latest master.
You may want to clone the official Nagios repo, apply your patches
on top of it and then send me a pull request for github or some
such.
git clone git://git.code.sf.net/p/nagios/nagios nagios-core
should get you the very latest. If you apply your patches on top
of 'master' and make sure to always do "git pull --rebase" when you
want to get the latest and greatest you'll quickly see which
patches either have been applied or which no longer *can* be
applied. Then you can create a separate repository on github or
some such and push the changes there.
OK, we'll convert them into git changes based off master.
However....
I'd like an assurance that the changes will be merged before we
promise to convert. Style changes are fine, won't take much time and
will not require retesting, but if we need to refactor object changes
or make larger changes to logic, this will require retesting on our
side. I'd like a reassurance that the time invested will result in a
merge upstream, otherwise we're just wasting time for all of us.
For instance, your assistance in the "environment macros per-command"
was greatly appreciated and we've coded to the design agreed in the
email conversations, but it hasn't been merged yet.
It has, but it hasn't. The original patch no longer applies, due to
extensive remodeling of the worker code. I still have it hanging
around though, and I'm fairly inclined towards using it for services
and hosts as well, so that environment macros can be set on a service
level as well as on a command level. I've also been investigating the
chances of doing this without resorting to setenv() (in essence,
building a one-off block of the load-time environment macros, and then
extending that whenever we hit an environment variable).

But yea, I've got that one already.
So I'd like to reverse the question and ask "which changes are most
likely to be accepted, based on the amount of changes required" and
we'll work through that list in order.
Thread-safe calls
It's harmless, and I'd apply it straight away if I had a sensible
commit message for it, but I don't know why you need this so I can't
really write one. While we're on the topic; Please write commit
messages in imperative form and present tense, as if you're giving
orders to the code for how it should change. Also give a short
rationale for why it should change that way. The rules about not
commenting out code you want to get rid of still applies though. It
just looks terribly hackish when patches meant for upstream contains
things like that.


Slice services within hosts
So long as it's configurable from cgi.cfg and the default stays the
same as it is today I'll apply it immediately. It has no impact on
loadable modules or other headaches, so that's a nobrainer, really.


Check command by time period
I feel this is somewhat lacking in efficiency and flexibility, and
a much cleaner solution would be to add a filtering functionality to
NERD so that checkresults can be shipped off to a third-party addon
which can transform checkresults and plugin output as much as it
likes.
Failing that (which would be enormously cool but also a lot of work),
I think it would most likely be best off as a separate module, with
custom variables or separate configuration to support it. Supporting
patches to run events when a timeperiod becomes active or inactive
would still be welcome, obviously.


Escalation via notification levels
This is best off written as a module, using custom variables to
configure it. If core support is required to block notifications to a
particular contact, then such patches will naturally be accepted. The
normal NEBCALLBACK_CANCEL return code signalling should work just fine
for that.


Synchronising state data
Pretty invasive for quite a small benefit, and with enormous
complexities to deal with to make this work properly. How do you
handle reading the synced file while checks are in-flight and
awaiting being returned to the mothership? They may have been
spawned as a result of a bad checkresult on one node, but if the
sync status returns them to OK and the in-flight result sets them
as bad again, would that mean starting over on the retry-attempts?
I see nothing in the patch that handles situations like that, so
I'm forced to believe you haven't considered it. Considering the
fact that full reloads now take less than a second for your
recommended 250 hosts per system, I'm inclined to drop this patch
in favour of the regular retention data file unless you can
persuade me otherwise.


Dependency failure states
The patch is a bit hackish in that it internally submits a passive
check to make sure everything gets updated. That's not entirely
accurate though, and will cause statistical errors.
Rewrite it to just update the states as necessary and make sure
notifications and whatnot are sent if they should be.
That will also fix the problem of bugging out performance data
handling routines that expect to always get the same kind of data
from each check every time.
It will have to be configurable though. I'm thinking
"service_dependency_exec_fail_state=(warning,critical,unknown)"
and "service_dependency_exec_fail_output=Dependency failure".
I'd be all for changing the default, but for now the exec_fail_state
thing will have to be -1, meaning "leave it as-is", and I'll make
it default to UNKNOWN after 4.0 is released.
The same should go for service parents, btw.


Passive notifications respecting renotification interval
I'm inclined to making this default. It's not documented anywhere
which of those options take precedence. If notification_interval
is 0, then every notification should be sent for is_volatile
services, and if it's anything else we should respect the interval
no matter how many notifications happen to arrive. Make those
changes and I'll apply the patch and call it a bugfix.


Reducing SQL queries part 1:
No. Just no. This is a perfect place where you can use an array sized
after num_objects.services *inside the module* and look up the latest
version of the processed commandline there. Free it when you update it
and you'll be sure to always have it handy. Anyone who doesn't use
NDOUtils will otherwise be penalized for its shortcomings.


Reducing SQL queries part 2:
As above, but num_objects.hosts


Conditional debugging:
I'm more inclined towards making the debug-logging function a C99-style
variadic macro (or a gcc style one), while falling back to using the
current structure of a lot of unnecessary calls on too old compilers.
All that iffing should really be kept at arms length from the coder,
or debug-logging won't be used. I'm thinking
dbg(DEBUGL_CHECKS, 0, "** Running async check of foo ... ");

so we get it short and sweet while we're at it.


FIFO processing of passive checkresults
I have several objections to this patch.
First off; You're sorting on alphanumerical name only, instead of mtime
first and then name in case of a tiebreak. That's just plain dumb and
has to be amended if this is to be anywhere near useful.
Secondly; This doesn't scale. I spent the better part of last year
removing everything that touches the filesystem on a scheduled and
frequent basis for those precise reasons. Putting crap like that back in
is not something I'll do without a very good reason.
Thirdly; It builds on obsolete code which is designated for oblivion,
so the code would be (relatively) shortlived anyway.
Fourthly; *Nothing* can be built on top of such a patch that doesn't
absolutely reek of hackishness and bad design.
A much more useful patch would be adding support for external commands
to the query handler, which doesn't have the limitations of the named
pipe (and can give responses to your requests so you know if they were
accepted or not) and use a secondary program to ship in passive results
any old way you want to the clean, nice and responsive interface which
is meant to be extended for things exactly like this.


Tolerant time jumps
I'm unsure how useful this is. How often does a system clock go
backwards except for daylight savings time? I also worry slightly
about the consequences. OTOH, most of the worry comes as a result from
the very lax default for time_change_threshold (900 seconds). A time
jump of 899 seconds wouldn't trigger a rescheduling, but would wreak
serious havoc with the checks.


Listing related contact groups
The patch needs to be changed so that the (possibly very long) list
of contactgroups isn't computed until it's requested instead of
before we even start macro substitution. Otherwise we'll be
computing that macro for each and every host and service check we
execute, which stupid beyond belief. Particularly considering the
fact that object config can't change during runtime (yet), so the
list could just as well be cached if one would so desire.
For bonus points, also add CONTACTLIST, which fetches contacts from
both the object directly and the contactgroups assigned.


Environment macros per-command
I wish you'd stop talking BS on that site of yours. Removal of
environment macros has very little to do with the speedups in Nagios 4.
Those improvements are the result of good design and proper
implementation of good design.
Now to technical issues; I have the patch (as I've said) and I do
intend to push it as soon as I've had time to polish it and bring it
forward to the current "master". It's quite invasive though, and
I've been trying to figure out what it is about it that feels wrong.
Then I've also forgotten about it, so this reminder was pretty good.
Thanks.


Allowing semi-colons in commands
Well, it's a neat idea, but the patch is so incredibly ugly that I
just can't bring myself to accept it. If you did something to keep
state and allowed semicolons in strings or as escaped creatures
(with a backslash, pretty please; There's enough home-cooked
escaping in Nagios already), I'd be willing to accept it.


Tilde symbol in commands
Absolutely out of the question. This patch deliberately breaks a
library to exploit its caller's fallback, which happens to suit your
particular needs in this case. No. Nonono. You can just configure
your way around it by using

command_line /bin/sh -c '$USER1$/plugin --arg=lala'

for the command in question, or implement it separately but making
it configurable as "expand_tilde_in_commands=<bool>" in nagios.cfg and
doing the replacement manually. It's not that hard, really.


Fix max_concurrent_checks decrementing
Obviously correct, although the patch is incomplete. In case we fail to
launch a job, it will have no reference in the scheduling queue but will
be marked as "is_executing" which will cause it to not even get
freshness checked. I'll amend the patch with everything it needs and
apply. Good catch though.


Extra external commands
This looks clean, correct and useful. I'll need a commit message, an
author and a signed-off-by before I can accept it though.


Thanks
--
Andreas Ericsson ***@op5.se
OP5 AB www.op5.se
Tel: +46 8-230225 Fax: +46 8-230231

Considering the successes of the wars on alcohol, poverty, drugs and
terror, I think we should give some serious thought to declaring war
on peace.
bharat Varandani
2013-08-21 13:37:15 UTC
Permalink
Dear All,

        Can you tell me how i can use and see printer consumables status.

    Please response as soon as possible..


Thanks & Regards


Bharat Varandani



________________________________
From: Andreas Ericsson <***@op5.se>
To: Ton Voon <***@opsview.com>
Cc: Nagios Developers List <nagios-***@lists.sourceforge.net>; Nagios Users List <nagios-***@lists.sourceforge.net>
Sent: Thursday, August 8, 2013 7:52 PM
Subject: Re: [Nagios-users] [Nagios-devel] Patches for inclusion to Nagios 4
Post by Andreas Ericsson
Post by Ton Voon
Hi!
http://www.opsview.com/whats-new/blog/opsview-patches-nagios-4
We'd be happy if you could review if these are acceptable for
future inclusion or if anyone else finds them useful.
I'd like to get patches with commit messages and proper author and
signed-off-by info. Since we're using git for Nagios now, it'd go a
long way in making sure everyone gets credit for the work they've
done.
The patches also need to apply cleanly to the latest master.
You may want to clone the official Nagios repo, apply your patches
on top of it and then send me a pull request for github or some
such.
git clone git://git.code.sf.net/p/nagios/nagios nagios-core
should get you the very latest. If you apply your patches on top
of 'master' and make sure to always do "git pull --rebase" when you
want to get the latest and greatest you'll quickly see which
patches either have been applied or which no longer *can* be
applied. Then you can create a separate repository on github or
some such and push the changes there.
OK, we'll convert them into git changes based off master.
However....
I'd like an assurance that the changes will be merged before we
promise to convert. Style changes are fine, won't take much time and
will not require retesting, but if we need to refactor object changes
or make larger changes to logic, this will require retesting on our
side. I'd like a reassurance that the time invested will result in a
merge upstream, otherwise we're just wasting time for all of us.
For instance, your assistance in the "environment macros per-command"
was greatly appreciated and we've coded to the design agreed in the
email conversations, but it hasn't been merged yet.
It has, but it hasn't. The original patch no longer applies, due to
extensive remodeling of the worker code. I still have it hanging
around though, and I'm fairly inclined towards using it for services
and hosts as well, so that environment macros can be set on a service
level as well as on a command level. I've also been investigating the
chances of doing this without resorting to setenv() (in essence,
building a one-off block of the load-time environment macros, and then
extending that whenever we hit an environment variable).

But yea, I've got that one already.
So I'd like to reverse the question and ask "which changes are most
likely to be accepted, based on the amount of changes required" and
we'll work through that list in order.
Thread-safe calls
It's harmless, and I'd apply it straight away if I had a sensible
commit message for it, but I don't know why you need this so I can't
really write one. While we're on the topic; Please write commit
messages in imperative form and present tense, as if you're giving
orders to the code for how it should change. Also give a short
rationale for why it should change that way. The rules about not
commenting out code you want to get rid of still applies though. It
just looks terribly hackish when patches meant for upstream contains
things like that.


Slice services within hosts
So long as it's configurable from cgi.cfg and the default stays the
same as it is today I'll apply it immediately. It has no impact on
loadable modules or other headaches, so that's a nobrainer, really.


Check command by time period
I feel this is somewhat lacking in efficiency and flexibility, and
a much cleaner solution would be to add a filtering functionality to
NERD so that checkresults can be shipped off to a third-party addon
which can transform checkresults and plugin output as much as it
likes.
Failing that (which would be enormously cool but also a lot of work),
I think it would most likely be best off as a separate module, with
custom variables or separate configuration to support it. Supporting
patches to run events when a timeperiod becomes active or inactive
would still be welcome, obviously.


Escalation via notification levels
This is best off written as a module, using custom variables to
configure it. If core support is required to block notifications to a
particular contact, then such patches will naturally be accepted. The
normal NEBCALLBACK_CANCEL return code signalling should work just fine
for that.


Synchronising state data
Pretty invasive for quite a small benefit, and with enormous
complexities to deal with to make this work properly. How do you
handle reading the synced file while checks are in-flight and
awaiting being returned to the mothership? They may have been
spawned as a result of a bad checkresult on one node, but if the
sync status returns them to OK and the in-flight result sets them
as bad again, would that mean starting over on the retry-attempts?
I see nothing in the patch that handles situations like that, so
I'm forced to believe you haven't considered it. Considering the
fact that full reloads now take less than a second for your
recommended 250 hosts per system, I'm inclined to drop this patch
in favour of the regular retention data file unless you can
persuade me otherwise.


Dependency failure states
The patch is a bit hackish in that it internally submits a passive
check to make sure everything gets updated. That's not entirely
accurate though, and will cause statistical errors.
Rewrite it to just update the states as necessary and make sure
notifications and whatnot are sent if they should be.
That will also fix the problem of bugging out performance data
handling routines that expect to always get the same kind of data
from each check every time.
It will have to be configurable though. I'm thinking
"service_dependency_exec_fail_state=(warning,critical,unknown)"
and "service_dependency_exec_fail_output=Dependency failure".
I'd be all for changing the default, but for now the exec_fail_state
thing will have to be -1, meaning "leave it as-is", and I'll make
it default to UNKNOWN after 4.0 is released.
The same should go for service parents, btw.


Passive notifications respecting renotification interval
I'm inclined to making this default. It's not documented anywhere
which of those options take precedence. If notification_interval
is 0, then every notification should be sent for is_volatile
services, and if it's anything else we should respect the interval
no matter how many notifications happen to arrive. Make those
changes and I'll apply the patch and call it a bugfix.


Reducing SQL queries part 1:
No. Just no. This is a perfect place where you can use an array sized
after num_objects.services *inside the module* and look up the latest
version of the processed commandline there. Free it when you update it
and you'll be sure to always have it handy. Anyone who doesn't use
NDOUtils will otherwise be penalized for its shortcomings.


Reducing SQL queries part 2:
As above, but num_objects.hosts


Conditional debugging:
I'm more inclined towards making the debug-logging function a C99-style
variadic macro (or a gcc style one), while falling back to using the
current structure of a lot of unnecessary calls on too old compilers.
All that iffing should really be kept at arms length from the coder,
or debug-logging won't be used. I'm thinking
  dbg(DEBUGL_CHECKS, 0, "** Running async check of foo ... ");

so we get it short and sweet while we're at it.


FIFO processing of passive checkresults
I have several objections to this patch.
First off; You're sorting on alphanumerical name only, instead of mtime
first and then name in case of a tiebreak. That's just plain dumb and
has to be amended if this is to be anywhere near useful.
Secondly; This doesn't scale. I spent the better part of last year
removing everything that touches the filesystem on a scheduled and
frequent basis for those precise reasons. Putting crap like that back in
is not something I'll do without a very good reason.
Thirdly; It builds on obsolete code which is designated for oblivion,
so the code would be (relatively) shortlived anyway.
Fourthly; *Nothing* can be built on top of such a patch that doesn't
absolutely reek of hackishness and bad design.
A much more useful patch would be adding support for external commands
to the query handler, which doesn't have the limitations of the named
pipe (and can give responses to your requests so you know if they were
accepted or not) and use a secondary program to ship in passive results
any old way you want to the clean, nice and responsive interface which
is meant to be extended for things exactly like this.


Tolerant time jumps
I'm unsure how useful this is. How often does a system clock go
backwards except for daylight savings time? I also worry slightly
about the consequences. OTOH, most of the worry comes as a result from
the very lax default for time_change_threshold (900 seconds). A time
jump of 899 seconds wouldn't trigger a rescheduling, but would wreak
serious havoc with the checks.


Listing related contact groups
The patch needs to be changed so that the (possibly very long) list
of contactgroups isn't computed until it's requested instead of
before we even start macro substitution. Otherwise we'll be
computing that macro for each and every host and service check we
execute, which stupid beyond belief. Particularly considering the
fact that object config can't change during runtime (yet), so the
list could just as well be cached if one would so desire.
For bonus points, also add CONTACTLIST, which fetches contacts from
both the object directly and the contactgroups assigned.


Environment macros per-command
I wish you'd stop talking BS on that site of yours. Removal of
environment macros has very little to do with the speedups in Nagios 4.
Those improvements are the result of good design and proper
implementation of good design.
Now to technical issues; I have the patch (as I've said) and I do
intend to push it as soon as I've had time to polish it and bring it
forward to the current "master". It's quite invasive though, and
I've been trying to figure out what it is about it that feels wrong.
Then I've also forgotten about it, so this reminder was pretty good.
Thanks.


Allowing semi-colons in commands
Well, it's a neat idea, but the patch is so incredibly ugly that I
just can't bring myself to accept it. If you did something to keep
state and allowed semicolons in strings or as escaped creatures
(with a backslash, pretty please; There's enough home-cooked
escaping in Nagios already), I'd be willing to accept it.


Tilde symbol in commands
Absolutely out of the question. This patch deliberately breaks a
library to exploit its caller's fallback, which happens to suit your
particular needs in this case. No. Nonono. You can just configure
your way around it by using

    command_line  /bin/sh -c '$USER1$/plugin --arg=lala'

for the command in question, or implement it separately but making
it configurable as "expand_tilde_in_commands=<bool>" in nagios.cfg and
doing the replacement manually. It's not that hard, really.


Fix max_concurrent_checks decrementing
Obviously correct, although the patch is incomplete. In case we fail to
launch a job, it will have no reference in the scheduling queue but will
be marked as "is_executing" which will cause it to not even get
freshness checked. I'll amend the patch with everything it needs and
apply. Good catch though.


Extra external commands
This looks clean, correct and useful. I'll need a commit message, an
author and a signed-off-by before I can accept it though.


Thanks
--
Andreas Ericsson                  ***@op5.se
OP5 AB                            www.op5.se
Tel: +46 8-230225                  Fax: +46 8-230231

Considering the successes of the wars on alcohol, poverty, drugs and
terror, I think we should give some serious thought to declaring war
on peace.

------------------------------------------------------------------------------
Get 100% visibility into Java/.NET code with AppDynamics Lite!
It's a free troubleshooting tool designed for production.
Get down to code-level detail for bottlenecks, with <2% overhead.
Download for free and get started troubleshooting in minutes.
http://pubads.g.doubleclick.net/gampad/clk?id=48897031&iu=/4140/ostg.clktrk
_______________________________________________
Nagios-users mailing list
Nagios-***@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/nagios-users
::: Please include Nagios version, plugin version (-v) and OS when reporting any issue.
::: Messages without supporting info will risk being sent to /dev/null
Loading...