Discussion:
Ring topology parent/child relation Nagios
Mihai Tanasescu
2008-05-12 10:10:08 UTC
Hello,



I have some problems defining the parent/child relationships to reflect
changes and monitoring on the map.


My topology is something like this:


Nagios machine --- Router A ---- Router B


Router B --- Router C --- Router D --- Router E ---Router F --- Router B
(ring closing itself)

but on the Router B ring I can't define parent relationships in a
circular way because nagios refuses to start when it detects this.

(I defined on Router C parent = Router B and Router D and so on for each
of them)

How should I configure them?
Any ideas or help?



Regards,
Mihai
Hugo van der Kooij
2008-05-12 12:23:36 UTC
Mihai Tanasescu wrote:

| I have some problems defining the parent/child relationships to reflect
| changes and monitoring on the map.
|
| My topology is something like this:
|
| Nagios machine --- Router A ---- Router B
|
| Router B --- Router C --- Router D --- Router E ---Router F --- Router B
| (ring closing itself)
|
| but on the Router B ring I can't define parent relationships in a
| circular way because nagios refuses to start when it detects this.

The whole concept of a ring setup is that a single disaster can not
cause a network failure. For this setup I would only follow the ring
halfway.

So you get 2 chains:

Nagios --> A --> B --> C --> D
Nagios --> A --> B --> F --> E

Make sure you monitor each neighbor on each ring router to make sure the
ring is working as expected.

If you use dynamic routing you might want to monitor route changes
relevant for the proper operation of your ring setup.
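
A minimal sketch of the halfway split, in Python purely for illustration. The function names and host names are made up here; only the `define host`, `host_name` and `parents` directives are real Nagios syntax, and the stanzas leave out everything else a host needs (address, template and so on).

```python
def split_ring(hub, ring):
    """Map each ring router to a single parent by walking the ring from both ends.

    `hub` is the router where the ring closes (Router B); `ring` lists the
    other routers in ring order. The first half hangs off the hub in one
    direction, the second half in the other, so no parent loop is created.
    """
    half = (len(ring) + 1) // 2
    parents = {}
    prev = hub
    for name in ring[:half]:            # B -> C -> D
        parents[name] = prev
        prev = name
    prev = hub
    for name in reversed(ring[half:]):  # B -> F -> E
        parents[name] = prev
        prev = name
    return parents


def host_stanzas(parents):
    """Render minimal host definitions (address, templates etc. omitted)."""
    return "\n".join(
        "define host {\n    host_name  %s\n    parents    %s\n}" % (h, p)
        for h, p in parents.items()
    )


ring_parents = split_ring("RouterB", ["RouterC", "RouterD", "RouterE", "RouterF"])
print(host_stanzas(ring_parents))
```

Router B itself would still get Router A as its parent; that part is not a ring and needs no special treatment.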

Hugo.

- --
***@vanderkooij.org http://hugo.vanderkooij.org/
PGP/GPG? Use: http://hugo.vanderkooij.org/0x58F19981.asc

A: Yes.
Q: Are you sure?
A: Because it reverses the logical flow of conversation.
Q: Why is top posting frowned upon?
Bored? Click on http://spamornot.org/ and rate those images.
Mihai Tanasescu
2008-05-12 19:41:08 UTC
Post by Hugo van der Kooij
| I have some problems defining the parent/child relationships to reflect
| changes and monitoring on the map.
|
|
| Nagios machine --- Router A ---- Router B
|
| Router B --- Router C --- Router D --- Router E ---Router F --- Router B
| (ring closing itself)
|
| but on the Router B ring I can't define parent relationships in a
| circular way because nagios refuses to start when it detects this.
The whole concept of a ring setup is that a single disaster can not
cause a network failure. For this setup I would only follow the ring
halfway.
Nagios --> A --> B --> C --> D
Nagios --> A --> B --> F --> E
Make sure you monitor each neighbor on each ring router to make sure the
ring is working as expected.
If you use dynamic routing you might want to monitor route changes
relevant for the proper operation of your ring setup.
Hugo.
Hello Hugo,


Thanks for the tip but I have one more question which refers to my
current problem in fact. (I configured sms sending for down events).

In case for example router B loses both its links to C and F (2
fibercuts on the network), then I will be getting SMSes stating that
C,D,F,E are down.
B in fact will not be down as a system but will be unable to reach the
others.


How could I solve this and avoid sending misleading sms messages
regarding down events?


Thanks,
Mihai
Hugo van der Kooij
2008-05-12 20:35:06 UTC
Mihai Tanasescu wrote:
| Mihai Tanasescu wrote:
|> | I have some problems defining the parent/child relationships to reflect
|> | changes and monitoring on the map.
|> |
|> | My topology is something like this:
|> |
|> | Nagios machine --- Router A ---- Router B
|> |
|> | Router B --- Router C --- Router D --- Router E ---Router F --- Router B
|> | (ring closing itself)
|> |
|> | but on the Router B ring I can't define parent relationships in a
|> | circular way because nagios refuses to start when it detects this.
|>
|> The whole concept of a ring setup is that a single disaster can not
|> cause a network failure. For this setup I would only follow the ring
|> halfway.
|>
|> So you get 2 chains:
|>
|> Nagios --> A --> B --> C --> D
|> Nagios --> A --> B --> F --> E
|>
|> Make sure you monitor each neighbor on each ring router to make sure the
|> ring is working as expected.
|>
|> If you use dynamic routing you might want to monitor route changes
|> relevant for the proper operation of your ring setup.

| Thanks for the tip but I have one more question which refers to my
| current problem in fact. (I configured sms sending for down events).
|
| In case for example router B loses both its links to C and F (2
| fibercuts on the network), then I will be getting SMSes stating that
| C,D,F,E are down.
| B in fact will not be down as a system but will be unable to reach the
| others.
|
| How could I solve this and avoid sending misleading sms messages
| regarding down events?

This problem should not exist.

Because if you cut the ring in 1 place all nodes can still be reached.
So no router will go down. If you cut it in 2 places you lose part of
the ring and only get alerts for the nodes directly on the other side of
the cuts from your perspective.

If you alert on unreachable as well then you get all the alerts you
tried to get rid of by introducing the parent relation in the first
place. So don't use them.

You need an additional means of detecting your first cut in the ring as
all routers can still be reached at that time and you will never know
you had a problem unless you alert on the actual link conditions.

Now getting the link condition to Nagios is something you need to work
out. Due to the lack of details it will be hard to help you there at the
moment. But consider the links to be the vital services of the host.
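
To make the down versus unreachable distinction concrete, here is a rough Python model of how the parent relation sorts failing hosts. It is a simplification for illustration, not Nagios code: the function name, the single-letter host names and the rule as written (a failing host counts as DOWN if it has no parents or at least one parent is UP, otherwise UNREACHABLE) are assumptions of this sketch, and it only works when the parent map has no loops.

```python
def classify(parents, failed):
    """Return UP / DOWN / UNREACHABLE per host for a loop-free parent map.

    `parents` maps host -> list of parent hosts (empty list for hosts next
    to the Nagios machine). `failed` is the set of hosts whose ping check
    fails.
    """
    state = {}

    def resolve(host):
        if host not in state:
            if host not in failed:
                state[host] = "UP"
            elif not parents[host] or any(resolve(p) == "UP" for p in parents[host]):
                state[host] = "DOWN"
            else:
                state[host] = "UNREACHABLE"
        return state[host]

    for host in parents:
        resolve(host)
    return state


tree = {"A": [], "B": ["A"], "C": ["B"], "D": ["C"], "F": ["B"], "E": ["F"]}
# Both of B's ring links are cut: B still answers, C, D, E and F do not.
result = classify(tree, failed={"C", "D", "E", "F"})
for host in sorted(result):
    print(host, result[host])
```

In this model only C and F come out as DOWN while D and E are UNREACHABLE, so with host `notification_options` limited to `d,r` (no `u`) just the two routers next to the cuts would trigger an SMS.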

Hugo.

Mihai Tanasescu
2008-05-12 20:55:24 UTC
Post by Hugo van der Kooij
This problem should not exist.
Nagios --> Router A --> Router B uplink1+2 ring (and Router B is in a
ring topology which closes in it)

http://tinypic.com/view.php?pic=11uhx7a&s=3 (this is the logical layout)

Yes. But if you cut the 2 uplinks from Router B, then the Nagios machine
will see Router B as up but will not be able to reach any other router
from the ring and will thus alert that all other routers are down (which
is not true).
I mean, with the ring split into the 2 halves you suggested, so that:
C has parent B, D has parent C, E has parent D
G has parent B, F has parent G
=> B up but B uplinks to C and G down -> alerts that C and G are down
although they aren't

Can this be eliminated ? (I'm sure the solution should be simple and
obvious but I'm not being as careful as I should to see it)


Am I right ?


P.S. Currently I am monitoring each link state (up/down) by using SNMP
interface queries (on Cisco routers) and the hosts themselves with
ping/icmp on loopback interfaces that are propagated throughout the
network for reachability(OSPF).
Mirza Dedic
2008-05-12 20:56:00 UTC
Hello,

I know, totally off topic but what if you really wanted to? I want to monitor our Coffee Machine to warn me when it is running low (so that I can go there & put a new coffee in for some fresssh coffee).

Now I know it has nothing that Nagios can talk to; so my question does anyone know of a product you can attach to it that has network capabilities that Nagios can talk to? Lol

Thanks!

Jay R. Ashworth
2008-05-13 19:14:34 UTC
Post by Mirza Dedic
I know, totally off topic but what if you really wanted to? I want to
monitor our Coffee Machine to warn me when it is running low (so that
I can go there & put a new coffee in for some fresssh coffee).
It's not OT, but you shouldn't have piggybacked it onto this thread.

I would use either an optical sensor pointed through the pot position
at about the 20% height level, or a strain gage scale under the entire
coffeemaker, calibrated for the tare weight of the equipment.

Remember that either approach will false-positive when people pick up
the pot to pour coffee.

Urns work better in this environment.

Cheers,
-- jra
--
Jay R. Ashworth Baylink ***@baylink.com
Designer The Things I Think RFC 2100
Ashworth & Associates http://baylink.pitas.com '87 e24
St Petersburg FL USA http://photo.imageinc.us +1 727 647 1274

Those who cast the vote decide nothing.
Those who count the vote decide everything.
-- (Joseph Stalin)
Hugo van der Kooij
2008-05-12 21:34:26 UTC
Mihai Tanasescu wrote:
|> This problem should not exist.
| Nagios --> Router A --> Router B uplink1+2 ring (and Router B is in a
| ring topology which closes in it)
|
| http://tinypic.com/view.php?pic=11uhx7a&s=3 (this is the logical layout)
|
| Yes. But if you cut the 2 uplinks from Router B, then the Nagios machine
| will see Router B as up but will not be able to reach any other router
| from the ring and will thus alert that all other routers are down (which
| is not true).
| I mean having split the ring into the 2 halves you suggested that:
| C has parent B, D has parent C, E has parent D
| G has parent B, F has parent G
| => B up but B uplinks to C and G down -> alerts that C and G are down
| although they aren't
|
| Can this be eliminated ? (I'm sure the solution should be simple and
| obvious but I'm not being as careful as I should to see it)

A ring config is a nightmare from the perspective of Nagios. The maths
simply do not work. The whole parent concept does not work for a ring.
The best you can do is some half way concept that will never show the
proper state in all cases.

Building a config to keep the amount of down reports to a minimum is not
a simple thing. The key is to cut things in half and make sure you get
the timing right. Each node further away must wait longer to go from
soft fail to hard fail state. The manual handles that subject and it is
mandatory to read it before you even try to use the parent feature.
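
One way to picture the timing rule is to let the number of check attempts before a hard state grow with the distance from the Nagios machine. The Python sketch below only illustrates that idea: the function names are invented, and `base` and `step` are arbitrary numbers rather than recommendations. The resulting values would go into each host's `max_check_attempts`.

```python
def depth(host, parents):
    """Number of hops from the Nagios machine, following single parents."""
    hops = 0
    while host in parents:
        host = parents[host]
        hops += 1
    return hops


def attempts_by_depth(parents, base=3, step=1):
    """Suggest a max_check_attempts value that grows with distance.

    `base` and `step` are arbitrary illustration values.
    """
    hosts = set(parents) | set(parents.values())
    return {h: base + step * depth(h, parents) for h in sorted(hosts)}


chain_parents = {"B": "A", "C": "B", "D": "C", "F": "B", "E": "F"}
attempts = attempts_by_depth(chain_parents)
for host, value in attempts.items():
    print(host, value)
```

With the same retry interval everywhere, a router three hops out then reaches its hard state later than the routers in front of it, which gives the parents time to be marked down first.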

So either spend many hours perfecting a model to get a halfway-there
solution or accept the extra down reports and learn to interpret them
as an exact way of telling where your ring broke.

There is no simple solution.

Hugo.

tblader
2008-05-12 21:15:52 UTC
Mirza Dedic wrote:
<snip>
Post by Mirza Dedic
anyone know of a product you can attach to it that has network capabilities that Nagios can talk to? Lol
Thanks!
http://tldp.org/HOWTO/Coffee.html
--
Flambeau Inc. Technology Center - Baraboo, WI
Email : ***@flambeau.com
Keyserver: http://pgp.mit.edu KeyID: 0x00E9EC2C
Jay R. Ashworth
2008-05-13 19:24:16 UTC
Post by tblader
<snip>
Post by Mirza Dedic
anyone know of a product you can attach to it that has network
capabilities that Nagios can talk to? Lol
http://tldp.org/HOWTO/Coffee.html
And let's not forget RFCs 2324 and 2325

Cheers,
-- jra
Thomas Harold
2008-05-15 01:17:44 UTC
Post by Mirza Dedic
Hello,
I know, totally off topic but what if you really wanted to? I want to
monitor our Coffee Machine to warn me when it is running low (so that
I can go there & put a new coffee in for some fresssh coffee).
Now I know it has nothing that Nagios can talk to; so my question
does anyone know of a product you can attach to it that has network
capabilities that Nagios can talk to? Lol
Well, there's always things like the WeatherDuck which hook to the
serial port and support analog / digital sensors. Or their other
products which are IP addressable and connect to analog / digital sensors.

(There's 3 or 4 IP-addressable commercially developed monitoring devices
out there.)

Mark Wagner
2008-05-13 19:23:49 UTC
What has me a little concerned is that if someone went into the web
interface on the main server and say scheduled downtime or disabled
notifications, the backup server would never know about it. In the event
of a failure people could find themselves getting alerts for a host that
should have been in scheduled downtime (or it was on the main server).
While I realize I would not want to capture and retransmit *all*
external commands to the backup host, if I could somehow get at them I
could filter them over to the backup host (i.e. "ignore most commands,
but pass a few like downtime or host notifications", etc).
Is there any mechanism that allows me to do this? As I understand it
the global host and service events really only capture check results --
they're not going to fire if someone schedules downtime.
I have the same dilemma. I don't think Nagios was designed for multiple
web interface servers and anything you do to try to fix it may be too
hackish (as you're about to see). I wonder how the Nagios-based commercial
apps handle this.

I have written a set of scripts that will scour the Nagios log files and
relay selected commands to the backup. The setup is complicated and
I'm starting to think it is the wrong way to go but I'll present it for
your amusement.

On the main server there is a cron job that runs every two minutes:

*/2 * * * * nagios /usr/lib64/nagios/plugins/relay_to_secondary

The script basically does this:

[ -f "/var/log/nagios/rw/stop_relay" ] && exit 0
/usr/lib64/nagios/plugins/relay_commands \
--start=/var/log/nagios/rw/last_relayed \
--update-start <backup>

The file /var/log/nagios/rw/last_relayed keeps track of the location
in the log file(s) of the last line checked for relaying. The
"--update-start" option updates /var/log/nagios/rw/last_relayed with
the last line checked this run.
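
The bookkeeping described here can be sketched in a few lines of Python. This is not the actual relay_commands script: it assumes a single log file and stores a plain byte offset in the state file, whereas the real script tracks a location across several log files. The function name is made up.

```python
import os


def read_new_lines(log_path, state_path):
    """Return log lines added since the last run and remember the new position.

    The position is stored as a byte offset in `state_path`. If the log is
    shorter than the stored offset it is assumed to have been rotated and is
    read from the start.
    """
    try:
        with open(state_path) as fh:
            offset = int(fh.read().strip() or 0)
    except FileNotFoundError:
        offset = 0
    if offset > os.path.getsize(log_path):
        offset = 0
    with open(log_path, "rb") as fh:
        fh.seek(offset)
        data = fh.read()
    # Only consume complete lines; a partial last line waits for the next run.
    end = data.rfind(b"\n") + 1
    with open(state_path, "w") as fh:
        fh.write(str(offset + end))
    return data[:end].decode("utf-8", "replace").splitlines()
```

Writing the new offset only after the read keeps a crashed run from skipping lines; at worst the same lines are looked at twice.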

The "relay_commands" script is attached at the end of this message. It
parses the log files for (a configurable set of) commands to relay to
the backup and then ssh's (with passwordless keys) to the backup and
cat's these commands to the nagios external command pipe.
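
The filtering step might look roughly like the Python sketch below. It assumes the log records external commands as `[epoch] EXTERNAL COMMAND: NAME;args` (which requires `log_external_commands` to be on) and that the command pipe expects `[epoch] NAME;args`. The allowlist and the function name are examples only, and the ssh transport to the backup is left out.

```python
import re

# Commands worth mirroring to the backup; this allowlist is only an example.
RELAYED = {"SCHEDULE_HOST_DOWNTIME", "SCHEDULE_SVC_DOWNTIME",
           "DISABLE_HOST_NOTIFICATIONS", "ENABLE_HOST_NOTIFICATIONS"}

# Assumed log format: "[epoch] EXTERNAL COMMAND: NAME;args"
LOG_LINE = re.compile(r"^\[(\d+)\] EXTERNAL COMMAND: ([A-Z_]+)(;.*)?$")


def commands_to_relay(log_lines, allowed=RELAYED):
    """Turn matching log lines back into command-file syntax "[epoch] NAME;args"."""
    out = []
    for line in log_lines:
        m = LOG_LINE.match(line.strip())
        if m and m.group(2) in allowed:
            out.append("[%s] %s%s" % (m.group(1), m.group(2), m.group(3) or ""))
    return out
```

Anything not on the allowlist, including ordinary alert lines, is dropped, which matches the "ignore most commands, but pass a few" idea from the original question.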

Under normal operation everything is OK. However, if the main or backup
servers go down there is additional work besides enabling/disabling
notifications that needs to be done in the event handler.

When the backup goes down you don't want to relay commands so the event
handler on the main will create the /var/log/nagios/rw/stop_relay file.

When the backup comes back you want to start relaying commands so the event
handler on the main will delete /var/log/nagios/rw/stop_relay.

When the main goes down the event handler on the backup
gets the last line in the log file and writes it to a file <foo>.

When the main comes back the event handler on the backup relays
the commands back to the main using the file <foo> as the starting
point.
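
For the stop_relay part, the event handler logic on the main server could be as small as the Python sketch below. It assumes the handler is passed the values of the `$HOSTSTATE$` and `$HOSTSTATETYPE$` macros for the backup host and that it should react to hard states only; both the function name and that hard-state rule are choices made for this example.

```python
import os


def update_relay_flag(host_state, state_type, flag_path):
    """Create or remove the stop_relay flag from a host event handler.

    `host_state` and `state_type` are expected to carry the values of the
    Nagios $HOSTSTATE$ and $HOSTSTATETYPE$ macros for the backup server.
    Returns True while relaying is suspended.
    """
    if state_type == "HARD":
        if host_state == "UP":
            if os.path.exists(flag_path):
                os.remove(flag_path)
        else:
            # Touch the flag; the cron script exits early while it exists.
            with open(flag_path, "a"):
                pass
    return os.path.exists(flag_path)
```

The cron script shown earlier already checks for this file, so the handler only has to keep the flag in step with the backup's state.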

But wait, there's more! I would like to relay acks/comments and their
deletion as well. However, the "delete comment" command takes an ID
number. If your comments are not exactly synchronized then the wrong
one will be deleted.
--
Mark Wagner <***@u.washington.edu>
System Administrator, UW Medicine IT Services
206-616-6119
Mark Wagner
2008-05-14 00:54:17 UTC
What has me a little concerned is that if someone went into the web
interface on the main server and say scheduled downtime or disabled
notifications, the backup server would never know about it. In the event
of a failure people could find themselves getting alerts for a host that
should have been in scheduled downtime (or it was on the main server).
While I realize I would not want to capture and retransmit *all*
external commands to the backup host, if I could somehow get at them I
could filter them over to the backup host (i.e. "ignore most commands,
but pass a few like downtime or host notifications", etc).
While I'm at it here are more wrinkles.

Suppose your main and backup web servers are truly passive. What happens
when somebody runs the SCHEDULE_FORCED_SVC_CHECK command from the web
interface? Nothing, unless you relay this command to the Nagios box that
actually does the checking (i.e. the collector).

Now, you may not care about this. Perhaps you are thinking "I'll just set my
retry_interval to 1 minute for everything" but now you have constrained
yourself. We use Nagios to check SSL certs with a day-long interval
for checking and retrying. In this case setting the retry interval to 1 min
doesn't hurt but there will come a day when you have a service that requires
a different retry_interval.

Alternatively you can educate the ops people that some things in the web
interface don't work (SCHEDULE_FORCED_SVC_CHECK). Educating ops is hard,
especially when commands are presented that actually do nothing.

What happens when Nagios is beating up a box and you want to
DISABLE_HOST_SVC_CHECKS? Again, you'll need to relay this to the
collector.

Possibly you just tell the ops people to do these things through the
web interface on the collector. I think that would lead to
confusion. Assuming you have top-notch ops and they won't get confused
you still have to run Apache on the collector and manage users on it now.

In our simple redundant collector config there are at least two collectors
checking each service. Now you have to do the above twice, after you
have figured out which collectors are checking a service.

From considerations like these I don't think Nagios works well in a
distributed config. It works "good enough" and for me the other features
far outweigh these issues.

I've been talking about 2.x. The 3.0 version may have solved these issues
by making a distributed config something more than a hack. I envision
adding the following to a host config stanza:

web <web1>,<web2>
collector <collector1>,<collector2>

Then Nagios can figure out the details of synchronizing the information
in itself.
--
Mark Wagner <***@u.washington.edu>
System Administrator, UW Medicine IT Services
206-616-6119
Jonah Horowitz
2008-05-13 19:25:32 UTC
Post by Jay R. Ashworth
Post by Mirza Dedic
I know, totally off topic but what if you really wanted to? I want to
monitor our Coffee Machine to warn me when it is running low (so that
I can go there & put a new coffee in for some fresssh coffee).
It's not OT, but you shouldn't have piggy backed it.
I would use either an optical sensor pointed through the pot position
at about the 20% height level, or a strain gage scale under the entire
coffeemaker, calibrated for the tare weight of the equipment.
Remember that either approach will false-positive when people pick up
the pot to pour coffee.
Urns work better in this environment.
Cheers,
-- jra
These coffee makers (from BUNN) have a built in digital gauge to tell how
much coffee is left in the pot. Perhaps you could hack that sensor to send
SNMP (or NSCA) Alerts.

http://www.bunnomatic.com/pages/commercl/1coffee/apstserv.html#ICB
http://www.bunnomatic.com/pages/windows/1_5_TF_Bless_Serv_B_D.html
http://www.bunnomatic.com/pdfs/commercial/specsheets/a35.pdf

Now, if you get that to work, you should post a how-to.
--
Jonah Horowitz · Monitoring Manager · ***@looksmart.net
W: 415-348-7694 · F: 415-348-7033 · M: 415-513-7202
LookSmart - Premium and Performance Advertising Solutions
625 Second Street, San Francisco, CA 94107
Jay R. Ashworth
2008-05-13 19:36:51 UTC
Post by Jonah Horowitz
These coffee makers (from BUNN) have a built in digital gauge to tell how
much coffee is left in the pot. Perhaps you could hack that sensor to send
SNMP (or NSCA) Alerts.
http://www.bunnomatic.com/pages/commercl/1coffee/apstserv.html#ICB
http://www.bunnomatic.com/pages/windows/1_5_TF_Bless_Serv_B_D.html
http://www.bunnomatic.com/pdfs/commercial/specsheets/a35.pdf
Now, if you get that to work, you should post a how-to.
And you *gotta* implement RFC 2325. :-)

Cheers
-- jra