Ben Prew
2013-07-30 18:05:28 UTC
Hey,
I'm looking for some suggestions for implementing a service check on a
redundant host pair that accesses a shared resource.
Here's our setup:
We have N hosts that process (via delayed_job) a shared job queue
(MySQL/Redis). We have several checks that are host-specific (# of workers
on that host), but we also have several checks that examine the shared job
queue (# of unprocessed jobs).
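
For concreteness, here is a rough sketch of what the shared queue check boils
down to. It is an illustration only, not our actual check: the function name,
the thresholds (100/500), and the Nagios-style exit codes (0 OK, 1 WARNING,
2 CRITICAL, 3 UNKNOWN) are all assumptions, and the job count is passed in as
an argument rather than read from the queue.

```python
#!/usr/bin/env python3
"""Sketch of a shared-queue depth check (hypothetical names and thresholds)."""
import sys

# Exit codes following the common Nagios plugin convention (an assumption;
# the monitoring system is not named in this thread).
OK, WARNING, CRITICAL, UNKNOWN = 0, 1, 2, 3


def evaluate_queue_depth(unprocessed, warn=100, crit=500):
    """Map a count of unprocessed jobs to an (exit_code, message) pair."""
    if unprocessed < 0:
        return UNKNOWN, "UNKNOWN - invalid job count %d" % unprocessed
    if unprocessed >= crit:
        return CRITICAL, "CRITICAL - %d unprocessed jobs" % unprocessed
    if unprocessed >= warn:
        return WARNING, "WARNING - %d unprocessed jobs" % unprocessed
    return OK, "OK - %d unprocessed jobs" % unprocessed


def main(argv):
    # A real check would query the queue (mysql/redis) for the count;
    # here it is taken from the command line to keep the sketch runnable.
    try:
        unprocessed = int(argv[1])
    except (IndexError, ValueError):
        print("UNKNOWN - usage: check_shared_queue.py <unprocessed_job_count>")
        return UNKNOWN
    code, message = evaluate_queue_depth(unprocessed)
    print(message)
    return code


if __name__ == "__main__":
    sys.exit(main(sys.argv))
```

The point is that the result depends only on the shared queue, not on the host
running the check, so every host that runs it reports the same state.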
I have several possible implementations:
============
1. Shared Job Queue check on single processing host (current setup)
Pros:
* We only get notified once when the shared queue is high
Cons:
* If the single host goes down, we lose the shared queue check
============
2. Shared Job Queue check on all processing hosts
Pros:
* If a single processing host goes down, the shared queue check still
functions
Cons:
* Multiple emails from hosts when the shared check fails
============
3. Shared Job Queue check on job queue host (i.e., the DB box)
Pros:
* If the DB goes down, you can't reach the queue anyway
* Single email on failure
Cons:
* The check requires app knowledge, which requires having the app deployed
on the job queue host
How are others adding a check like this? Go with #2 and just bite the bullet
on the multiple emails?
Thanks