Long delay between failure and alert
Version 5.8.2
We are using ServersCheck to monitor about 40 servers. As far as I can tell, the servers are monitored sequentially, and this delays the alerts.
If I pull the cable from a random server, it takes 5 minutes before I receive my first alert, and about the same time passes between subsequent alerts. This is too long: I want to know within 30 seconds at most if a system is down.
Is there a way to ping the servers simultaneously instead of sequentially?
These are the settings for the ping test:
Number of retries before rule fails? 2
Minimal interval between 2 checks? 5 sec
Interval when status is down? 5 sec
Comments
You can never perform PINGs simultaneously, as only one TCP/IP packet can be sent over the network at a time.
The alerts are sent sequentially because some types of alerts (like SMS and MSN) cannot be sent in parallel.
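The constraint that SMS or MSN alerts cannot be sent in parallel applies to alert delivery, not necessarily to the checks themselves. A common pattern is to run checks concurrently and funnel alerts through a single queue drained by one worker, so delivery stays strictly one-at-a-time. The sketch below is a generic Python illustration, not how ServersCheck is built; `probe` and `send_alert` are hypothetical callables supplied by the caller.

```python
import queue
import threading
from concurrent.futures import ThreadPoolExecutor


def run_checks_with_serial_alerts(hosts, probe, send_alert, workers=8):
    """Run probe(host) concurrently; deliver alerts one at a time.

    probe and send_alert are hypothetical callables, not a ServersCheck API.
    Returns the alerts in the order they were delivered.
    """
    alerts: queue.Queue = queue.Queue()
    delivered = []

    def alert_worker():
        # Single consumer thread: alerts are sent strictly one after another,
        # as a channel that cannot handle parallel sends would require.
        while True:
            item = alerts.get()
            if item is None:  # sentinel: no more alerts are coming
                break
            send_alert(item)
            delivered.append(item)

    consumer = threading.Thread(target=alert_worker)
    consumer.start()

    def check(host):
        # Checks run in parallel; only a failure produces a queued alert.
        if not probe(host):
            alerts.put(f"{host} is DOWN")

    with ThreadPoolExecutor(max_workers=workers) as pool:
        list(pool.map(check, hosts))  # list() surfaces any probe exception

    alerts.put(None)
    consumer.join()
    return delivered


hosts = [f"srv{i:02d}" for i in range(40)]
sent = []
result = run_checks_with_serial_alerts(hosts, lambda h: h not in {"srv03", "srv17"}, sent.append)
```

Because the checks finish in a nondeterministic order, the two alerts may arrive in either order; only delivery itself is serialized.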
Regards,
Forum Administrator
Even if I disable all rules except the 40 ping rules, it takes 5 minutes. Even if the checks are done sequentially by five different threads and each ping test takes 1 second, I should get the first alert after 40/5 * 1 second * 2 retries = 16 seconds, not 5 minutes.
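The poster's back-of-envelope estimate can be written out as a small function. This reproduces the formula exactly as stated in the post (servers divided by threads, times ping time, times retries); it is a rough estimate under those assumptions, not a description of how ServersCheck actually schedules checks.

```python
def estimated_alert_delay(n_servers: int, threads: int,
                          seconds_per_ping: float, retries: int) -> float:
    """Poster's estimate: (servers / threads) * ping time * retries, in seconds."""
    return n_servers / threads * seconds_per_ping * retries


estimate = estimated_alert_delay(40, 5, 1.0, 2)
print(estimate)        # 16.0
print(300 / estimate)  # observed 5 minutes is 18.75x the estimate
```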
It is true that only one ping packet can be sent at any given time, but a ping takes only a few milliseconds, so the application should be able to ping all 40 servers within a second.
Is there nothing I can do to speed things up?
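The poster's point can be illustrated with a minimal sketch: even if packets leave the network interface one at a time, the waiting for replies can overlap, so total time approaches the slowest single probe rather than the sum of all probes. The probe below is simulated with a fixed delay (`simulated_ping` and its 50 ms latency are assumptions standing in for a real ICMP echo); it is not ServersCheck code.

```python
import asyncio
import time


async def simulated_ping(host: str, latency: float = 0.05) -> bool:
    # Stand-in for a real ICMP echo; a real check would await the network here.
    await asyncio.sleep(latency)
    return True


async def ping_all(hosts):
    # Launch every probe at once; the waits overlap instead of adding up.
    results = await asyncio.gather(*(simulated_ping(h) for h in hosts))
    return dict(zip(hosts, results))


hosts = [f"server{i:02d}" for i in range(40)]
start = time.perf_counter()
status = asyncio.run(ping_all(hosts))
elapsed = time.perf_counter() - start
print(f"{len(status)} hosts checked in {elapsed:.2f} s")
```

Run sequentially, the same 40 simulated probes would take about 2 seconds (40 × 50 ms); run concurrently, they finish in roughly the time of one probe.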
I too am using the Enterprise Edition 5.8.2 (with updates) on a dual 2.4 GHz server.
At the moment I have a separate screen just for monitoring ServersCheck, so I can see a problem as soon as it arises (far from ideal!).
Thanks
In other words: if you set the retries to 0, the alert is sent immediately when the system does not respond as expected.
The retries are performed in check cycles.
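If each retry waits for the next check cycle, the alert delay grows with both the number of retries and the length of a full cycle. The function below is a simplified model under that assumption (one retry per cycle, plus up to one cycle before the first failed check); it is an illustration, not ServersCheck's documented behavior, and the 100-second cycle in the example is a hypothetical value.

```python
def alert_latency_bounds(retries: int, cycle_seconds: float) -> tuple[float, float]:
    """Simplified model, assuming one retry per full check cycle.

    Best case: the outage starts just before the server's turn in the cycle,
    so the first check fails immediately.
    Worst case: it starts just after, so the first failed check waits almost
    one extra cycle.
    """
    best = retries * cycle_seconds
    worst = (retries + 1) * cycle_seconds
    return best, worst


print(alert_latency_bounds(0, 60))   # (0, 60): retries 0 means alert on first failure
print(alert_latency_bounds(2, 100))  # (200, 300) with a hypothetical 100 s cycle
```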
Could you please email your alerts log file to [email protected]? This way we can check, second by second, the time between the status going to DOWN and the alerts being sent.
Regards,
Forum Administrator
When the "interval when down" option is set to 0, the retry is executed immediately (by the same thread) after the first failure.
This feature is currently being analyzed by our development team. Because it affects the application's internal architecture, we expect it to be available in April.
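The behavior described for an "interval when down" of 0 (retrying immediately in the same thread rather than waiting for a later cycle) can be sketched as a simple loop. This is a generic illustration; `probe` is a hypothetical callable that returns True when the device answers, and the function name is invented for the example.

```python
from typing import Callable


def check_with_immediate_retries(probe: Callable[[], bool], retries: int) -> tuple[str, int]:
    """Sketch of 'interval when down = 0': retry right away in the same thread.

    Returns the final status and the number of probe attempts made.
    """
    attempts = 0
    for _ in range(retries + 1):  # first attempt plus `retries` retries
        attempts += 1
        if probe():
            return "UP", attempts
    return "DOWN", attempts


print(check_with_immediate_retries(lambda: False, 2))  # ('DOWN', 3)

responses = iter([False, True])  # fails once, then answers
print(check_with_immediate_retries(lambda: next(responses), 2))  # ('UP', 2)
```

Because all attempts happen back to back, the DOWN status is reached within a few probe timeouts instead of after several full check cycles.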
For example:
Device switched off at 10pm
Received an alert saying it was down at 5am (!)
Device configured within ServersCheck as:
Number of retries before failure: 4
Minimal time between two checks: 60 seconds
Interval when status is down: 60 seconds
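To put the gap in perspective, the sketch below compares the reported delay with a rough upper bound from the listed settings. It assumes a simplified model (the first failed check comes at most 60 seconds after switch-off, then 4 retries follow 60 seconds apart), which is an assumption rather than documented ServersCheck behavior; the calendar date is arbitrary.

```python
from datetime import datetime, timedelta

retries = 4
interval = timedelta(seconds=60)  # both intervals are 60 s in this configuration

# Simplified model: up to one interval until the first failed check,
# then `retries` retries spaced one interval apart.
expected_worst = (retries + 1) * interval

switched_off = datetime(2000, 1, 1, 22, 0)  # 10pm (date is arbitrary)
alerted = datetime(2000, 1, 2, 5, 0)        # 5am the next day
observed = alerted - switched_off

print(expected_worst)             # 0:05:00
print(observed)                   # 7:00:00
print(observed / expected_worst)  # 84.0
```

Under this model the alert should have arrived within about 5 minutes, so a 7-hour delay is roughly 84 times longer than the settings alone would explain.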
Please send those log files to [email protected], along with the actual error message you received (it contains timing information as well).
Regards,
Forum Administrator