Long delay between failure and alert

March 2005

Version 5.8.2

We are using ServersCheck to monitor about 40 servers. As far as I can see, the servers are monitored sequentially. This introduces a problem with the alerts.

If I pull the cable from a random server, it takes 5 minutes before I receive my first alert and about the same time between subsequent alerts. This is too long; I want to know that a system is down within 30 seconds at most.

Is there a way to ping the servers simultaneously instead of sequentially???

These are the settings for the ping test:

Number of retries before rule fails? 2

Minimal interval between 2 checks? 5 sec

Interval when status is down? 5 sec

March 2005

The number of threads depends on the version you have purchased. Each thread shows up as a running monitoring_rule.exe process. The Professional version is sequential but the Enterprise versions are multi-threaded.

You can never perform a PING simultaneously, as only one TCP/IP packet can be sent over the network at a time.

The alerts are sent sequentially; this is because some types of alerts (like SMS, MSN) cannot be sent in parallel.

Regards,

Forum Administrator

March 2005

I've got 5 instances of monitoring_rule.exe, so we are using the Enterprise version.

Even if I disable all rules except the 40 ping rules it takes 5 minutes. Even if it's done sequentially with five different threads and a ping test taking 1 second it should give me the first alert in 40/5 * 1 second * retries (2) = 16 seconds, not 5 minutes.
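The estimate above can be reproduced with a small sketch. The function and parameter names are hypothetical and only mirror the arithmetic in this post (servers split evenly across threads, a fixed time per ping, multiplied by the retry count); it is not a description of how ServersCheck actually schedules its checks.

```python
import math


def estimated_first_alert_seconds(servers, threads, ping_seconds, retries):
    """Rough time until the first alert, following the arithmetic in the post.

    Assumption: every thread works through an equal share of the servers,
    one ping at a time, and the whole pass is repeated once per retry.
    """
    checks_per_thread = math.ceil(servers / threads)
    return checks_per_thread * ping_seconds * retries


# 40 ping rules, 5 monitoring threads, 1 second per ping, 2 retries
print(estimated_first_alert_seconds(servers=40, threads=5, ping_seconds=1, retries=2))  # 16
```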

It is true that only one ping packet can be sent at any given time, but a ping takes a few milliseconds, so the application should be able to ping all 40 servers in a second.

Is there nothing I can do to speed things up?

March 2005

Our tech support will contact you as you have a support agreement.

March 2005

I also experience the long delay between a problem being found and receiving an alert.

I too am using the Enterprise Edition 5.8.2 (with updates) on a dual 2.4 GHz server...

At the moment I have a separate screen just for monitoring ServersCheck, so I can see a problem as soon as it arises... (far from ideal!)

Thanks

March 2005

You have to be aware that the number of retries influences the speed of sending out the alert.

In other words: if you set the retries to 0 then the alert is immediately sent out when the system does not respond as expected.

The retries are performed in check cycles.

Could you please email your alerts log file to [email protected]? This way we can check, second by second, what happened between the status changing to DOWN and the alerts being sent.

Regards,

Forum Administrator

March 2005

Based upon internal discussions we have decided to introduce the following feature:

When the "interval when down" option is set to 0, the retry will be executed immediately (so by the same thread) after the first check fails.

This feature is currently being analyzed by our development team and it has some impact on the app's internal architecture. As a result we expect this feature to be available in April.

March 2005

I too am experiencing this type of issue.

For example:

Device switched off at 10pm

Receive alert saying it is down at 5am(!)

Device configured within ServersCheck as:

Number of retries before failure: 4

Minimal time between two checks: 60 seconds

Interval when status is down: 60 seconds
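As a rough sanity check on these settings, the sketch below estimates how long an alert should take under a simple assumed model: one full check interval until the first failed check, then one "down" interval per retry. This model and the function name are assumptions for illustration, not ServersCheck's documented behaviour, and the time needed to cycle through other rules is ignored.

```python
def expected_alert_delay_seconds(retries, check_interval, down_interval):
    """Worst-case delay under the assumed model described above.

    Assumption: the device fails right after a successful check, so the
    first failed check comes one check_interval later, and each retry then
    waits one down_interval before the rule is declared failed.
    """
    return check_interval + retries * down_interval


# Settings from the post: 4 retries, 60 seconds between checks, 60 seconds when down
delay = expected_alert_delay_seconds(retries=4, check_interval=60, down_interval=60)
print(delay)  # 300 seconds, i.e. 5 minutes

# Observed delay in the post: switched off at 10pm, alert at 5am (7 hours)
observed = 7 * 3600
print(observed / delay)  # 84.0
```

Under these assumptions the alert would be expected after roughly 5 minutes, so a 7 hour delay is far outside what the retry settings alone would explain.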

March 2005

What is the content of your log file for that device? (the main log file in the logging subdirectory). It will tell you exactly when it detected the error and when it notified.

Please send those log files, along with the actual error message you received (as this contains timing information as well), to [email protected].

March 2005

Could you please send us the log files to [email protected]?

Regards,

Forum Administrator
