Simon
Flexible server monitoring

Failures every 30 minutes in 3.2?


I'm using Simon to monitor various resources on my company's network. When I updated from Simon 3.1.1 to 3.2, my checks started to fail every 30 minutes.

Here are some details: I have about 20 checks, each set to run every minute. The network is fairly new and runs on a Gigabit switch. The checks use Ping and are set to send 10 packets. A failure is triggered by an average ping latency greater than 100 ms, a lost packet, or a test that takes longer than a minute.
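To make those three failure criteria concrete, here is a minimal Python sketch of the rule as I have it configured. The names `PingResult` and `failure_reasons` are hypothetical and only illustrate my settings; this is not how Simon itself implements the check.

```python
from dataclasses import dataclass


@dataclass
class PingResult:
    """Outcome of one check (hypothetical structure, for illustration only)."""
    sent: int              # packets sent, 10 in my setup
    received: int          # packets that came back
    avg_latency_ms: float  # average round-trip time in milliseconds
    elapsed_s: float       # wall-clock time the whole check took, in seconds


def failure_reasons(result, max_avg_ms=100.0, max_elapsed_s=60.0):
    """Return the list of criteria a check violates; an empty list means success."""
    reasons = []
    if result.received < result.sent:
        reasons.append("packet loss")
    if result.avg_latency_ms > max_avg_ms:
        reasons.append("average latency too high")
    if result.elapsed_s > max_elapsed_s:
        reasons.append("check took too long")
    return reasons


# The pattern I see every 30 minutes: fine latency, no loss, but a slow check.
slow = PingResult(sent=10, received=10, avg_latency_ms=15.0, elapsed_s=61.0)
print(failure_reasons(slow))  # ['check took too long']
```

Keeping the three conditions separate makes it easier to see which one actually trips; in my case it is the check time, not the latency or packet loss.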

Every 30 minutes, the check fails with a "check time" greater than 1 minute, then immediately recovers (the next check is only one minute away). The average check time is 10 seconds.

Running ping in a console shows that average ping times for the services I'm checking are in the 10-20 ms range, but every so often I see a spike where a single ping packet takes anywhere from 50 to 200 ms. The server and the network are lightly loaded.

When I increase the acceptable average ping latency, failures get less frequent. However, since the problem went away when I reinstalled Simon 3.1.1, I think that this is really an issue with the 3.2 release.

Any help would be appreciated.

Best Regards,

David Sinclair

Re: Failures every 30 minutes in 3.2?

I haven't heard of any issue with the Ping service in 3.2 before.

It sounds a bit like an issue a few people have had with other services, where Simon can start to time out if overloaded. It is surprising that Ping would be affected, though, since it uses an external process to do the work.

I suggest editing the tests to run a little less frequently, say every 5 or 10 minutes, and seeing if that helps.

But I definitely need to investigate this further.

Re: Failures every 30 minutes in 3.2?

Well, something tells me that few people are using 3.2 yet, so this may be a new issue.

Is there a way to get more detailed logs, especially the logs of this external process?

Also, please note that I do get a few similar failures with 3.1.1 (maybe once a day), but I'm running 20 checks every minute. With 3.2, I get hundreds of failures per day (and therefore hundreds of emails).


David Sinclair

Re: Failures every 30 minutes in 3.2?

There isn't a log for the Ping service, though you can use Preview to try it manually; that might provide a clue as to what is happening.
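Outside Simon, one way to get a comparable record is to time the system `ping` command yourself and log each run. The Python sketch below is only an illustration under assumptions: `parse_ping` and `timed_ping` are hypothetical names, it relies on the standard `ping -c` option and the summary lines printed by the macOS and Linux versions of ping, and it does not reproduce what Simon's Ping service does internally.

```python
import re
import subprocess
import time
from datetime import datetime

# Summary line, e.g. "10 packets transmitted, 10 packets received, 0.0% packet loss"
SUMMARY_RE = re.compile(r"(\d+) packets transmitted, (\d+) (?:packets )?received")
# Round-trip line, e.g. "round-trip min/avg/max/stddev = 10.1/15.5/20.3/1.2 ms"
RTT_RE = re.compile(r"= [\d.]+/([\d.]+)/")


def parse_ping(output):
    """Return (sent, received, avg_ms); avg_ms is None when no reply came back."""
    summary = SUMMARY_RE.search(output)
    if summary is None:
        raise ValueError("no ping summary found in output")
    rtt = RTT_RE.search(output)
    avg_ms = float(rtt.group(1)) if rtt else None
    return int(summary.group(1)), int(summary.group(2)), avg_ms


def timed_ping(host, count=10):
    """Run ping once and return a timestamped log line including wall-clock time."""
    start = time.monotonic()
    proc = subprocess.run(
        ["ping", "-c", str(count), host], capture_output=True, text=True
    )
    elapsed = time.monotonic() - start
    sent, received, avg_ms = parse_ping(proc.stdout)
    stamp = datetime.now().isoformat(timespec="seconds")
    return (
        f"{stamp} {host} sent={sent} received={received} "
        f"avg_ms={avg_ms} elapsed_s={elapsed:.1f}"
    )
```

Calling `timed_ping` on one of the monitored hosts once a minute and appending the result to a file would show whether the ping process itself takes over a minute every 30 minutes, or whether the delay only appears inside Simon.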