Simon
Flexible server monitoring

Uptime Calculations - Time Based vs Count Based

When displaying uptime for tests, Simon computes the number of successes divided by the total number of tests. While this is fine for equally spaced tests, it doesn't work for unevenly spaced ones.

Consider hourly tests that switch to every minute on failure (many of my tests use this scheme).

A day of uptime followed by an hour of downtime, then fixed = 24 successful tests, 60 failed tests = 28.6% uptime.
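To make that arithmetic concrete, here is a minimal Python sketch of a count-based ratio. The function name `count_based_uptime` is made up for illustration and is not taken from Simon itself.

```python
def count_based_uptime(successes: int, failures: int) -> float:
    """Percentage of checks that succeeded, ignoring how far apart they were."""
    total = successes + failures
    if total == 0:
        raise ValueError("no checks recorded")
    return 100.0 * successes / total


# 24 hourly successes followed by 60 one-minute failures
print(round(count_based_uptime(24, 60), 1))  # 28.6
```

Measured by elapsed time instead, the same day would be 24 hours up out of 25, or 96%.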

Obviously this isn't the biggest problem, but I really want to see more realistic numbers. My server, which had 100% uptime over the last 4 weeks, just dropped to 85% after only a couple of hours of downtime. It just doesn't reflect reality.

Anyhow, add this to my request to be able to delete specific tests (for example, when Simon fails a test due to being on public WiFi or ad-hoc networks) to avoid them sullying the results.

David Sinclair

Re: Uptime Calculations - Time Based vs Count Based

I agree, the calculation can become less accurate when there's a difference between success and failure check frequencies.

I'm not sure how to improve that... maybe add some sort of weighting factor for the frequency.

Re: Uptime Calculations - Time Based vs Count Based

I'd suggest that it counts the amount of time between the failure and the recovery. So:

10:00: success
11:00-11:35: failures
11:36: recovery
12:36: success

It would count 36 minutes of downtime out of 2:36 of total time, thus 2:00 of uptime = 77% uptime. Of course it could still be off (at what time between 10:00 and 11:00 did it fail?), but that's just inherent to the design. You could also have an option to weight it, e.g. an 11:00 failure with the last success at 10:00 would add an extra half hour of downtime, averaging over the time between 10:00 and 11:00 when it may have failed, but I don't think it would be strictly necessary...
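A rough Python sketch of that time-based idea, under the assumption that each interval between two consecutive checks is attributed to the result of the earlier check. The function name `time_based_uptime` and the date used are hypothetical; this is not how Simon is known to implement it.

```python
from datetime import datetime, timedelta


def time_based_uptime(checks):
    """checks: list of (datetime, ok) tuples sorted by time.

    The time between two consecutive checks counts as uptime if the
    earlier check succeeded and as downtime if it failed.
    """
    if len(checks) < 2:
        raise ValueError("need at least two checks")
    up = down = timedelta(0)
    for (start, ok), (end, _) in zip(checks, checks[1:]):
        if ok:
            up += end - start
        else:
            down += end - start
    return 100.0 * (up / (up + down))


day = datetime(2024, 1, 1)  # arbitrary date, for illustration only
checks = [(day.replace(hour=10), True)]
# failures every minute from 11:00 through 11:35
checks += [(day.replace(hour=11, minute=m), False) for m in range(36)]
checks += [
    (day.replace(hour=11, minute=36), True),  # recovery
    (day.replace(hour=12, minute=36), True),
]
print(round(time_based_uptime(checks), 1))  # 76.9
```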

David Sinclair

Re: Uptime Calculations - Time Based vs Count Based

That may be feasible... though allowing for times when Simon isn't running (or the test is paused) may complicate things. That time shouldn't be counted as a success or failure. For example, if you pause a test while it is having a failure, and resume it a day later, using time would indicate a long downtime, whereas it may have actually only been short. With the counting approach, that period doesn't affect the calculation.
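The pause problem can be illustrated with a small Python sketch. It assumes one possible workaround, namely ignoring any gap between checks that is longer than the longest scheduled interval; the name `uptime_skipping_gaps`, the `max_gap` parameter and the sample times are all hypothetical and not Simon's actual behavior.

```python
from datetime import datetime, timedelta


def uptime_skipping_gaps(checks, max_gap=None):
    """Time-based uptime over (datetime, ok) tuples sorted by time.

    If max_gap is given, any gap longer than it is treated as
    "not running or paused" and counts as neither uptime nor downtime.
    """
    up = down = timedelta(0)
    for (start, ok), (end, _) in zip(checks, checks[1:]):
        gap = end - start
        if max_gap is not None and gap > max_gap:
            continue  # assumed pause: leave this period out entirely
        if ok:
            up += gap
        else:
            down += gap
    return 100.0 * (up / (up + down))


d1, d2 = datetime(2024, 1, 1), datetime(2024, 1, 2)  # arbitrary dates
paused_checks = [
    (d1.replace(hour=10), True),
    (d1.replace(hour=11), True),
    (d1.replace(hour=12, minute=0), False),
    (d1.replace(hour=12, minute=1), False),
    (d1.replace(hour=12, minute=2), False),  # test paused after this failure
    (d2.replace(hour=12, minute=2), True),   # resumed a day later
    (d2.replace(hour=13, minute=2), True),
]
print(round(uptime_skipping_gaps(paused_checks), 1))                      # 11.1
print(round(uptime_skipping_gaps(paused_checks, timedelta(hours=1)), 1))  # 98.9
```

Without the cutoff, the day-long pause is counted as downtime; with it, only the two minutes of observed failure are.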

Re: Uptime Calculations - Time Based vs Count Based

But if you use counts, then the failure-dependent testing intervals will affect the calculation.

Mathematically, if they're weighted by frequency (your original post) it should actually work out the same though.

Ergo, as with my previous example:
10:00: success
11:00-11:35: failures
11:36: recovery
12:36: success
= 3 successes, 36 failures (successes weighted as 1 hour = 60, failures weighted as 1 minute = 1)
= (3*60) / ((3*60)+(36*1)) = 83.3% uptime. Counting what actually should have been 3 hours of uptime in the previous post (12:36-1:36 for the last success), my previous calculation gives the same result: (3:00/3:36) = 83.3%. So just multiply each test by its check interval in minutes (60 for a success, 1 for a failure) before computing the ratio.
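The weighting described above can be sketched in a few lines of Python. The function and parameter names are invented for illustration, and the defaults simply mirror the hourly/one-minute scheme from this thread.

```python
def weighted_uptime(successes: int, failures: int,
                    success_interval_min: float = 60,
                    failure_interval_min: float = 1) -> float:
    """Count-based uptime where each check is weighted by its interval in minutes."""
    up = successes * success_interval_min
    down = failures * failure_interval_min
    if up + down == 0:
        raise ValueError("no checks recorded")
    return 100.0 * up / (up + down)


print(round(weighted_uptime(3, 36), 1))   # 83.3
print(round(weighted_uptime(24, 60), 1))  # 96.0
```

The second call is the example from the first post: a day of hourly successes followed by an hour of one-minute failures.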

The only problem is that the math would break down if you change the check interval (e.g. switch from every minute to every 5 minutes without resetting the test) or if you manually initiate a test (which I do occasionally). So it's not perfect, but it's closer to the truth.

David Sinclair

Re: Uptime Calculations - Time Based vs Count Based

Okay, you've convinced me. I've just made this change for the next release!

David Sinclair

Re: Uptime Calculations - Time Based vs Count Based

Included in Simon 3.6b3, just released.

Re: Uptime Calculations - Time Based vs Count Based

It looks like you're using the new calculations in the lower info panel, but not in the table view. ;) Everything else seems to be working great.

David Sinclair

Re: Uptime Calculations - Time Based vs Count Based

Hmm, should be the same calculation, but I'll check that.

Re: Uptime Calculations - Time Based vs Count Based

Here's a screenshot of what I'm getting. (The value on the bottom and the pie chart are both correct, the value in the list is not.)

imgur DOT com/XcdjiCe
...trying to avoid the link spam filter :)

David Sinclair

Re: Uptime Calculations - Time Based vs Count Based

Fixed for 3.6b4. Thanks for your help with this!

Re: Uptime Calculations - Time Based vs Count Based

Still shows the old values running 3.6b4: imgur DOT com/Kw3xRzU

David Sinclair

Re: Uptime Calculations - Time Based vs Count Based

Did you check each test? It won't recalculate the Up Time in the table until the next check.