Simon
Flexible server monitoring

Simon crashes and unresponsiveness

We've got ~170 tests set up and have experienced some flakiness that we could use some help figuring out.

This weekend Simon crashed. We submitted the crash report.

Right now, Simon is unresponsive for 5+ minutes at a time with 90%+ CPU usage. I was able to pause all tests and turn the time between tests up to the maximum to relieve this as much as possible. It seems to run better like this, but tests are getting backed up.

Is there anything we can do to fix this?

(Edit: Attached sample stuff)

Attachment                 Size
Sample of Simon.txt        29.57 KB
Sample Simon Mem.png       35.78 KB
Sample Simon Stat.png      40.27 KB
Queue 1 day Sample.txt     2.81 KB
Queue 1 day.png            37.97 KB
queue 4 days.png           159.43 KB
queue 4 days2.png          142.5 KB
queue 4 days.txt           41.03 KB
triple.png                 5.97 KB
David Sinclair's picture

Re: Simon crashes and unresponsiveness

Thanks for the crash report. It did seem to indicate running out of resources, possibly due to too many tests being checked at once. Simon tries to avoid this by spacing out the checks; previous versions limited it to one check at a time, which solved that, but caused some tests to not be checked when there are lots of tests. Version 2.3.5 instead adds a brief delay between each check, to allow earlier ones to finish, but still allows multiple checks at once.
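To illustrate the trade-off described above, here is a toy model of staggered checks. It is only a sketch, not Simon's actual code: the function name `peak_overlap`, the check durations, and the delays are all made up for illustration. Each check starts a fixed delay after the previous one, without waiting for it to finish, and the function reports how many checks end up running at once.

```python
def peak_overlap(durations, start_delay):
    """Checks start `start_delay` seconds apart; return the most that run at once.

    Hypothetical model: `durations[i]` is how long check i takes, in seconds.
    """
    events = []
    for i, duration in enumerate(durations):
        start = i * start_delay
        events.append((start, 1))              # a check begins
        events.append((start + duration, -1))  # that check ends
    running = peak = 0
    # Tuples sort ends (-1) before starts (+1) at the same instant,
    # so back-to-back checks are not counted as overlapping.
    for _, delta in sorted(events):
        running += delta
        peak = max(peak, running)
    return peak


durations = [2, 2, 10, 2, 2, 2]   # one slow check among quick ones
print(peak_overlap(durations, start_delay=1))    # short delay: checks overlap
print(peak_overlap(durations, start_delay=10))   # long delay: one at a time
```

In this model a longer delay lowers the number of simultaneous checks (and so the peak resource use), at the cost of taking longer to get through the whole list.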

When it was unresponsive, how many tests were being checked? Was it showing a SPoD cursor?

Increasing the time between checks, when you have that many tests, is probably the best solution for now. That'll allow each test enough time to be checked and to free its resources for other tests.

I am in the planning stages for a future version that will have other changes that should help with this situation. It'll use separate helper apps for each check, thus avoiding the resource limitations on a single app, and increasing reliability: a crash won't take out Simon.
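The helper-process idea can be sketched generically. This is an assumption-laden illustration, not the planned Simon design: `run_check_in_helper` is a hypothetical name, and the "check" here is just a snippet of Python run in a child interpreter. The point is only that a crash or hang in the child is reported to the parent as a failed check instead of taking the parent down.

```python
import subprocess
import sys


def run_check_in_helper(code, timeout=10):
    """Run one check in a separate Python process.

    Returns True if the helper exited cleanly, False if it crashed,
    exited with an error, or exceeded `timeout` seconds.
    """
    try:
        result = subprocess.run(
            [sys.executable, "-c", code],  # child interpreter acts as the helper
            capture_output=True,
            timeout=timeout,
        )
    except subprocess.TimeoutExpired:
        return False                        # a hung helper is killed and reported
    return result.returncode == 0


print(run_check_in_helper("pass"))                    # healthy helper
print(run_check_in_helper("import os; os.abort()"))   # helper crashes, parent survives
```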

There were quite a few

There were quite a few tests all checking at once. Maybe 5 or 6. And yes, I had a SPoD.

I've got the check pause set to 6 seconds right now. Things are backed up 4 minutes. I've slowly been backing it down.

I did an Activity Monitor "Sample Process" while it was doing this. I've got it saved, if that would help.

David Sinclair's picture

Re: There were quite a few

Yes, a longer interval could improve that further.

Please send the sample; it might be helpful.

Samples

I've attached the sample to this thread along with 2 pngs.

David Sinclair's picture

Re: Samples

Thanks for that.

Interestingly, it's spending most of its time, at least in that sample, sorting the log entries. Integrating and sorting the log entries isn't a trivial operation, but it shouldn't be slowing down Simon.

What values do you have for the log preferences? Have you increased them significantly from the defaults? But I guess with hundreds of tests, even the default of 100 entries for each of 170 tests might tax things a little... so you could try decreasing those values a bit.
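As a rough back-of-the-envelope for the numbers above: the sketch below assumes all per-test logs are merged into one list and sorted with an ordinary comparison sort, which may not match how Simon actually handles its log. The function names are hypothetical.

```python
import math


def merged_log_size(num_tests, entries_per_test):
    """Total log entries if every test keeps `entries_per_test` entries."""
    return num_tests * entries_per_test


def approx_sort_comparisons(n):
    """Order-of-magnitude n * log2(n) estimate for a comparison sort."""
    return int(n * math.log2(n)) if n > 1 else 0


total = merged_log_size(170, 100)   # 170 tests at the default of 100 entries
print(total)                        # 17000 entries to integrate and sort
print(approx_sort_comparisons(total))
print(approx_sort_comparisons(merged_log_size(170, 50)))  # after halving the limit
```

Under these assumptions, halving the per-test limit slightly more than halves the estimated sorting work, which is consistent with the suggestion to lower the log preferences.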

We are also seeing an issue

We are also seeing an issue where tests don't get run. Currently, we have 30 or so tests that have been queued for up to a full day. Over the course of writing this, the number has decreased, so something I've done has "woken" it up and it has started to run these tests.

I've uploaded a sample and a screenshot of this as well.

I turned the log entries down a lot: from 500 to 100, and from 100 to 50.

David Sinclair's picture

Re: We are also seeing an issue

Interesting. Your sample shows Simon idling normally, so nothing wrong there.

The excessive queuing is unexpected, though. Are you on version 2.3.5? Earlier 2.3 versions had some issues with the queuing as I experimented with the best approach.

Yeah, we are on 2.3.5. It's

Yeah, we are on 2.3.5. It's humming along right now with an average queue time of 2.5 minutes. If it goofs up again, I'll take some more samples.

Queue of 4 days

David, we've got the queue bug again.

It's at 4 days.

I've attached a sample and pics.

Is there anything else I can do to help you? We can't move to this (from our current solution) until we get this ironed out.

Justin

David Sinclair's picture

Re: Queue of 4 days

I'm sorry it's misbehaving for you. I think I need more information to diagnose this.

How many tests do you have, total?

How many of those are queued?

Do you actively use the computer Simon is on, or does it run unattended?

What column are you sorting on in Simon? (The tests are checked in the order they are listed, so sometimes changing the sort order can help, e.g. sort by Next Check with the queued ones at the top.)

Depending on your answers to the above (and trying re-sorting), the next step might be for you to email me a copy of your data so I can recreate the problem.

How many tests do you

How many tests do you have, total?

163

How many of those are queued?

~72

Do you actively use the computer Simon is on, or does it run unattended?

Unattended, on a dedicated 1.83 CD Mac mini with 2 GB RAM.

What column are you sorting on in Simon?

Name

Depending on your answers to the above (and trying resorting), the next step might be for you to email me a copy of your data so I can recreate the problem.

I'll try sorting on Next Check.

David Sinclair's picture

Sorting by Next Check should fix it

Thanks for the info. I'm pretty confident that sorting by the Next Check column will fix it.

This could be considered a bug, but it only affects people with lots of tests that are checked fairly frequently. It goes down the list in sort order and queues each test that is due, and since the earlier ones become due again before it gets to the later ones, they end up getting checked again, further delaying the later ones, to the point that they never get checked. Sorting on Next Check allows each test its turn.
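The starvation described above can be reproduced with a small simulation. This is a simplified model, not Simon's scheduler: it assumes one check at a time, every test has the same check interval, and each check plus the pause after it occupies a fixed `slot` of time. The 163 tests and 6-second pause come from this thread; the 60-second interval and the function name `simulate` are assumptions for illustration.

```python
def simulate(num_tests, interval, slot, duration, by_next_check):
    """Return how many times each test was checked within `duration` seconds."""
    next_due = [0] * num_tests    # every test starts out due
    checks = [0] * num_tests
    now = 0
    while now < duration:
        due = [i for i in range(num_tests) if next_due[i] <= now]
        if not due:
            now = min(next_due)   # idle until the next test becomes due
            continue
        if by_next_check:
            # Sorted by Next Check: the longest-overdue test goes first.
            i = min(due, key=lambda j: next_due[j])
        else:
            # Fixed list order (e.g. sorted by Name): first due test wins.
            i = due[0]
        now += slot               # the check itself plus the pause after it
        checks[i] += 1
        next_due[i] = now + interval
    return checks


list_order = simulate(163, interval=60, slot=6, duration=3600, by_next_check=False)
next_check = simulate(163, interval=60, slot=6, duration=3600, by_next_check=True)
print("never checked, list order:", list_order.count(0))
print("never checked, Next Check order:", next_check.count(0))
```

In this model the tests near the top of a fixed list become due again before the scan reaches the rest, so most tests are never checked. Picking by Next Check gives every test a turn, although each one is then checked less often than its nominal interval because there are more tests than the time slots allow.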

I have plans to completely revamp the scheduling in a future version, but I would like to fix this sooner, as it does affect a few people. I've just thought through the queueing logic, and think I've come up with a solution, which I'll implement in the next release.

David, I've seen something

David, I've seen something like this a few times. Cosmetic error?

The average queue length is ~4 minutes now. So sorting by next check does help.

David Sinclair's picture

Re: David, I've seen something

Yes, I've seen that on occasion. It's just a cosmetic thing, due to the way the spinners are drawn. In technical terms, they are subviews on the table view, rather than normal table cells, so if the row order changes while they're visible, they can be left in the wrong location.

I have a solution in mind for a future version, but it hasn't been a priority to fix, being purely cosmetic and temporary (just till the check completes).

I'm glad sorting by next check helped. That's the workaround for now, but as I said, I do plan to fix that in the next release.