smarthome icon indicating copy to clipboard operation
smarthome copied to clipboard

Cron Scheduler Replacement?

Open kaikreuzer opened this issue 7 years ago • 10 comments

We have learned over the past months that implementing a cron scheduler is not a trivial task - our current implementation has caused quite some issues, which are hard to analyse, debug and fix.

Instead of fully inventing it on our own, we should check whether using existing implementations might be a viable option. Two potential candidates have already been mentioned in the past and I would like to at least check their pros & cons to come to an informed decision:

  • cron-utils: Seems to be an actively maintained project with a nice API and many features. Not sure whether it can deal with scheduling jobs itself though, it seems that it only calculates next execution times.
  • OSGi enRoute Scheduler: Looks like a small and straight-forward scheduler that can also deal with cron expressions and which might even become part of the OSGi spec one day.

kaikreuzer avatar Aug 25 '17 09:08 kaikreuzer

FYI: The same github person that hosts the cron-utils project also hosts an cron-utils-scheduler project. Perhaps we could have a look at this one, too.

maggu2810 avatar Aug 26 '17 13:08 maggu2810

Cons : it only supports cron expressions

I initially implemented the scheduler as part of a larger exercise to have a built in caldav server, but that was never reviewed, and to replace the quartz scheduler. I would not like to throw that effort away

This aside, imho this kind of effort is energy badly spent. What kind of issues did we have, and that we are unable to solve? Does the 2-3 bugs we had to fix really make that we have throw all this away? I don it understand your reasoning here

kgoderis avatar Aug 26 '17 14:08 kgoderis

Cons : it only supports cron expressions

Right, that's a good point - caldav support can indeed become an important feature.

What kind of issues did we have, and that we are unable to solve?

We had a couple of issue that haven't been solved: https://github.com/eclipse/smarthome/issues/3185, https://github.com/eclipse/smarthome/issues/3912 and https://github.com/eclipse/smarthome/issues/4145 and a few others that were solved.

The problem is that all of them are rather critical issues and quite some people actually already tried looking into them, but failed to come up with a fix. I have to admit that I personally struggle to understand the code and its inner working - so I am of no huge help here either.

kaikreuzer avatar Aug 28 '17 13:08 kaikreuzer

I just tried to fix #3185 with PR #4154. It took me a couple of days to understand what is going on in the scheduler and I am still not sure that I understood everything. I have to say that the code is pretty hard to read because:

  • classes that contain static inner classes such as ExpressionThreadPoolManager
  • doubled code: AbstractExpression.parseExpression() is called 3 times here (2 times in the setters above)
  • code with (not needed) side effects: AbstractExpression.getTimeAfter() sets a start time?
  • magic numbers in the code: CronExpression() uses a "2" to state it wants to keep 2 execution time candidates

@kgoderis Could you please have a look at my PR to see if I got it right? The thing I am trying to do there is to detect time jumps larger than 3 seconds by storing the time the monitorTask plans to wake up and then comparing it to the time it actually woke up. In case of the daylight saving time change there will be a 1 hour gap which will then lead to a rescheduling of all scheduled tasks.

In order do do that I had to change the behavior of the monitorTask to wake up regularly and perform this check. I also introduced the DateWrapper which offers me the the possibility to fake a time jump for testing purposes, see my 2 new test cases.

Is there someone else familiar with the scheduler code who could maintain it, except for you?

triller-telekom avatar Aug 29 '17 14:08 triller-telekom

@triller-telekom @kaikreuzer

I do acknowledge that the code of the scheduler is quite hard, and it became quite hard when I tried to combine cron expressions with RFC5545 expressions. These are quite different animals, for example, in RFC5545 one can specify an end date, and I tried to come up with an algorithm that would be able to calculate both expressions. If you would only consider cron, then it would be far more simple. From that aspect I do agree with Kai's initial concern

In short, the algorithm does basically this: take one or more seeds (e.g. the start date), and expand those seeds by applying parts of the given expression so that the set of candidate dates grows. In order to keep the number of candidates reasonable - as it explodes very rapidly, e.g. expressions with * specified for the "second" expression part - the set of candidate dates is trimmed/pruning to remove the candidates that would not "make it", e.g. that are unlikely to be triggered next. The generation of candidate dates is further optimised by giving the expression parts a specific order (e.g. in case of cron, we first expand the "year" part, and the "second" part last)

Now, some early bugs in the code where jointly fixed by @maggu2810 and myself, so he is the only one that also up to speed with the code, but even for me I have to revisit it every time since I wrote that such a long time ago.

https://github.com/eclipse/smarthome/issues/3912 (After debugging a little bit I found out, that CronExpression#applyExpressionParts is not able to calculate dates, that would require a change to the next year, it fails e.g. for 0 0 0 ? * *' and 2017 Dec 31, 23:01.) -> I think this one has to do with the pruning part of the algorithm that is too aggressive. It can be solved by keeping for the "year" part of the expression an additional candidate. If I remember well (long time ago!), I limited execution to the current year.

https://github.com/eclipse/smarthome/issues/4145 -> Could be linked to fact that none of the expressions types have a granularity at the millisecond level (e.g. smallest is a 1 second), and therefore the algorithm has to round some of the Date(...) outputs. If rounded downwards, this might lead to side effects for special cases like midnight

With respect to the specific comments/questions you make:

  • AbstractExpression.parseExpression() is indeed called frequently, as the objective was to have the Expression object always "loaded"/"up to date" and ready to be queried with getTimeAfter(). Maybe this was a bad decision choice from my part
  • The AbstractExpression.getTimeAfter() sets a start time, but in fact to "reset" the expression back to its original start date. The thing is that for certain expressions, because they are floating for example (like cron) with no real start date, you have to set "a" start date in order to calculate the candidates.
  • Magic numbers are indeed magic ;-) In this case it is needed in order to keep the algorithm going when there would not be sufficient candidates (e.g. we do not want to run out of "seeds" to calculate new candidates upon when applying the next part of the expression). We need one "spare" seed because if the expression would generate a candidate that is equal to the current date, then that candidate would be pruned anyways (e.g. it is getTimeAfter(), not getTimeAtorAfter() )

So, lengthy post, but that means @kaikreuzer that I am back in business/shape.

K

kgoderis avatar Sep 02 '17 16:09 kgoderis

take one or more seeds (e.g. the start date), and expand those seeds by applying parts of the given expression so that the set of candidate dates grow

Regarding this: could #4130 be caused by some kind of overflow or memory leak?

FrankR59 avatar Sep 03 '17 16:09 FrankR59

@FrankR59 Clearly not, see my https://github.com/eclipse/smarthome/issues/4130#issuecomment-325382966.

kaikreuzer avatar Sep 04 '17 08:09 kaikreuzer

Regarding this: could #4130 be caused by some kind of overflow or memory leak?

That issue is not related to this one, see https://github.com/eclipse/smarthome/issues/4130#issuecomment-325382966

That being said, AFAIK there never have been memory leak problems with Quartz.

kgoderis avatar Sep 07 '17 19:09 kgoderis

Anything new about this issue? I am using rpi3/stretch with ssd instead of sd-card and oh2.3.0 meanwhile, but time based cron rules still are stopping every few days!

FrankR59 avatar Sep 08 '18 18:09 FrankR59

I wanted to reference this issue: https://community.openhab.org/t/cron-didnt-fire-summer-time-issue/42477 because I think it might be related. Contrary to the title of the discussion I now believe that the cron fires. It probably just fires at the wrong (old) time. Since the calculations in the rule body probably use the correct (new) time, the value of the item didn't change.

jewesta avatar Apr 04 '19 09:04 jewesta