Cook loses track of offers

Open wyegelwel opened this issue 7 years ago • 8 comments

It is possible for Cook to be assigned an offer and then neither accept nor reject it. I was able to repro this by running a dev cook connected to minimesos and waiting about 10-15 minutes.

A workaround is to start mesos with offer expiry. However, it is worth tracking down where cook loses the offers and fixing it.

wyegelwel avatar Mar 27 '17 19:03 wyegelwel

Working on it.

mforsyth avatar Apr 03 '17 13:04 mforsyth

Neither I nor @wyegelwel can reproduce. Looking at the code, my suspicion is that, for some reason, cook was unable to connect to the Mesos master at the time and so could not decline the offer(s). @wyegelwel can we close until it rears its head?

mforsyth avatar Apr 06 '17 13:04 mforsyth

We are currently using Cook to schedule jobs onto Mesos and end up in a state where jobs are perpetually stuck in "WAITING" state. We have to restart the cook server to get it out of this state but eventually it goes back into the same state again.

Could the problem we see be related to this issue?

Also, if we want to debug this further, how would you suggest we go about it? We don't really see anything suspicious in the logs. Thread dumps and heap dumps look fine too. So, we are not sure where to go next.

itspanzi avatar Jun 05 '18 18:06 itspanzi

Hi @itspanzi, thanks for chiming in. Are you able to reproduce this reliably? If so, could you describe how to reproduce it (e.g. Cook Scheduler configuration, job submissions, etc.)?

dposada avatar Jun 05 '18 18:06 dposada

At the moment, we are running an old version of Cook (0.1.10). So, we are upgrading it today to the latest release version. If the problem still persists, I will add more details here.

Also, a couple of questions:

  • Is there a reason why all scheduled jobs would be sitting in a "waiting" state without actually being picked up by Mesos?
  • Is there a Slack channel where the Cook team hangs out where we can give more details?

itspanzi avatar Jun 05 '18 22:06 itspanzi

One other approach is to use the unscheduled_jobs endpoint, i.e.

http://cook.example.com/unscheduled_jobs?job=5ccf214c-ba94-4304-9959-c6670d4645c6

That should give you some insight into why Cook has been rejecting certain offers for the job. Can you try that and let us know what you see?
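
For anyone scripting the request above, here is a minimal Python sketch. It only assumes what the URL shows: an `unscheduled_jobs` path that takes a `job` query parameter holding the job UUID. The function names are hypothetical, `cook.example.com` is the placeholder host from the URL above, and the sketch assumes the server answers an unauthenticated GET with a JSON body; a real deployment may require credentials, and the response layout is not described in this thread, so the body is returned as-is.

```python
import json
import urllib.parse
import urllib.request


def build_unscheduled_jobs_url(base_url: str, job_uuid: str) -> str:
    """Build the unscheduled_jobs query URL for a single job UUID."""
    query = urllib.parse.urlencode({"job": job_uuid})
    return f"{base_url.rstrip('/')}/unscheduled_jobs?{query}"


def fetch_unscheduled_jobs(base_url: str, job_uuid: str, timeout: float = 10.0):
    """GET the endpoint and return the decoded body.

    Assumption: the response is JSON and no authentication is needed.
    """
    url = build_unscheduled_jobs_url(base_url, job_uuid)
    with urllib.request.urlopen(url, timeout=timeout) as response:
        return json.loads(response.read().decode("utf-8"))


# Placeholder host and the example UUID from the URL above.
example_url = build_unscheduled_jobs_url(
    "http://cook.example.com",
    "5ccf214c-ba94-4304-9959-c6670d4645c6",
)
print(example_url)
```

Calling `fetch_unscheduled_jobs` with your own Cook host and a waiting job's UUID, then pretty-printing the result with `json.dumps(result, indent=2)`, makes it easier to paste the output into this thread.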

pschorf avatar Jun 11 '18 15:06 pschorf

@itspanzi We've created a Slack workspace that the dev team is in, feel free to sign up via:

http://cookscheduler.herokuapp.com/

dposada avatar Jun 12 '18 15:06 dposada

Hi @dposada and @pschorf - thanks for following up on this thread. So, we were running cook 0.1.10 and also had a bug in our framework id in Mesos (we somehow ended up with invalid characters in the framework id that Cook used in Mesos).

We fixed the framework id and upgraded to the latest cook version (one from last week). Things seem to be working as expected. We will keep an eye on it and, if we see anything weird, will drop a note here.

Also - thanks for the Slack info. Will definitely drop by. 👍

itspanzi avatar Jun 12 '18 17:06 itspanzi