Cook
Cook loses track of offers
It is possible for Cook to be assigned an offer and then neither accept nor reject it. I was able to repro this by running a dev Cook connected to minimesos and waiting about 10-15 minutes.
A workaround is to start Mesos with offer expiry; however, it is worth tracking down where Cook loses the offers and fixing it.
Working on it.
Neither I nor @wyegelwel can reproduce this. Looking at the code, my suspicion is that, for some reason, Cook was unable to connect to the Mesos master at the time in order to decline the offer(s). @wyegelwel, can we close this until it rears its head again?
We are currently using Cook to schedule jobs onto Mesos and end up in a state where jobs are perpetually stuck in the "WAITING" state. We have to restart the Cook server to get it out of this state, but eventually it goes back into the same state again.
Could the problem we see be related to this issue?
Also, if we want to debug this further, how would you suggest we go about it? We don't really see anything suspicious in the logs. Thread dumps and heap dumps look fine too. So, we are not sure where to go next.
Hi @itspanzi, thanks for chiming in. Are you able to reproduce this reliably? If so, could you describe how to reproduce it (e.g. Cook Scheduler configuration, job submissions, etc.)?
At the moment, we are running an old version of Cook (0.1.10). So, we are upgrading it today to the latest release version. If the problem still persists, I will add more details here.
Also, a couple of questions:
- Is there a reason why all scheduled jobs would be sitting in a "waiting" state without actually being picked up by Mesos?
- Is there a Slack channel where the Cook team hangs out where we can give more details?
One other approach is to use the unscheduled_jobs endpoint, e.g.
http://cook.example.com/unscheduled_jobs?job=5ccf214c-ba94-4304-9959-c6670d4645c6
That should give you some insight into why Cook has been rejecting certain offers for the job. Can you try that and let us know what you see?
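If you would rather script that check than paste URLs into a browser, here is a minimal Python sketch. The host is the placeholder from the example URL, the helper names are hypothetical, and it assumes the endpoint returns JSON and needs no authentication, so adjust for your deployment.

```python
import json
import uuid
from urllib.parse import urlencode
from urllib.request import urlopen

# Placeholder host from the example URL above; replace with your Cook server.
COOK_URL = "http://cook.example.com"


def unscheduled_jobs_url(job_uuid, base_url=COOK_URL):
    """Build the unscheduled_jobs URL for one job UUID (hypothetical helper)."""
    # Validate locally so a typo fails here rather than on the server.
    job = str(uuid.UUID(job_uuid))  # raises ValueError on a malformed UUID
    return f"{base_url.rstrip('/')}/unscheduled_jobs?{urlencode({'job': job})}"


def fetch_unscheduled_reasons(job_uuid, base_url=COOK_URL, timeout=10):
    """GET the endpoint and return the decoded body unchanged.

    Assumption: the response is JSON and no authentication is required;
    add headers or credentials if your deployment needs them.
    """
    with urlopen(unscheduled_jobs_url(job_uuid, base_url), timeout=timeout) as resp:
        return json.load(resp)
```

Calling `fetch_unscheduled_reasons("5ccf214c-ba94-4304-9959-c6670d4645c6")` for each job stuck in "WAITING" and comparing the results should show whether they are all blocked for the same reason.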
@itspanzi We've created a Slack workspace that the dev team is in, feel free to sign up via:
http://cookscheduler.herokuapp.com/
Hi @dposada and @pschorf - thanks for following up on this thread. So, we were running cook 0.1.10 and also had a bug in our framework id in Mesos (we somehow ended up with invalid characters in the framework id that Cook used in Mesos).
We fixed the framework id and upgraded to the latest Cook version (the one from last week). Things seem to be working as expected. We will keep an eye on it and, if we see anything weird, will drop a note here.
Also - thanks for the Slack info. Will definitely drop by. 👍