Stop workers politely
@mrocklin
Closed & re-opened to try and schedule a Travis build.
I wasn't able to get the job scheduler to reliably use the environment variables. This was probably just due to my ignorance on how to use drmaa and job schedulers effectively.
On Thu, Jun 1, 2017 at 5:54 AM, Antoine Pitrou wrote:
@pitrou commented on this pull request.
In dask_drmaa/core.py https://github.com/dask/dask-drmaa/pull/31#discussion_r119573180:
n_remaining = len(ids)
worker_addresses = []
while len(worker_addresses) < n_remaining:
    try:
        worker, action = yield self._worker_updates.get(deadline)
    except QueueEmpty:
        logger.error("Timed out waiting for the following workers: %s",
                     sorted(ids))
        yield client._shutdown(fast=True)
        raise gen.Return(workers)
    if action == 'add':
        worker_addresses.append(worker)
# We got enough new workers, see if they correspond
# to the runBulkJobs request
environs = yield client._run(get_environ, workers=worker_addresses)

Have you found what was the cause of the flakiness? It seems that --name itself should be decently robust, and I would be surprised if the environment variables weren't always present -- besides, those are the same environment variables my code uses, so it would have the same problem.
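For context, the snippet above waits for the new workers to connect and then runs a get_environ helper on each of them; a minimal sketch of such a helper, assuming it simply reports the worker's environment (the PR's actual helper may differ):

import os

def get_environ():
    # Run on each worker (e.g. via client.run / client._run); returns the
    # worker's environment so scheduler-set variables such as JOB_ID and a
    # task index can be matched back to the DRMAA job that started it.
    return dict(os.environ)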
I wasn't able to get the job scheduler to reliably use the environment variables.
You mean set them? Or are they set manually by the system administrator?
The environment variables are, I think, set by the job scheduler when creating the job. So our current approach is to do something like the following:
dask-worker ... --name $JOB_ID.$TASK_ID
And we expect this to create a name like 16.1. However, I found that this wasn't the case. Perhaps things have improved since then, though. Regardless, my hope was to avoid needing this mapping altogether. Is needing this mapping proving to be important now?
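As a rough sketch of that approach (illustrative only, not the PR's actual code), assuming a hypothetical start-worker.sh wrapper and SGE-style variable names: the wrapper is what sees and expands the scheduler-set variables, and the workers are submitted as one bulk/array job through the drmaa Python package:

# start-worker.sh (hypothetical wrapper the scheduler runs for each array task):
#   #!/bin/bash
#   exec dask-worker tcp://scheduler-address:8786 --name "$JOB_ID.$SGE_TASK_ID"

import drmaa

with drmaa.Session() as session:
    jt = session.createJobTemplate()
    jt.remoteCommand = './start-worker.sh'
    # Submit one array job with tasks 1..4; each task sees the same JOB_ID and
    # its own task index, so the workers register as e.g. 16.1, 16.2, 16.3, 16.4.
    job_ids = session.runBulkJobs(jt, 1, 4, 1)
    session.deleteJobTemplate(jt)
    print("Submitted array job tasks:", job_ids)

Variable names differ between schedulers (SLURM uses SLURM_ARRAY_TASK_ID, LSF uses LSB_JOBINDEX), which is presumably part of what made this hard to do reliably.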
Is needing this mapping proving to be important now?
Well, if we want to "politely" stop workers, then yes it is :-) At least, the environment variables need to be set properly.
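To make the "polite" part concrete, here is a sketch (not code from this PR) of retiring the workers for one job once they are registered under names of the form <job_id>.<task_id>, using distributed's public API:

from distributed import Client

def stop_job_workers(client, job_id):
    # Look up the addresses of all workers whose name starts with this job ID.
    info = client.scheduler_info()   # {'workers': {address: {'name': ..., ...}}}
    addresses = [addr for addr, w in info['workers'].items()
                 if str(w.get('name', '')).startswith(job_id + '.')]
    # retire_workers migrates any in-memory results off these workers before
    # closing them, which is what makes the shutdown "polite".
    client.retire_workers(workers=addresses)

# usage (hypothetical): stop_job_workers(Client('tcp://scheduler-address:8786'), '16')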
This is only if we want to politely stop workers with certain Job IDs though, yes? Is this an important feature?
This is only if we want to politely stop workers with certain Job IDs though, yes?
Oh, I hadn't thought about that. Yes, that's a good point.
Is this an important feature?
I don't know. But it had an xfail test case :-)
Any thoughts on the merge conflicts?
@jakirkham that depends on whether this PR is desirable at all. I see @TomAugspurger made some changes to dask-drmaa semi-recently; perhaps he has an opinion on this.
No strong thoughts. Glancing through the changes, this approach seems better than the changes in https://github.com/dask/dask-drmaa/commit/af9273fb5bd618cf35652522112eba49c90f6688 (which focused only on making sure that the temporary worker directories are cleaned up).
@jakirkham perhaps you would be interested in trying to rebase this PR / fix the conflicts?
Not right now. Maybe later. Mainly interested in getting this released and then making use of it. Afterwards we can revisit outstanding issues.
@jakirkham depending on the kind of system that you're on you may also find this wiki page of interest: https://github.com/pangeo-data/pangeo/wiki/Getting-Started-with-Dask-on-Cheyenne
This is how I tend to operate on HPC systems these days.
That is very interesting. Thanks for the link.
We do have a strategy in place for starting jobs on the cluster. For legacy reasons, it uses ipyparallel and then launches a distributed cluster on top of that. Though am now thinking that maybe we should just use distributed directly. Switching to this drmaa-based startup method looks to be a small change, which will do the job. So think we'll try that near term to address our needs. If this needs to change again for some reason, will revisit other options down the road.
DRMAA seems simple enough that if it fits, it's a good choice. I certainly know of groups that use this package daily. They're able to hand it to new developers, who seem to find it comfortable enough. I think challenges have arisen whenever groups have wanted to do clever things with their job scheduler and the DRMAA interface wasn't sufficiently expressive. In that case, sometimes providing a custom job script to dask-drmaa worked nicely. In other cases it was too complex.
Yeah have used DRMAA in the past for other applications and have found it works quite well for simple tasks. Have pretty minimal requirements as to what the Distributed cluster needs to do. So think this should be ok. Especially after some brief experimentation with it.
This may be just my opinion; however, in cases of more complex usage, it's probably not just DRMAA that is insufficiently expressive, but the underlying scheduler as well.