Stop workers politely
@mrocklin
Closed & re-opened to try and schedule a Travis build.
I wasn't able to get the job scheduler to reliably use the environment variables. This was probably just due to my ignorance on how to use drmaa and job schedulers effectively.
On Thu, Jun 1, 2017 at 5:54 AM, Antoine Pitrou wrote:
@pitrou commented on this pull request.
In dask_drmaa/core.py https://github.com/dask/dask-drmaa/pull/31#discussion_r119573180:
n_remaining = len(ids)
worker_addresses = []
while len(worker_addresses) < n_remaining:
    try:
        worker, action = yield self._worker_updates.get(deadline)
    except QueueEmpty:
        logger.error("Timed out waiting for the following workers: %s",
                     sorted(ids))
        yield client._shutdown(fast=True)
        raise gen.Return(workers)
    if action == 'add':
        worker_addresses.append(worker)
# We got enough new workers, see if they correspond
# to the runBulkJobs request
environs = yield client._run(get_environ, workers=worker_addresses)

Have you found what was the cause of the flakiness? It seems that --name itself should be decently robust, and I would be surprised if the environment variables weren't always present -- besides, those are the same environment variables my code uses, so it would have the same problem.
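For context, the snippet above waits for the new workers to connect and then runs a get_environ helper on each of them; a minimal sketch of such a helper, assuming it simply reports the worker's environment (the PR's actual helper may differ):

import os

def get_environ():
    # Run on each worker (e.g. via client.run / client._run); returns the
    # worker's environment so scheduler-set variables such as JOB_ID and a
    # task index can be matched back to the DRMAA job that started it.
    return dict(os.environ)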
I wasn't able to get the job scheduler to reliably use the environment variables.
You mean set them? Or are they set manually by the system administrator?
The environment variables are, I think, set by the job scheduler when creating the job. So our current approach is to do something like the following:
dask-worker ... --name $JOB_ID.$TASK_ID
And we expect this to create a name like 16.1. However, I found that this wasn't the case. Perhaps things have improved since then, though. Regardless, my hope was to avoid needing this mapping altogether. Is needing this mapping proving to be important now?
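As a rough sketch of that approach (illustrative only, not the PR's actual code), assuming a hypothetical start-worker.sh wrapper and SGE-style variable names: the wrapper is what sees and expands the scheduler-set variables, and the workers are submitted as one bulk/array job through the drmaa Python package:

# start-worker.sh (hypothetical wrapper the scheduler runs for each array task):
#   #!/bin/bash
#   exec dask-worker tcp://scheduler-address:8786 --name "$JOB_ID.$SGE_TASK_ID"

import drmaa

with drmaa.Session() as session:
    jt = session.createJobTemplate()
    jt.remoteCommand = './start-worker.sh'
    # Submit one array job with tasks 1..4; each task sees the same JOB_ID and
    # its own task index, so the workers register as e.g. 16.1, 16.2, 16.3, 16.4.
    job_ids = session.runBulkJobs(jt, 1, 4, 1)
    session.deleteJobTemplate(jt)
    print("Submitted array job tasks:", job_ids)

Variable names differ between schedulers (SLURM uses SLURM_ARRAY_TASK_ID, LSF uses LSB_JOBINDEX), which is presumably part of what made this hard to do reliably.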
Is needing this mapping proving to be important now?
Well, if we want to "politely" stop workers, then yes it is :-) At least, the environment variables need to be set properly.
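To make the "polite" part concrete, here is a sketch (not code from this PR) of retiring the workers for one job once they are registered under names of the form <job_id>.<task_id>, using distributed's public API:

from distributed import Client

def stop_job_workers(client, job_id):
    # Look up the addresses of all workers whose name starts with this job ID.
    info = client.scheduler_info()   # {'workers': {address: {'name': ..., ...}}}
    addresses = [addr for addr, w in info['workers'].items()
                 if str(w.get('name', '')).startswith(job_id + '.')]
    # retire_workers migrates any in-memory results off these workers before
    # closing them, which is what makes the shutdown "polite".
    client.retire_workers(workers=addresses)

# usage (hypothetical): stop_job_workers(Client('tcp://scheduler-address:8786'), '16')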
This is only if we want to politely stop workers with certain Job IDs though, yes? Is this an important feature?
This is only if we want to politely stop workers with certain Job IDs though, yes?
Oh, I hadn't thought about that. Yes, that's a good point.
Is this an important feature?
I don't know. But it had an xfail test case :-)
Any thoughts on the merge conflicts?
@jakirkham that depends on whether this PR is desirable at all. I see @TomAugspurger made some changes to dask-drmaa semi-recently; perhaps he has an opinion on this.
No strong thoughts. Glancing through the changes, this approach seems better than the changes in https://github.com/dask/dask-drmaa/commit/af9273fb5bd618cf35652522112eba49c90f6688 (which focused only on making sure that the temporary worker directories are cleaned up).
@jakirkham perhaps you would be interested in trying to rebase this PR / fix the conflicts?
Not right now. Maybe later. Mainly interested in getting this released and then making use of it. Afterwards we can revisit outstanding issues.
@jakirkham depending on the kind of system that you're on you may also find this wiki page of interest: https://github.com/pangeo-data/pangeo/wiki/Getting-Started-with-Dask-on-Cheyenne
This is how I tend to operate on HPC systems these days.
That is very interesting. Thanks for the link.
We do have a strategy in place for starting jobs on the cluster. For legacy reasons, it uses ipyparallel and then launches a distributed cluster on top of that. Though am now thinking that maybe we should just use distributed directly. Switching to this drmaa-based startup method looks to be a small change, which will do the job. So think we'll try that near term to address our needs. If this needs to change again for some reason, will revisit other options down the road.
DRMAA seems simple enough that if it fits, it's a good choice. I certainly know of groups that use this package daily. They're able to hand it to new developers, who seem to find it comfortable enough. I think challenges have arisen whenever groups have wanted to do clever things with their job scheduler and the DRMAA interface wasn't sufficiently expressive. In that case, sometimes providing a custom job script to dask-drmaa worked nicely. In other cases it was too complex.
Yeah have used DRMAA in the past for other applications and have found it works quite well for simple tasks. Have pretty minimal requirements as to what the Distributed cluster needs to do. So think this should be ok. Especially after some brief experimentation with it.
This may be just my opinion; however, in cases of more complex usage, it's probably not just DRMAA that is insufficiently expressive, but the underlying scheduler as well.