[WIP] Use of new dask CLI: job submission
Implementation of `dask job submit`. This is intended to consume a Python script as:

```
dask job submit <cluster-name> <script-path>
```
There are potentially two different usage patterns we are anticipating:

- the script features use of a dask collection or builds `delayed` objects and calls `<object>.compute()`; this should use the dask cluster to execute all `.compute()` calls.
- the script features no dask-isms at all, but is intended to be a single-worker job, as if the script were just a function submitted via `client.submit`.
It may make sense for these two patterns to be served by two different subcommands. Exploring that here to settle on an approach.
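To make the first pattern concrete, here is a hypothetical script of the kind this subcommand might consume: it builds `delayed` objects and calls `.compute()`, which under this proposal would execute on the target cluster rather than locally. This is purely an illustrative sketch, not code from the PR.

```python
# Hypothetical script for the first usage pattern: dask-aware code whose
# .compute() calls should be routed to the named cluster by `dask job submit`.
import dask


@dask.delayed
def inc(x):
    return x + 1


# Build a small task graph of delayed objects, then reduce them.
total = dask.delayed(sum)([inc(i) for i in range(10)])

# Under the proposal, this .compute() would run on the cluster.
print(total.compute())  # -> 55
```

The second pattern would be the same script minus any dask constructs, treated as one opaque function run on a single worker.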
- [ ] Tests added / passed
- [ ] Passes `pre-commit run --all-files`
Can one of the admins verify this patch?
Unit Test Results
See test report for an extended history of previous test failures. This is useful for diagnosing flaky tests.
15 files ±0&nbsp;&nbsp;15 suites ±0&nbsp;&nbsp;6h 7m 17s :stopwatch: -18m 37s
2 977 tests ±0&nbsp;&nbsp;2 888 :heavy_check_mark: +1&nbsp;&nbsp;87 :zzz: -1&nbsp;&nbsp;2 :x: ±0
22 072 runs ±0&nbsp;&nbsp;21 035 :heavy_check_mark: +1&nbsp;&nbsp;1 035 :zzz: -1&nbsp;&nbsp;2 :x: ±0
For more details on these failures, see this check.
Results for commit b854dc98. ± Comparison against base commit 930d3dc0.
I think that we need to think a little bit about how this would be used. When submitting a script how does that script get access to the scheduler? Naively I might think that the following should work:
```python
import dask.dataframe  # needed for dask.dataframe.read_parquet below
from dask.distributed import Client

client = Client()  # connects to the current Dask scheduler

df = dask.dataframe.read_parquet(...)
df....
```
Great. If this is the kind of workflow that we're thinking of then we'll need to address a few issues:
- Make sure that we have the right address of the scheduler, and that we can connect to it reliably (for example, what happens if the scheduler is under TLS security?)
- Make sure that our script can be a normal blocking script and not block the event loop (`run_on_scheduler` might not work?)
If this isn't the kind of workflow that we're thinking of then we should figure out what that workflow is and make sure that it works well.
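One naive way to address the discovery question above is for the submit machinery to hand the script the scheduler address through the environment. The sketch below assumes a hypothetical `DASK_SCHEDULER_ADDRESS` environment variable (and, for a TLS-secured scheduler, hypothetical variables pointing at client credentials); none of these names are an existing convention in this PR.

```python
# Sketch: how a submitted script might discover its scheduler, assuming the
# submit machinery exports DASK_SCHEDULER_ADDRESS (a hypothetical variable).
import os


def scheduler_connect_kwargs(env=os.environ):
    """Build kwargs a submitted script could pass to distributed.Client."""
    address = env.get("DASK_SCHEDULER_ADDRESS", "tcp://127.0.0.1:8786")
    kwargs = {"address": address}
    if address.startswith("tls://"):
        # A TLS-secured scheduler also needs client credentials shipped to
        # the script; these variable names are likewise assumptions.
        from distributed.security import Security

        kwargs["security"] = Security(
            tls_ca_file=env["DASK_TLS_CA_FILE"],
            tls_client_cert=env["DASK_TLS_CERT"],
            tls_client_key=env["DASK_TLS_KEY"],
        )
    return kwargs


# In the submitted script:
#   from dask.distributed import Client
#   client = Client(**scheduler_connect_kwargs())
```

Keeping `Client(...)` as a plain blocking call in the script sidesteps the event-loop concern, since the script runs as an ordinary process rather than inside the scheduler's loop.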
Like @mrocklin said, the cluster discovery seems like a major question here (and something that I believe has been out of scope for the core dask project up until now). This seems closely related to dask-ctl https://github.com/dask-contrib/dask-ctl, cc @jacobtomlinson. Maybe a job subcommand would make more sense as a part of dask-ctl than core dask?
@gjoseph92 this work was done at the SciPy sprints, where @mrocklin, @jsignell, @jcrist, @charlesbluca and I made some longer-term plans about the CLI in Dask generally. I think the plan is to migrate some of the core functionality from dask-ctl up to distributed at some point and merge it into the existing CLI tooling in a more consistent way.
@douglasdavis started some of this work to move us towards an extensible dask CLI command. @dotsdl picked up this idea around submitting scripts and ran with it.
I have an action from the sprints to write this topic up in design-docs.
Haven't had time to circle back to this, folks. Apologies for the delay, and thank you for the feedback!