[CoreWorker] Lazily bind core_worker's job_config through task spec.
Why are these changes needed?
Previously the worker got its job_config from the raylet at construction time, which prevented us from lazily binding a job_config to workers. This PR enables lazy binding by piggybacking the job_config on the TaskSpec and initializing it when the worker receives its first task execution request (PushTask).
We also refactor the WorkerContext and RayletClient as part of the change.
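A minimal Python sketch of the idea, purely for illustration (the real change lives in the C++ CoreWorker; `JobConfig`, `TaskSpec`, and `WorkerContext` below are simplified stand-ins, not Ray's actual APIs): the worker starts without a job config and binds whatever is piggybacked on the first task spec it executes.

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class JobConfig:          # simplified stand-in for rpc::JobConfig
    runtime_env: dict = field(default_factory=dict)


@dataclass
class TaskSpec:           # simplified stand-in for the real TaskSpec, now carrying the job config
    function: str
    job_config: JobConfig


class WorkerContext:      # simplified stand-in for the refactored WorkerContext
    def __init__(self) -> None:
        # No longer fetched from the raylet at construction time.
        self.job_config: Optional[JobConfig] = None

    def maybe_bind_job_config(self, spec: TaskSpec) -> None:
        # Lazily bind on the first PushTask; subsequent tasks reuse it.
        if self.job_config is None:
            self.job_config = spec.job_config

    def execute(self, spec: TaskSpec) -> None:
        self.maybe_bind_job_config(spec)
        print(f"running {spec.function} with runtime_env={self.job_config.runtime_env}")


if __name__ == "__main__":
    ctx = WorkerContext()
    cfg = JobConfig({"env_vars": {"FOO": "1"}})
    ctx.execute(TaskSpec("f", cfg))   # binds the job config here
    ctx.execute(TaskSpec("g", cfg))   # already bound, no re-initialization
```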
Related issue number
Checks
- [ ] I've signed off every commit (by using the -s flag, i.e., `git commit -s`) in this PR.
- [ ] I've run `scripts/format.sh` to lint the changes in this PR.
- [ ] I've included any doc changes needed for https://docs.ray.io/en/master/.
- [ ] I've made sure the tests are passing. Note that there might be a few flaky tests, see the recent failures at https://flakey-tests.ray.io/
- Testing Strategy
  - [ ] Unit tests
  - [ ] Release tests
  - [ ] This PR is not tested :(
@liuyang-my the Java test failed, but I'm not quite sure what exactly happened from reading the logs. Do you know what might have gone wrong? (Presumably we are hitting some deadlock issue?) https://buildkite.com/ray-project/oss-ci-build-pr/builds/8396#01857127-086b-406b-92da-6f935dcc8447 is the failed test.
Question: will this slow down perf? I think this adds the runtime env to every task spec (previously it was passed only once). Do you mind benchmarking the perf regression?
Besides this, do you think it would be good to pass the job config to the workers through stdin? Done that way, we could probably limit all the changes to the worker pool.
I'm also thinking this may need extension in the future; we probably don't want to pass everything through the task spec.
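For reference, a rough sketch of what the stdin alternative could look like, with made-up names and none of Ray's actual worker-pool code: the worker pool writes the serialized job config to the worker process's stdin at startup, so the task spec itself stays unchanged.

```python
import json
import subprocess
import sys


def start_worker(job_config: dict) -> subprocess.Popen:
    # Worker-pool side: spawn the worker and hand it the job config via stdin.
    proc = subprocess.Popen(
        [sys.executable, __file__, "--worker"],
        stdin=subprocess.PIPE,
        text=True,
    )
    proc.stdin.write(json.dumps(job_config) + "\n")  # one line of JSON on stdin
    proc.stdin.flush()
    return proc


def worker_main() -> None:
    # Worker side: bind the job config before executing any task.
    job_config = json.loads(sys.stdin.readline())
    print("worker bound job config:", job_config)


if __name__ == "__main__":
    if "--worker" in sys.argv:
        worker_main()
    else:
        start_worker({"runtime_env": {"env_vars": {"FOO": "1"}}}).wait()
```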
Btw, I'm OK with this approach if the benchmark with a job config is OK. But let's add a comment to the job config proto to let people know it's passed to all tasks repeatedly.
Still reviewing...
thanks for reviewing!
Kicking off a benchmark here: https://buildkite.com/ray-project/release-tests-pr/builds/24753
No external behavior change; the only thing that changes is that we delay the job_config binding until the worker receives its first task spec.
@scv119 the default microbenchmark doesn't use job config. What concerns me is this:
https://sourcegraph.com/github.com/ray-project/ray@master/-/blob/src/ray/protobuf/gcs.proto?L311
Could you add some runtime envs there, for example a list of OS env vars and a working dir, to see the perf? You might need to run it on your dev machine. I think measuring task throughput is good enough.
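Something along these lines would do it (a hedged sketch only, not the script that produced the numbers below): attach a non-trivial runtime_env so every task spec carries env vars and a working_dir, then measure small-task throughput.

```python
import time

import ray

# Dummy runtime env: a couple of OS env vars plus a working_dir, so every
# task spec now carries this payload.
ray.init(
    runtime_env={
        "env_vars": {"FOO": "bar", "BAZ": "qux"},
        "working_dir": ".",
    }
)


@ray.remote
def noop():
    return 0


start = time.perf_counter()
ray.get([noop.remote() for _ in range(10_000)])
print("small-task throughput:", 10_000 / (time.perf_counter() - start), "tasks/s")
```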
Ran the single_node microbenchmark with a dummy runtime env (full script: https://gist.github.com/scv119/dc5ff45f9172f4ac4e548cc2c57bc460):
With this PR:
Many args time: 17.709704568999996 (10000 args)
Many returns time: 6.086816375999888 (3000 returns)
Ray.get time: 25.223439912999993 (10000 args)
Queued task time: 205.9235447819999 (1000000 tasks)
Ray.get large object time: 258.166412434 (107374182400 bytes)
Many args time: 17.972216645000117 (10000 args)
Many returns time: 5.942020074000084 (3000 returns)
Ray.get time: 25.89167107900016 (10000 args)
Queued task time: 206.02459286499993 (1000000 tasks)
Ray.get large object time: 256.5160020129997 (107374182400 bytes)
There are some test failures...
Btw, for the Java failures: it looks like most of the issues are related to the runtime context, meaning they probably call this API before the job config is synced to the workers (which seems pretty odd).
I'll disable the Java tests for now. @MisterLin1995 will fix them in the follow-up PRs #31590 and #31593.