[CoreWorker] Lazily bind core_worker's job_config through task spec.
Why are these changes needed?
Previously the worker got its job_config from the raylet at construction time, which prevented us from lazily binding a job_config to workers. This PR enables lazy binding by piggybacking the job_config on the TaskSpec and initializing it when the worker receives its first task execution request (PushTask).
We also refactor the WorkerContext and RayletClient as part of the change.
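A minimal Python sketch of the idea, purely for illustration (the real change lives in the C++ CoreWorker; `JobConfig`, `TaskSpec`, and `WorkerContext` below are simplified stand-ins, not Ray's actual APIs): the worker starts without a job config and binds whatever is piggybacked on the first task spec it executes.

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class JobConfig:          # simplified stand-in for rpc::JobConfig
    runtime_env: dict = field(default_factory=dict)


@dataclass
class TaskSpec:           # simplified stand-in for the real TaskSpec, now carrying the job config
    function: str
    job_config: JobConfig


class WorkerContext:      # simplified stand-in for the refactored WorkerContext
    def __init__(self) -> None:
        # No longer fetched from the raylet at construction time.
        self.job_config: Optional[JobConfig] = None

    def maybe_bind_job_config(self, spec: TaskSpec) -> None:
        # Lazily bind on the first PushTask; subsequent tasks reuse it.
        if self.job_config is None:
            self.job_config = spec.job_config

    def execute(self, spec: TaskSpec) -> None:
        self.maybe_bind_job_config(spec)
        print(f"running {spec.function} with runtime_env={self.job_config.runtime_env}")


if __name__ == "__main__":
    ctx = WorkerContext()
    cfg = JobConfig({"env_vars": {"FOO": "1"}})
    ctx.execute(TaskSpec("f", cfg))   # binds the job config here
    ctx.execute(TaskSpec("g", cfg))   # already bound, no re-initialization
```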
Related issue number
Checks
- [ ] I've signed off every commit (by using the -s flag, i.e., `git commit -s`) in this PR.
- [ ] I've run `scripts/format.sh` to lint the changes in this PR.
- [ ] I've included any doc changes needed for https://docs.ray.io/en/master/.
- [ ] I've made sure the tests are passing. Note that there might be a few flaky tests, see the recent failures at https://flakey-tests.ray.io/
- Testing Strategy
  - [ ] Unit tests
  - [ ] Release tests
  - [ ] This PR is not tested :(
@liuyang-my the Java test failed, but I'm not quite sure what exactly happened from reading the logs. Do you know what might have gone wrong? (Presumably we are hitting some deadlock issue?) https://buildkite.com/ray-project/oss-ci-build-pr/builds/8396#01857127-086b-406b-92da-6f935dcc8447 is the failed test.
Question: will this slow down perf? I think this adds the runtime env to every task spec (previously it was passed only once). Do you mind benchmarking the perf regression?
Besides this, do you think it would be good to pass the job config to the workers through stdin? Done that way, we could probably limit all the changes to the worker pool.
I'm also thinking this may need extension in the future; we probably don't want to pass everything through the task spec.
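For reference, a rough sketch of what the stdin alternative could look like, with made-up names and none of Ray's actual worker-pool code: the worker pool writes the serialized job config to the worker process's stdin at startup, so the task spec itself stays unchanged.

```python
import json
import subprocess
import sys


def start_worker(job_config: dict) -> subprocess.Popen:
    # Worker-pool side: spawn the worker and hand it the job config via stdin.
    proc = subprocess.Popen(
        [sys.executable, __file__, "--worker"],
        stdin=subprocess.PIPE,
        text=True,
    )
    proc.stdin.write(json.dumps(job_config) + "\n")  # one line of JSON on stdin
    proc.stdin.flush()
    return proc


def worker_main() -> None:
    # Worker side: bind the job config before executing any task.
    job_config = json.loads(sys.stdin.readline())
    print("worker bound job config:", job_config)


if __name__ == "__main__":
    if "--worker" in sys.argv:
        worker_main()
    else:
        start_worker({"runtime_env": {"env_vars": {"FOO": "1"}}}).wait()
```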
Btw, I'm OK with this approach if the benchmark with a job config is OK. But let's add a comment to the job config proto to let people know it's passed to all tasks repeatedly.
Still reviewing...
thanks for reviewing!
Kicking off a benchmark here: https://buildkite.com/ray-project/release-tests-pr/builds/24753
No external behavior change; the only thing that changes is that we delay the job_config binding until the worker receives its first task spec.
@scv119 the default microbenchmark doesn't use job config. What concerns me is this:
https://sourcegraph.com/github.com/ray-project/ray@master/-/blob/src/ray/protobuf/gcs.proto?L311
Could you add some runtime envs there, for example a list of OS env vars and a working dir, to see the perf? You might need to run it on your dev machine. I think measuring task throughput is good enough.
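Something along these lines would do it (a hedged sketch only, not the script that produced the numbers below): attach a non-trivial runtime_env so every task spec carries env vars and a working_dir, then measure small-task throughput.

```python
import time

import ray

# Dummy runtime env: a couple of OS env vars plus a working_dir, so every
# task spec now carries this payload.
ray.init(
    runtime_env={
        "env_vars": {"FOO": "bar", "BAZ": "qux"},
        "working_dir": ".",
    }
)


@ray.remote
def noop():
    return 0


start = time.perf_counter()
ray.get([noop.remote() for _ in range(10_000)])
print("small-task throughput:", 10_000 / (time.perf_counter() - start), "tasks/s")
```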
Ran the single_node microbenchmark with a dummy runtime env (full script: https://gist.github.com/scv119/dc5ff45f9172f4ac4e548cc2c57bc460):
With this PR:
Many args time: 17.709704568999996 (10000 args)
Many returns time: 6.086816375999888 (3000 returns)
Ray.get time: 25.223439912999993 (10000 args)
Queued task time: 205.9235447819999 (1000000 tasks)
Ray.get large object time: 258.166412434 (107374182400 bytes)
Many args time: 17.972216645000117 (10000 args)
Many returns time: 5.942020074000084 (3000 returns)
Ray.get time: 25.89167107900016 (10000 args)
Queued task time: 206.02459286499993 (1000000 tasks)
Ray.get large object time: 256.5160020129997 (107374182400 bytes)
There are some test failures...
Btw, for the Java failures: it looks like most of the issues are related to the runtime context, meaning they probably call this API before the job config is synced to the workers (which seems pretty odd).
I'll disable the Java tests for now. @MisterLin1995 will fix them in the follow-up PRs #31590 and #31593.