[core] Make preloading Jemalloc configurable for worker
Why are these changes needed?
The PR https://github.com/ray-project/ray/pull/39446 disables preloading Jemalloc for workers totally. However, Jemalloc is still useful in some cases, and we could make it configurable if user setting env RAY_LD_PRELOAD as 0.
I did a inference test with limited memory, and we can see the OOM counts decrease from 900+ to 700.
Related issue number
Closes #47242
Checks
- [x] I've signed off every commit(by using the -s flag, i.e.,
git commit -s) in this PR. - [x] I've run
scripts/format.shto lint the changes in this PR. - [x] I've included any doc changes needed for https://docs.ray.io/en/master/.
- [ ] I've added any new APIs to the API Reference. For example, if I added a
method in Tune, I've added it in
doc/source/tune/api/under the corresponding.rstfile.
- [ ] I've added any new APIs to the API Reference. For example, if I added a
method in Tune, I've added it in
- [x] I've made sure the tests are passing. Note that there might be a few flaky tests, see the recent failures at https://flakey-tests.ray.io/
- Testing Strategy
- [x] Unit tests
- [ ] Release tests
- [ ] This PR is not tested :(
@fishbone @rkooo567 could you please take a review?
Please add your inference example and benchmark as well.
can you add a new section in this page to explain how to enable jemalloc for workers? https://docs.ray.io/en/master/ray-core/miscellaneous.html#tuning-ray-settings
@rkooo567 Sure, I'll add a doc for this feature. However, why not adding the description in the memory-profiling doc?
Please add your inference example and benchmark as well.
@hongchaodeng I have added the example in the description, and I think the previous pasted image could describe the benchmark well.
@rkooo567 Sure, I'll add a doc for this feature. However, why not adding the description in the memory-profiling doc?
I assume it is more of performance feature (that optimizes the memory usage). is enabling jemalloc allowing you to do mem profiling as well?
@rkooo567 Sure, I'll add a doc for this feature. However, why not adding the description in the memory-profiling doc?
I assume it is more of performance feature (that optimizes the memory usage). is enabling jemalloc allowing you to do mem profiling as well?
I see, enabling Jemalloc for workers could also benefit for memory profiling, I will mentation this in the tuning and profiling page.
We also need to enable jemalloc for the worker processes because some C++ code is wrapped into Python workers, jemalloc is very effective for optimizing the memory of this part of c++ code.
@Myasuka this makes sense to me, is this still something you want to merge?
@Myasuka this makes sense to me, is this still something you want to merge?
Sure, could you help to review and merge this PR? I will update this PR this week.
This pull request has been automatically marked as stale because it has not had any activity for 14 days. It will be closed in another 14 days if no further activity occurs. Thank you for your contributions.
You can always ask for help on our discussion forum or Ray's public slack channel.
If you'd like to keep this open, just leave any comment, and the stale label will be removed.
@rkooo567 since https://docs.ray.io/en/master/ray-core/miscellaneous.html#tuning-ray-settings have been removed, I still add the docs in https://docs.ray.io/en/master/ray-contribute/profiling.html#memory-profiling
@dayshah Please also take a review on this update.
@dayshah Just wondering, any reason why we're using 0/1 here instead of true/false? If this is a user-facing concept and the values are equivalent, the boolean operators are more friendly to humans.
@dayshah Just wondering, any reason why we're using 0/1 here instead of true/false? If this is a user-facing concept and the values are equivalent, the boolean operators are more friendly to humans.
I think it's because 0 and 1 usually translate to boolean values in C++, and the environment variables used in C++ are set using 0 or 1, like these https://github.com/ray-project/ray/blob/255f7a92ad608dbeed880ade608aed26de567ddf/src/ray/common/ray_config_def.h#L107
I don't see a reason to not use true/false here though, since we're checking against the number.
generally lgtm, can you address the doc comments
Also, I would prefer if we rename the
RAY_LD_PRELOADenv variable to something likeRAY_LD_PRELOAD_ON_WORKERS, which gets set to 0 by default. This makes it easier for a user to understand without looking through code and docs.
I think RAY_LD_PRELOAD_ON_WORKERS looks more readable, and we could setting it as false by default to align with previous behavior.
I have updated this PR, please take a look @dayshah @dstrodtman
https://buildkite.com/ray-project/premerge/builds/41233 some lint failures
https://buildkite.com/ray-project/premerge/builds/41233 some lint failures
@dayshah updated and all tests passes.