ray icon indicating copy to clipboard operation
ray copied to clipboard

[core] Make preloading Jemalloc configurable for worker

Open Myasuka opened this issue 1 year ago • 2 comments

Why are these changes needed?

The PR https://github.com/ray-project/ray/pull/39446 disables preloading Jemalloc for workers totally. However, Jemalloc is still useful in some cases, and we could make it configurable if user setting env RAY_LD_PRELOAD as 0.

I did a inference test with limited memory, and we can see the OOM counts decrease from 900+ to 700. image

Related issue number

Closes #47242

Checks

  • [x] I've signed off every commit(by using the -s flag, i.e., git commit -s) in this PR.
  • [x] I've run scripts/format.sh to lint the changes in this PR.
  • [x] I've included any doc changes needed for https://docs.ray.io/en/master/.
    • [ ] I've added any new APIs to the API Reference. For example, if I added a method in Tune, I've added it in doc/source/tune/api/ under the corresponding .rst file.
  • [x] I've made sure the tests are passing. Note that there might be a few flaky tests, see the recent failures at https://flakey-tests.ray.io/
  • Testing Strategy
    • [x] Unit tests
    • [ ] Release tests
    • [ ] This PR is not tested :(

Myasuka avatar Aug 21 '24 09:08 Myasuka

@fishbone @rkooo567 could you please take a review?

Myasuka avatar Aug 22 '24 12:08 Myasuka

Please add your inference example and benchmark as well.

hongchaodeng avatar Aug 28 '24 19:08 hongchaodeng

can you add a new section in this page to explain how to enable jemalloc for workers? https://docs.ray.io/en/master/ray-core/miscellaneous.html#tuning-ray-settings

@rkooo567 Sure, I'll add a doc for this feature. However, why not adding the description in the memory-profiling doc?

Myasuka avatar Aug 29 '24 12:08 Myasuka

Please add your inference example and benchmark as well.

@hongchaodeng I have added the example in the description, and I think the previous pasted image could describe the benchmark well.

Myasuka avatar Aug 29 '24 12:08 Myasuka

@rkooo567 Sure, I'll add a doc for this feature. However, why not adding the description in the memory-profiling doc?

I assume it is more of performance feature (that optimizes the memory usage). is enabling jemalloc allowing you to do mem profiling as well?

rkooo567 avatar Aug 29 '24 16:08 rkooo567

@rkooo567 Sure, I'll add a doc for this feature. However, why not adding the description in the memory-profiling doc?

I assume it is more of performance feature (that optimizes the memory usage). is enabling jemalloc allowing you to do mem profiling as well?

I see, enabling Jemalloc for workers could also benefit for memory profiling, I will mentation this in the tuning and profiling page.

Myasuka avatar Sep 05 '24 13:09 Myasuka

We also need to enable jemalloc for the worker processes because some C++ code is wrapped into Python workers, jemalloc is very effective for optimizing the memory of this part of c++ code.

MissiontoMars avatar Oct 30 '24 07:10 MissiontoMars

@Myasuka this makes sense to me, is this still something you want to merge?

dayshah avatar May 06 '25 20:05 dayshah

@Myasuka this makes sense to me, is this still something you want to merge?

Sure, could you help to review and merge this PR? I will update this PR this week.

Myasuka avatar May 12 '25 09:05 Myasuka

This pull request has been automatically marked as stale because it has not had any activity for 14 days. It will be closed in another 14 days if no further activity occurs. Thank you for your contributions.

You can always ask for help on our discussion forum or Ray's public slack channel.

If you'd like to keep this open, just leave any comment, and the stale label will be removed.

github-actions[bot] avatar Jun 02 '25 00:06 github-actions[bot]

@rkooo567 since https://docs.ray.io/en/master/ray-core/miscellaneous.html#tuning-ray-settings have been removed, I still add the docs in https://docs.ray.io/en/master/ray-contribute/profiling.html#memory-profiling

@dayshah Please also take a review on this update.

Myasuka avatar Jun 02 '25 18:06 Myasuka

@dayshah Just wondering, any reason why we're using 0/1 here instead of true/false? If this is a user-facing concept and the values are equivalent, the boolean operators are more friendly to humans.

dstrodtman avatar Jun 02 '25 18:06 dstrodtman

@dayshah Just wondering, any reason why we're using 0/1 here instead of true/false? If this is a user-facing concept and the values are equivalent, the boolean operators are more friendly to humans.

I think it's because 0 and 1 usually translate to boolean values in C++, and the environment variables used in C++ are set using 0 or 1, like these https://github.com/ray-project/ray/blob/255f7a92ad608dbeed880ade608aed26de567ddf/src/ray/common/ray_config_def.h#L107

I don't see a reason to not use true/false here though, since we're checking against the number.

dayshah avatar Jun 02 '25 19:06 dayshah

generally lgtm, can you address the doc comments

Also, I would prefer if we rename the RAY_LD_PRELOAD env variable to something like RAY_LD_PRELOAD_ON_WORKERS, which gets set to 0 by default. This makes it easier for a user to understand without looking through code and docs.

I think RAY_LD_PRELOAD_ON_WORKERS looks more readable, and we could setting it as false by default to align with previous behavior.

Myasuka avatar Jun 03 '25 14:06 Myasuka

I have updated this PR, please take a look @dayshah @dstrodtman

Myasuka avatar Jun 03 '25 15:06 Myasuka

https://buildkite.com/ray-project/premerge/builds/41233 some lint failures

dayshah avatar Jun 03 '25 16:06 dayshah

https://buildkite.com/ray-project/premerge/builds/41233 some lint failures

@dayshah updated and all tests passes.

Myasuka avatar Jun 05 '25 08:06 Myasuka