
Why Does CPU Memory Usage Keep Increasing?

[Open] shuoyinn opened this issue 6 months ago • 23 comments

Thanks for your contribution! Great work!

When I was running PPO with Ray (GRPO), the RAM usage during training looked like this:

Image

Observations:

  1. At each saving point, the cache becomes noticeably larger
  2. Though not as obvious as at the saving points, the RSS also keeps increasing with the number of steps

I've tried setting pin_memory=False everywhere for the DeepSpeed dataloader/ZeRO, but the curves didn't change (still increasing).

Are there some CPU variables (like CPU tensors) unexpectedly kept alive in CPU memory? And why does saving a DeepSpeed checkpoint affect the RAM cache?

shuoyinn avatar Jun 18 '25 19:06 shuoyinn

Do you use vLLM sleep? And could you try gc.collect() and ray.internal.free_objects() after each training step?

hijkzzz avatar Jun 19 '25 00:06 hijkzzz
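For illustration, a rough sketch of what such a per-step cleanup could look like (the hook name and `experience_refs` are placeholders, not OpenRLHF code; Ray's free() is an internal API whose module path differs across Ray versions):

```python
import gc

import ray


def cleanup_after_step(experience_refs):
    """Hypothetical per-step cleanup: drop Python-level garbage and plasma objects."""
    gc.collect()  # reclaim CPU tensors kept alive only by reference cycles
    # Eagerly free the Ray objects produced for this step's experiences.
    # Note: this is an internal API; its exact location varies by Ray version.
    ray._private.internal_api.free(experience_refs)
```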

After I set the saving interval to 1000000 steps (i.e., never save a ckpt), the RAM cache (orange) no longer increased in a staircase pattern. So it seems that DeepSpeed's actor-model saving function is the cause.

shuoyinn avatar Jun 19 '25 02:06 shuoyinn

Do you use vLLM sleep? And could you try gc.collect() and ray.internal.free_objects() after each training step?

Thank you for your prompt reply!

I used vLLM sleep as suggested in your doc, but I have not tried gc.collect() and ray.internal.free_objects(). I will try them and report the results here soon.

shuoyinn avatar Jun 19 '25 02:06 shuoyinn

Thanks, please try this: https://github.com/OpenRLHF/OpenRLHF/commit/348e8b4ee0e2309e549644b3b413eca0fe1367df

hijkzzz avatar Jun 19 '25 03:06 hijkzzz

Thanks, please try this: 348e8b4

Hello, I've tried it and it doesn't work. The RAM footprint still increases in a staircase pattern (orange):

Image

I found that if I don't use adam_offload, the yellow curve (RSS) does not rise (or rises only very slightly). But as for the non-negligible increase (the orange curve) brought by every ckpt saving step, I really cannot locate the cause. All I know is that skipping ckpt saving avoids the problem.

shuoyinn avatar Jun 19 '25 05:06 shuoyinn
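One way to narrow this down is to log host memory right around the checkpoint call. A small sketch using psutil (the commented-out save call is a placeholder for wherever the ckpt is actually written):

```python
import psutil


def log_host_memory(tag: str) -> None:
    """Print this process's RSS and the system-wide page cache (Linux)."""
    rss_gb = psutil.Process().memory_info().rss / 1e9
    cached_gb = getattr(psutil.virtual_memory(), "cached", 0) / 1e9
    print(f"[{tag}] rss={rss_gb:.2f} GB  page_cache={cached_gb:.2f} GB")


log_host_memory("before save")
# engine.save_checkpoint(save_dir, tag=f"global_step_{step}")  # placeholder call site
log_host_memory("after save")
```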

I need to train for 4k steps, and the CPU OOM (which hits after > 1k steps for me) triggers this Ray error:

The actor is dead because its worker process has died. Worker exit type: SYSTEM_ERROR Worker exit detail: Worker unexpectedly exits with a connection error code 2. End of file. There are some potential root causes. (1) The process is killed by SIGKILL by OOM killer due to high memory usage. (2) ray stop --force is called. (3) The worker is crashed unexpectedly due to SIGSEGV or other unexpected errors.

shuoyinn avatar Jun 19 '25 06:06 shuoyinn

Do you use save_hf_ckpt, or do you just save the DeepSpeed checkpoint? I think this is a bug in DeepSpeed. I have forwarded this issue to the DeepSpeed GitHub repo: https://github.com/deepspeedai/DeepSpeed/issues/7370.

hijkzzz avatar Jun 19 '25 07:06 hijkzzz

Do you use save_hf_ckpt, or do you just save the DeepSpeed checkpoint? I think this is a bug in DeepSpeed. I have forwarded this issue to the DeepSpeed GitHub repo: deepspeedai/DeepSpeed#7370.

I've tried disabling --save_hf_ckpt but it doesn't work, so the issue seems to be related to the DeepSpeed saving function deepspeed.DeepSpeedEngine.save_checkpoint. My deepspeed version is 0.16.4.

Could you please try to reproduce this issue, to rule out factors specific to my environment?

shuoyinn avatar Jun 19 '25 08:06 shuoyinn
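For reference, the call under suspicion is the engine-level save, roughly invoked like the sketch below (the wrapper name and tag format are illustrative, not OpenRLHF's exact code):

```python
import deepspeed


def save_actor_ckpt(engine: "deepspeed.DeepSpeedEngine", save_dir: str, step: int) -> None:
    # Writes this rank's ZeRO-partitioned model/optimizer states under save_dir/<tag>/.
    engine.save_checkpoint(save_dir, tag=f"global_step_{step}")
```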

Do you use save_hf_ckpt, or do you just save the DeepSpeed checkpoint? I think this is a bug in DeepSpeed. I have forwarded this issue to the DeepSpeed GitHub repo: deepspeedai/DeepSpeed#7370.

I've tried disabling --save_hf_ckpt but it doesn't work, so the issue seems to be related to the DeepSpeed saving function deepspeed.DeepSpeedEngine.save_checkpoint. My deepspeed version is 0.16.4.

Could you please try to reproduce this issue, to rule out factors specific to my environment?

I have never encountered this RAM issue when using DeepSpeed in other scenarios that also save many ckpts (like SFT).

shuoyinn avatar Jun 19 '25 08:06 shuoyinn

Do you use save_hf_ckpt, or do you just save the DeepSpeed checkpoint? I think this is a bug in DeepSpeed. I have forwarded this issue to the DeepSpeed GitHub repo: deepspeedai/DeepSpeed#7370.

I've tried disabling --save_hf_ckpt but it doesn't work, so the issue seems to be related to the DeepSpeed saving function deepspeed.DeepSpeedEngine.save_checkpoint. My deepspeed version is 0.16.4. Could you please try to reproduce this issue, to rule out factors specific to my environment?

I have never encountered this RAM issue when using DeepSpeed in other scenarios that also save many ckpts (like SFT).

Try deepspeed 0.17.1? I think this is because you do not use adam_offload in SFT.

hijkzzz avatar Jun 19 '25 08:06 hijkzzz

Do you use save_hf_ckpt, or do you just save the DeepSpeed checkpoint? I think this is a bug in DeepSpeed. I have forwarded this issue to the DeepSpeed GitHub repo: deepspeedai/DeepSpeed#7370.

I've tried disabling --save_hf_ckpt but it doesn't work, so the issue seems to be related to the DeepSpeed saving function deepspeed.DeepSpeedEngine.save_checkpoint. My deepspeed version is 0.16.4. Could you please try to reproduce this issue, to rule out factors specific to my environment?

I have never encountered this RAM issue when using DeepSpeed in other scenarios that also save many ckpts (like SFT).

Try deepspeed 0.17.1?

I've tried deepspeed 0.17.1 and found that the deepspeed version is not the cause. But now I can describe the issue more precisely.

I think this is because you do not use adam_offload in SFT.

Yes, you are right: adam_offload is indeed an obvious difference between the OpenRLHF RL setup and the common DeepSpeed SFT settings I usually use.

The new observations are as follows:

  1. Set --max_ckpt_num to 3 and disable --adam_offload: the curve increases 3 times (once per ckpt saving step) but stays stable afterwards
Image
  2. Set --max_ckpt_num to 10000 (effectively infinite): the curve keeps rising until CPU OOM (which triggers the Ray error)
Image
  3. Set --max_ckpt_num to 3 and enable --adam_offload: the curve first increases 3 times noticeably (once per ckpt saving step) and keeps rising afterwards, though more slowly than during the first 3 saves
Image

shuoyinn avatar Jun 19 '25 12:06 shuoyinn

However, although I can set --max_ckpt_num to a small value and disable --adam_offload to avoid the CPU OOM, I think this is still a bug: in theory, neither --max_ckpt_num nor --adam_offload should increase the CPU memory footprint (the former is designed to limit disk usage and the latter to save CUDA memory).

shuoyinn avatar Jun 19 '25 12:06 shuoyinn
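For context on the first flag: --max_ckpt_num only rotates old checkpoint directories to bound disk usage, conceptually something like the sketch below (paths and deletion policy are illustrative, not OpenRLHF's actual implementation), so it indeed has no obvious reason to touch CPU memory:

```python
import os
import shutil


def rotate_checkpoints(ckpt_root: str, max_ckpt_num: int) -> None:
    """Keep only the newest max_ckpt_num checkpoint directories (bounds disk use only)."""
    dirs = [os.path.join(ckpt_root, d) for d in os.listdir(ckpt_root)
            if os.path.isdir(os.path.join(ckpt_root, d))]
    dirs.sort(key=os.path.getmtime)  # oldest first
    for stale in dirs[:-max_ckpt_num]:
        shutil.rmtree(stale, ignore_errors=True)
```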

However, although I can set --max_ckpt_num to a small value and disable --adam_offload to avoid the CPU OOM, I think this is still a bug: in theory, neither --max_ckpt_num nor --adam_offload should increase the CPU memory footprint (the former is designed to limit disk usage and the latter to save CUDA memory).

In fact, a good RL strategy is to save every ckpt and later pick an intermediate one to resume from with new settings (filtering, longer context, etc.). My task at hand needs 4k global steps with one ckpt saved every 200 steps, but the CPU OOM forces me to resume training after each Ray error instead of finishing the run in one go.

shuoyinn avatar Jun 19 '25 12:06 shuoyinn
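One nuance on the second flag: with DeepSpeed's optimizer offload the Adam states do live in host RAM by design, typically in pinned buffers, so a sizable but constant CPU footprint is expected; the suspicious part is only the steady growth. A sketch of the relevant ZeRO config section (values are illustrative):

```python
# Illustrative ZeRO section enabled by an adam_offload-style flag.
ds_config = {
    "zero_optimization": {
        "stage": 2,
        "offload_optimizer": {
            "device": "cpu",     # Adam moments are stored in host RAM
            "pin_memory": True,  # pinned host buffers speed up H2D/D2H transfers
        },
    },
    "train_micro_batch_size_per_gpu": 4,
}
```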

I've updated the issue details in the DeepSpeed GitHub repo: https://github.com/deepspeedai/DeepSpeed/issues/7370

shuoyinn avatar Jun 19 '25 12:06 shuoyinn

I suspect that your machine is caching disk writes in memory.

hijkzzz avatar Jun 19 '25 14:06 hijkzzz
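If that is the case, the orange "cache" growth should be reclaimable. A quick way to check on the node (a sketch; dropping caches requires root and only evicts clean page cache):

```python
import subprocess


def cached_memory_gb() -> float:
    """Read the 'Cached' field (in kB) from /proc/meminfo and return GB."""
    with open("/proc/meminfo") as f:
        meminfo = dict(line.split(":", 1) for line in f)
    return float(meminfo["Cached"].strip().split()[0]) / 1e6


def drop_page_cache() -> None:
    subprocess.run(["sync"], check=True)              # flush dirty pages to disk first
    with open("/proc/sys/vm/drop_caches", "w") as f:  # requires root
        f.write("3\n")


print(f"cached before: {cached_memory_gb():.2f} GB")
drop_page_cache()
print(f"cached after:  {cached_memory_gb():.2f} GB")
```

If the staircase disappears after dropping caches, the growth is just the kernel caching the checkpoint writes, which by itself should not trip the OOM killer; if RSS stays high instead, something is genuinely being retained.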

Hello, I think I was wrong in tracing the error. I checked the training log again and found that it may have nothing to do with CPU OOM (at least not directly), i.e., the increasing RAM footprint is not what triggers the Ray error. Unfortunately, it may be a much more complicated issue.

Here is my log:

[36m(ActorModelRayActor pid=1216208)[0m 
[36m(ActorModelRayActor pid=1216208)[0m 
make_experience:  22%|██▏       | 7/32 [00:05<00:23,  1.05it/s][A
[36m(ActorModelRayActor pid=1216208)[0m 
[36m(ActorModelRayActor pid=1216208)[0m 
make_experience:  25%|██▌       | 8/32 [00:06<00:24,  1.02s/it][A
[36m(ActorModelRayActor pid=1216208)[0m 
[36m(ActorModelRayActor pid=1216208)[0m 
make_experience:  28%|██▊       | 9/32 [00:07<00:18,  1.22it/s][A
[36m(ActorModelRayActor pid=1216208)[0m 
[36m(ActorModelRayActor pid=1216208)[0m 
make_experience:  31%|███▏      | 10/32 [00:07<00:14,  1.48it/s][A
[36m(ActorModelRayActor pid=1216208)[0m 
[36m(ActorModelRayActor pid=1216208)[0m 
make_experience:  34%|███▍      | 11/32 [00:08<00:12,  1.73it/s][A
[36m(ActorModelRayActor pid=1216208)[0m 
[36m(ActorModelRayActor pid=1216208)[0m 
make_experience:  38%|███▊      | 12/32 [00:08<00:10,  1.97it/s][A
[36m(ActorModelRayActor pid=1216208)[0m 
[36m(ActorModelRayActor pid=1216208)[0m 
make_experience:  41%|████      | 13/32 [00:09<00:12,  1.48it/s][A
[36m(ActorModelRayActor pid=1216208)[0m 
[36m(ActorModelRayActor pid=1216208)[0m 
make_experience:  44%|████▍     | 14/32 [00:10<00:15,  1.13it/s][A
[36m(ActorModelRayActor pid=1216208)[0m 
[36m(ActorModelRayActor pid=1216208)[0m 
make_experience:  47%|████▋     | 15/32 [00:11<00:16,  1.04it/s][A
[36m(ActorModelRayActor pid=1216208)[0m 
[36m(ActorModelRayActor pid=1216208)[0m 
make_experience:  50%|█████     | 16/32 [00:13<00:15,  1.00it/s][A
[36m(ActorModelRayActor pid=1216208)[0m 
[36m(ActorModelRayActor pid=1216208)[0m 
make_experience:  53%|█████▎    | 17/32 [00:14<00:15,  1.05s/it][A
[36m(ActorModelRayActor pid=1216208)[0m 
[36m(ActorModelRayActor pid=1216208)[0m 
make_experience:  56%|█████▋    | 18/32 [00:15<00:15,  1.08s/it][A
[36m(ActorModelRayActor pid=1216208)[0m 
[36m(ActorModelRayActor pid=1216208)[0m 
make_experience:  59%|█████▉    | 19/32 [00:16<00:14,  1.08s/it][A
[36m(ActorModelRayActor pid=1216208)[0m 
[36m(ActorModelRayActor pid=1216208)[0m 
make_experience:  62%|██████▎   | 20/32 [00:17<00:12,  1.08s/it][A
[36m(ActorModelRayActor pid=1216208)[0m 
[36m(ActorModelRayActor pid=1216208)[0m 
make_experience:  66%|██████▌   | 21/32 [00:18<00:11,  1.04s/it][A
[36m(ActorModelRayActor pid=1216208)[0m 
[36m(ActorModelRayActor pid=1216208)[0m 
make_experience:  69%|██████▉   | 22/32 [00:19<00:10,  1.03s/it][A
[36m(ActorModelRayActor pid=1216208)[0m 
[36m(ActorModelRayActor pid=1216208)[0m 
make_experience:  72%|███████▏  | 23/32 [00:20<00:09,  1.01s/it][A
[36m(ActorModelRayActor pid=1216208)[0m 
[36m(ActorModelRayActor pid=1216208)[0m 
make_experience:  75%|███████▌  | 24/32 [00:21<00:07,  1.00it/s][A
[36m(ActorModelRayActor pid=1216208)[0m 
[36m(ActorModelRayActor pid=1216208)[0m 
make_experience:  78%|███████▊  | 25/32 [00:22<00:07,  1.06s/it][A
[36m(ActorModelRayActor pid=1216208)[0m 
[36m(ActorModelRayActor pid=1216208)[0m 
make_experience:  81%|████████▏ | 26/32 [00:23<00:06,  1.06s/it][A
[36m(ActorModelRayActor pid=1216208)[0m 
[36m(ActorModelRayActor pid=1216208)[0m 
make_experience:  84%|████████▍ | 27/32 [00:24<00:05,  1.07s/it][A
[36m(ActorModelRayActor pid=1216208)[0m 
[36m(ActorModelRayActor pid=1216208)[0m 
make_experience:  88%|████████▊ | 28/32 [00:25<00:04,  1.06s/it][A
[36m(ActorModelRayActor pid=1216208)[0m 
[36m(ActorModelRayActor pid=1216208)[0m 
make_experience:  91%|█████████ | 29/32 [00:27<00:03,  1.10s/it][A
[36m(ActorModelRayActor pid=1216208)[0m 
[36m(ActorModelRayActor pid=1216208)[0m 
make_experience:  94%|█████████▍| 30/32 [00:28<00:02,  1.13s/it][A
[36m(ActorModelRayActor pid=1216208)[0m 
[36m(ActorModelRayActor pid=1216208)[0m 
make_experience:  97%|█████████▋| 31/32 [00:29<00:01,  1.16s/it][A
[36m(ActorModelRayActor pid=1216208)[0m 
[36m(ActorModelRayActor pid=1216208)[0m 
make_experience: 100%|██████████| 32/32 [00:30<00:00,  1.18s/it][A
make_experience: 100%|██████████| 32/32 [00:30<00:00,  1.04it/s]
[36m(ActorModelRayActor pid=1216208)[0m 
Episode [1/1]:  39%|███▉      | 1318/3382 [44:50:12<79:47:08, 139.16s/it, accuracy_rewards_original=0.656]         
[36m(ActorModelRayActor pid=1216208)[0m 
Episode [1/1]:  39%|███▉      | 1319/3382 [44:50:12<71:51:05, 125.38s/it, accuracy_rewards_original=0.656]
[36m(LLMRayActor pid=1179177, ip=10.136.148.90)[0m 
Processed prompts:   0%|          | 0/64 [00:00<?, ?it/s, est. speed input: 0.00 toks/s, output: 0.00 toks/s]
[36m(LLMRayActor pid=1179176, ip=10.136.148.90)[0m 
Processed prompts:   0%|          | 0/64 [00:00<?, ?it/s, est. speed input: 0.00 toks/s, output: 0.00 toks/s][32m [repeated 13x across cluster][0m
[36m(LLMRayActor pid=1179177, ip=10.136.148.90)[0m 
Processed prompts:   2%|▏         | 1/64 [00:18<19:28, 18.55s/it, est. speed input: 64.00 toks/s, output: 6.85 toks/s]
[36m(LLMRayActor pid=1212157)[0m 
Processed prompts:   0%|          | 0/64 [00:00<?, ?it/s, est. speed input: 0.00 toks/s, output: 0.00 toks/s][32m [repeated 2x across cluster][0m
[36m(LLMRayActor pid=1179177, ip=10.136.148.90)[0m 
Processed prompts:   3%|▎         | 2/64 [00:18<07:58,  7.72s/it, est. speed input: 238.68 toks/s, output: 13.86 toks/s]
[36m(LLMRayActor pid=1179177, ip=10.136.148.90)[0m 
Processed prompts:  12%|█▎        | 8/64 [00:18<01:13,  1.30s/it, est. speed input: 597.01 toks/s, output: 56.48 toks/s]
[36m(LLMRayActor pid=1179177, ip=10.136.148.90)[0m 
Processed prompts:  20%|██        | 13/64 [00:18<00:33,  1.50it/s, est. speed input: 1064.73 toks/s, output: 93.46 toks/s]
Processed prompts:  33%|███▎      | 21/64 [00:19<00:13,  3.15it/s, est. speed input: 2240.62 toks/s, output: 153.59 toks/s]
[36m(LLMRayActor pid=1179177, ip=10.136.148.90)[0m 
Processed prompts:  42%|████▏     | 27/64 [00:19<00:07,  4.70it/s, est. speed input: 3036.92 toks/s, output: 199.05 toks/s]
[36m(LLMRayActor pid=1179177, ip=10.136.148.90)[0m 
Processed prompts:  64%|██████▍   | 41/64 [00:19<00:02,  9.89it/s, est. speed input: 4597.32 toks/s, output: 310.43 toks/s]
Processed prompts:  75%|███████▌  | 48/64 [00:19<00:01, 13.03it/s, est. speed input: 5530.23 toks/s, output: 366.91 toks/s]
[36m(LLMRayActor pid=1179177, ip=10.136.148.90)[0m 
Processed prompts:  86%|████████▌ | 55/64 [00:19<00:00, 16.61it/s, est. speed input: 6357.24 toks/s, output: 423.91 toks/s]
[36m(LLMRayActor pid=1179177, ip=10.136.148.90)[0m 
Processed prompts:  97%|█████████▋| 62/64 [00:20<00:00, 16.88it/s, est. speed input: 7048.50 toks/s, output: 477.63 toks/s]
[36m(LLMRayActor pid=1179177, ip=10.136.148.90)[0m 
Processed prompts: 100%|██████████| 64/64 [00:21<00:00,  3.03it/s, est. speed input: 6909.93 toks/s, output: 474.11 toks/s]
[36m(LLMRayActor pid=1212154)[0m 
Processed prompts:  78%|███████▊  | 50/64 [00:23<00:00, 14.42it/s, est. speed input: 5988.04 toks/s, output: 331.29 toks/s][32m [repeated 18x across cluster][0m
[36m(LLMRayActor pid=1179178, ip=10.136.148.90)[0m 
Processed prompts:   2%|▏         | 1/64 [00:23<24:25, 23.26s/it, est. speed input: 160.85 toks/s, output: 5.59 toks/s]
Processed prompts:   3%|▎         | 2/64 [00:23<09:58,  9.65s/it, est. speed input: 210.53 toks/s, output: 11.29 toks/s][32m [repeated 3x across cluster][0m
[36m(LLMRayActor pid=1211998)[0m 
Processed prompts: 100%|██████████| 64/64 [00:23<00:00, 16.28it/s, est. speed input: 7531.46 toks/s, output: 404.94 toks/s]
Processed prompts: 100%|██████████| 64/64 [00:23<00:00,  2.77it/s, est. speed input: 7531.46 toks/s, output: 404.94 toks/s]
[36m(LLMRayActor pid=1179178, ip=10.136.148.90)[0m 
Processed prompts: 100%|██████████| 64/64 [00:24<00:00,  2.61it/s, est. speed input: 7715.56 toks/s, output: 395.57 toks/s][32m [repeated 5x across cluster][0m
[36m(LLMRayActor pid=1212160)[0m 
Processed prompts:  84%|████████▍ | 54/64 [00:22<00:00, 17.27it/s, est. speed input: 6083.70 toks/s, output: 360.03 toks/s]
Processed prompts:  94%|█████████▍| 60/64 [00:22<00:00, 21.39it/s, est. speed input: 6895.12 toks/s, output: 404.11 toks/s]
[36m(LLMRayActor pid=1212159)[0m Fatal Python error: none_dealloc: deallocating None
[36m(LLMRayActor pid=1212159)[0m Python runtime state: initialized
[36m(LLMRayActor pid=1212159)[0m 
[36m(LLMRayActor pid=1212159)[0m Thread 0x00007f140bfff700 (most recent call first):
[36m(LLMRayActor pid=1212159)[0m   File "/usr/local/lib/python3.10/threading.py", line 324 in wait
[36m(LLMRayActor pid=1212159)[0m   File "/usr/local/lib/python3.10/threading.py", line 600 in wait
[36m(LLMRayActor pid=1212159)[0m   File "/home/tiger/.local/lib/python3.10/site-packages/tqdm/_monitor.py", line 60 in run
[36m(LLMRayActor pid=1212159)[0m   File "/usr/local/lib/python3.10/threading.py", line 1009 in _bootstrap_inner
[36m(LLMRayActor pid=1212159)[0m   File "/usr/local/lib/python3.10/threading.py", line 966 in _bootstrap
[36m(LLMRayActor pid=1212159)[0m 
[36m(LLMRayActor pid=1212159)[0m Thread 0x00007f1615fff700 (most recent call first):
[36m(LLMRayActor pid=1212159)[0m   File "/home/tiger/.local/lib/python3.10/site-packages/vllm/usage/usage_lib.py", line 220 in _report_continous_usage
[36m(LLMRayActor pid=1212159)[0m   File "/home/tiger/.local/lib/python3.10/site-packages/vllm/usage/usage_lib.py", line 163 in _report_usage_worker
[36m(LLMRayActor pid=1212159)[0m   File "/usr/local/lib/python3.10/threading.py", line 946 in run
[36m(LLMRayActor pid=1212159)[0m   File "/usr/local/lib/python3.10/threading.py", line 1009 in _bootstrap_inner
[36m(LLMRayActor pid=1212159)[0m   File "/usr/local/lib/python3.10/threading.py", line 966 in _bootstrap
[36m(LLMRayActor pid=1212159)[0m 
[36m(LLMRayActor pid=1212159)[0m Thread 0x00007f165d64a700 (most recent call first):
[36m(LLMRayActor pid=1212159)[0m   File "/usr/local/lib/python3.10/threading.py", line 324 in wait
[36m(LLMRayActor pid=1212159)[0m   File "/usr/local/lib/python3.10/threading.py", line 600 in wait
[36m(LLMRayActor pid=1212159)[0m   File "/home/tiger/.local/lib/python3.10/site-packages/tqdm/_monitor.py", line 60 in run
[36m(LLMRayActor pid=1212159)[0m   File "/usr/local/lib/python3.10/threading.py", line 1009 in _bootstrap_inner
[36m(LLMRayActor pid=1212159)[0m   File "/usr/local/lib/python3.10/threading.py", line 966 in _bootstrap
[36m(LLMRayActor pid=1212159)[0m 
[36m(LLMRayActor pid=1212159)[0m Current thread 0x00007f485b20b740 (most recent call first):
[36m(LLMRayActor pid=1212159)[0m   File "/home/tiger/.local/lib/python3.10/site-packages/vllm/core/scheduler.py", line 779 in _schedule_running
[36m(LLMRayActor pid=1212159)[0m   File "/home/tiger/.local/lib/python3.10/site-packages/vllm/core/scheduler.py", line 1244 in _schedule_default
[36m(LLMRayActor pid=1212159)[0m   File "/home/tiger/.local/lib/python3.10/site-packages/vllm/core/scheduler.py", line 1445 in _schedule
[36m(LLMRayActor pid=1212159)[0m   File "/home/tiger/.local/lib/python3.10/site-packages/vllm/core/scheduler.py", line 1486 in schedule
[36m(LLMRayActor pid=1212159)[0m   File "/home/tiger/.local/lib/python3.10/site-packages/vllm/engine/llm_engine.py", line 1341 in step
[36m(LLMRayActor pid=1212159)[0m   File "/home/tiger/.local/lib/python3.10/site-packages/vllm/entrypoints/llm.py", line 1397 in _run_engine
[36m(LLMRayActor pid=1212159)[0m   File "/home/tiger/.local/lib/python3.10/site-packages/vllm/entrypoints/llm.py", line 469 in generate
[36m(LLMRayActor pid=1212159)[0m   File "/home/tiger/.local/lib/python3.10/site-packages/vllm/utils.py", line 1057 in inner
[36m(LLMRayActor pid=1212159)[0m   File "/mnt/bn/xxxxxxxxxxx/openrlhf/trainer/ray/vllm_engine.py", line 96 in add_requests
[36m(LLMRayActor pid=1212159)[0m   File "/home/tiger/.local/lib/python3.10/site-packages/ray/util/tracing/tracing_helper.py", line 467 in _resume_span
[36m(LLMRayActor pid=1212159)[0m   File "/home/tiger/.local/lib/python3.10/site-packages/ray/_private/function_manager.py", line 696 in actor_method_executor
[36m(LLMRayActor pid=1212159)[0m   File "/home/tiger/.local/lib/python3.10/site-packages/ray/_private/worker.py", line 922 in main_loop
[36m(LLMRayActor pid=1212159)[0m   File "/home/tiger/.local/lib/python3.10/site-packages/ray/_private/workers/default_worker.py", line 289 in <module>
[36m(LLMRayActor pid=1212159)[0m 
[36m(LLMRayActor pid=1212159)[0m Extension modules: msgpack._cmsgpack, google._upb._message, psutil._psutil_linux, psutil._psutil_posix, setproctitle, yaml._yaml, charset_normalizer.md, requests.packages.charset_normalizer.md, requests.packages.chardet.md, uvloop.loop, ray._raylet, numpy.core._multiarray_umath, numpy.core._multiarray_tests, numpy.linalg._umath_linalg, numpy.fft._pocketfft_internal, numpy.random._common, numpy.random.bit_generator, numpy.random._bounded_integers, numpy.random._mt19937, numpy.random.mtrand, numpy.random._philox, numpy.random._pcg64, numpy.random._sfc64, numpy.random._generator, torch._C, torch._C._dynamo.autograd_compiler, torch._C._dynamo.eval_frame, torch._C._dynamo.guards, torch._C._dynamo.utils, torch._C._fft, torch._C._linalg, torch._C._nested, torch._C._nn, torch._C._sparse, torch._C._special, markupsafe._speedups, PIL._imaging, msgspec._core, PIL._imagingft, av._core, av.logging, av.bytesource, av.buffer, av.audio.format, av.error, av.dictionary, av.container.pyio, av.utils, av.option, av.descriptor, av.format, av.stream, av.container.streams, av.sidedata.motionvectors, av.sidedata.sidedata, av.opaque, av.packet, av.container.input, av.container.output, av.container.core, av.codec.context, av.video.format, av.video.reformatter, av.plane, av.video.plane, av.video.frame, av.video.stream, av.codec.hwaccel, av.codec.codec, av.frame, av.audio.layout, av.audio.plane, av.audio.frame, av.audio.stream, av.filter.pad, av.filter.link, av.filter.context, av.filter.graph, av.filter.filter, av.filter.loudnorm, av.audio.resampler, av.audio.codeccontext, av.audio.fifo, av.bitstream, av.video.codeccontext, sklearn.__check_build._check_build, scipy._lib._ccallback_c, scipy.sparse._sparsetools, _csparsetools, scipy.sparse._csparsetools, scipy.linalg._fblas, scipy.linalg._flapack, scipy.linalg.cython_lapack, scipy.linalg._cythonized_array_utils, scipy.linalg._solve_toeplitz, scipy.linalg._decomp_lu_cython, scipy.linalg._matfuncs_sqrtm_triu, scipy.linalg._matfuncs_expm, scipy.linalg._linalg_pythran, scipy.linalg.cython_blas, scipy.linalg._decomp_update, scipy.sparse.linalg._dsolve._superlu, scipy.sparse.linalg._eigen.arpack._arpack, scipy.sparse.linalg._propack._spropack, scipy.sparse.linalg._propack._dpropack, scipy.sparse.linalg._propack._cpropack, scipy.sparse.linalg._propack._zpropack, scipy.sparse.csgraph._tools, scipy.sparse.csgraph._shortest_path, scipy.sparse.csgraph._traversal, scipy.sparse.csgraph._min_spanning_tree, scipy.sparse.csgraph._flow, scipy.sparse.csgraph._matching, scipy.sparse.csgraph._reordering, scipy.special._ufuncs_cxx, scipy.special._ufuncs, scipy.special._specfun, scipy.special._comb, scipy.special._ellip_harm_2, scipy.spatial._ckdtree, scipy._lib.messagestream, scipy.spatial._qhull, scipy.spatial._voronoi, scipy.spatial._distance_wrap, scipy.spatial._hausdorff, scipy.spatial.transform._rotation, scipy.optimize._group_columns, scipy.optimize._trlib._trlib, scipy.optimize._lbfgsb, _moduleTNC, scipy.optimize._moduleTNC, scipy.optimize._cobyla, scipy.optimize._slsqp, scipy.optimize._minpack, scipy.optimize._lsq.givens_elimination, scipy.optimize._zeros, scipy.optimize._cython_nnls, scipy._lib._uarray._uarray, scipy.linalg._decomp_interpolative, scipy.optimize._bglu_dense, scipy.optimize._lsap, scipy.optimize._direct, scipy.integrate._odepack, scipy.integrate._quadpack, scipy.integrate._vode, scipy.integrate._dop, scipy.integrate._lsoda, scipy.interpolate._fitpack, scipy.interpolate._dfitpack, scipy.interpolate._dierckx, scipy.interpolate._ppoly, 
scipy.interpolate._interpnd, scipy.interpolate._rbfinterp_pythran, scipy.interpolate._rgi_cython, scipy.interpolate._bspl, scipy.special.cython_special, scipy.stats._stats, scipy.stats._sobol, scipy.stats._qmc_cy, scipy.stats._biasedurn, scipy.stats._stats_pythran, scipy.stats._levy_stable.levyst, scipy.stats._ansari_swilk_statistics, scipy.stats._mvn, scipy.stats._rcont.rcont, scipy.ndimage._nd_image, scipy.ndimage._rank_filter_1d, _ni_label, scipy.ndimage._ni_label, pyarrow.lib, pandas._libs.tslibs.ccalendar, pandas._libs.tslibs.np_datetime, pandas._libs.tslibs.dtypes, pandas._libs.tslibs.base, pandas._libs.tslibs.nattype, pandas._libs.tslibs.timezones, pandas._libs.tslibs.fields, pandas._libs.tslibs.timedeltas, pandas._libs.tslibs.tzconversion, pandas._libs.tslibs.timestamps, pandas._libs.properties, pandas._libs.tslibs.offsets, pandas._libs.tslibs.strptime, pandas._libs.tslibs.parsing, pandas._libs.tslibs.conversion, pandas._libs.tslibs.period, pandas._libs.tslibs.vectorized, pandas._libs.ops_dispatch, pandas._libs.missing, pandas._libs.hashtable, pandas._libs.algos, pandas._libs.interval, pandas._libs.lib, pyarrow._compute, pandas._libs.ops, pandas._libs.hashing, pandas._libs.arrays, pandas._libs.tslib, pandas._libs.sparse, pandas._libs.internals, pandas._libs.indexing, pandas._libs.index, pandas._libs.writers, pandas._libs.join, pandas._libs.window.aggregations, pandas._libs.window.indexers, pandas._libs.reshape, pandas._libs.groupby, pandas._libs.json, pandas._libs.parsers, pandas._libs.testing, sklearn.utils._isfinite, sklearn.utils.sparsefuncs_fast, sklearn.utils.murmurhash, sklearn.utils._openmp_helpers, sklearn.metrics.cluster._expected_mutual_info_fast, sklearn.preprocessing._csr_polynomial_expansion, sklearn.preprocessing._target_encoder_fast, sklearn.metrics._dist_metrics, sklearn.metrics._pairwise_distances_reduction._datasets_pair, sklearn.utils._cython_blas, sklearn.metrics._pairwise_distances_reduction._base, sklearn.metrics._pairwise_distances_reduction._middle_term_computer, sklearn.utils._heap, sklearn.utils._sorting, sklearn.metrics._pairwise_distances_reduction._argkmin, sklearn.metrics._pairwise_distances_reduction._argkmin_classmode, sklearn.utils._vector_sentinel, sklearn.metrics._pairwise_distances_reduction._radius_neighbors, sklearn.metrics._pairwise_distances_reduction._radius_neighbors_classmode, sklearn.metrics._pairwise_fast, zmq.backend.cython._zmq, multidict._multidict, yarl._quoting_c, propcache._helpers_c, aiohttp._http_writer, aiohttp._http_parser, aiohttp._websocket.mask, aiohttp._websocket.reader_c, frozenlist._frozenlist, pyarrow._json, regex._regex, pyarrow._parquet, pyarrow._fs, pyarrow._azurefs, pyarrow._hdfs, pyarrow._gcsfs, pyarrow._s3fs, xxhash._xxhash, pyarrow._acero, pyarrow._csv, pyarrow._substrait, pyarrow._dataset, pyarrow._dataset_orc, pyarrow._parquet_encryption, pyarrow._dataset_parquet_encryption, pyarrow._dataset_parquet, vllm.cumem_allocator, sentencepiece._sentencepiece, cuda_utils, __triton_launcher (total: 261)
Traceback (most recent call last):
  File "/usr/local/lib/python3.10/runpy.py", line 196, in _run_module_as_main
    return _run_code(code, main_globals, None,
  File "/usr/local/lib/python3.10/runpy.py", line 86, in _run_code
    exec(code, run_globals)
  File "/mnt/bn/xxxxxxxxxxx/openrlhf/cli/train_ppo_ray.py", line 502, in <module>
    train(args)
  File "/mnt/bn/xxxxxxxxxxx/openrlhf/cli/train_ppo_ray.py", line 184, in train
    ray.get(refs)
  File "/home/tiger/.local/lib/python3.10/site-packages/ray/_private/auto_init_hook.py", line 21, in auto_init_wrapper
    return fn(*args, **kwargs)
  File "/home/tiger/.local/lib/python3.10/site-packages/ray/_private/client_mode_hook.py", line 103, in wrapper
    return func(*args, **kwargs)
  File "/home/tiger/.local/lib/python3.10/site-packages/ray/_private/worker.py", line 2755, in get
    values, debugger_breakpoint = worker.get_objects(object_refs, timeout=timeout)
  File "/home/tiger/.local/lib/python3.10/site-packages/ray/_private/worker.py", line 906, in get_objects
    raise value.as_instanceof_cause()
ray.exceptions.RayTaskError(ActorDiedError): [36mray::ActorModelRayActor.fit()[39m (pid=1216365, ip=10.136.144.146, actor_id=64188146f2db2acdb6b823fc03000000, repr=<openrlhf.trainer.ray.ppo_actor.ActorModelRayActor object at 0x7f6dad7f6140>)
  File "/mnt/bn/xxxxxxxxxxx/openrlhf/trainer/ray/ppo_actor.py", line 563, in fit
  File "/mnt/bn/xxxxxxxxxxx/openrlhf/trainer/ppo_trainer.py", line 277, in fit
    )
  File "/usr/local/lib/python3.10/site-packages/torch/utils/_contextlib.py", line 116, in decorate_context
    return func(*args, **kwargs)
  File "/mnt/bn/xxxxxxxxxxx/openrlhf/trainer/ppo_utils/experience_maker.py", line 757, in make_experience_list
    }
  File "/usr/local/lib/python3.10/site-packages/torch/utils/_contextlib.py", line 116, in decorate_context
    return func(*args, **kwargs)
  File "/mnt/bn/xxxxxxxxxxx/openrlhf/trainer/ppo_utils/experience_maker.py", line 221, in make_experience_list
    )
  File "/usr/local/lib/python3.10/site-packages/torch/utils/_contextlib.py", line 116, in decorate_context
    return func(*args, **kwargs)
  File "/mnt/bn/xxxxxxxxxxx/openrlhf/trainer/ppo_utils/experience_maker.py", line 781, in generate_samples
  File "/mnt/bn/xxxxxxxxxxx/openrlhf/trainer/ppo_utils/experience_maker.py", line 1091, in _generate_vllm
    for p, imgs in zip(prompts, images)
ray.exceptions.ActorDiedError: The actor died unexpectedly before finishing this task.
	class_name: LLMRayActor
	actor_id: de8613dba9b51c9cda24247003000000
	pid: 1212159
	namespace: 93a34994-e0c1-40a6-b2f7-7029ebecaea1
	ip: 10.136.144.146
The actor is dead because its worker process has died. Worker exit type: SYSTEM_ERROR Worker exit detail: Worker unexpectedly exits with a connection error code 2. End of file. There are some potential root causes. (1) The process is killed by SIGKILL by OOM killer due to high memory usage. (2) ray stop --force is called. (3) The worker is crashed unexpectedly due to SIGSEGV or other unexpected errors.
[36m(LLMRayActor pid=1212158)[0m 
Processed prompts:  56%|█████▋    | 36/64 [00:27<00:05,  5.54it/s, est. speed input: 4297.07 toks/s, output: 190.02 toks/s][32m [repeated 33x across cluster][0m
[36m(LLMRayActor pid=1212160)[0m 
Processed prompts: 100%|██████████| 64/64 [00:23<00:00,  2.74it/s, est. speed input: 7246.53 toks/s, output: 429.06 toks/s]
[36m(LLMRayActor pid=1179173, ip=10.136.148.90)[0m 
Processed prompts:  12%|█▎        | 8/64 [00:24<01:24,  1.52s/it, est. speed input: 1050.93 toks/s, output: 44.08 toks/s][32m [repeated 5x across cluster][0m
[36m(LLMRayActor pid=1179180, ip=10.136.148.90)[0m update weight: visual.blocks.0.norm1.weight, dtype: torch.bfloat16, shape: torch.Size([1280])
[36m(LLMRayActor pid=1179180, ip=10.136.148.90)[0m update weight: visual.blocks.0.norm2.weight, dtype: torch.bfloat16, shape: torch.Size([1280])
[36m(LLMRayActor pid=1179180, ip=10.136.148.90)[0m update weight: visual.blocks.0.attn.qkv.weight, dtype: torch.bfloat16, shape: torch.Size([3840, 1280])
[36m(LLMRayActor pid=1179180, ip=10.136.148.90)[0m update weight: visual.blocks.0.attn.qkv.bias, dtype: torch.bfloat16, shape: torch.Size([3840])
[36m(LLMRayActor pid=1212157)[0m INFO 06-20 01:33:47 executor_base.py:219] It took 3.501426 seconds to wake up.[32m [repeated 15x across cluster][0m
[36m(LLMRayActor pid=1179179, ip=10.136.148.90)[0m update weight: visual.patch_embed.proj.weight, dtype: torch.bfloat16, shape: torch.Size([1280, 3, 2, 14, 14])[32m [repeated 15x across cluster][0m
[36m(LLMRayActor pid=1212158)[0m update weight: visual.blocks.1.mlp.gate_proj.weight, dtype: torch.bfloat16, shape: torch.Size([3420, 1280])[32m [repeated 268x across cluster][0m
[36m(LLMRayActor pid=1179176, ip=10.136.148.90)[0m 
[36m(LLMRayActor pid=1211998)[0m 
[36m(LLMRayActor pid=1179180, ip=10.136.148.90)[0m update weight: visual.merger.ln_q.weight, dtype: torch.bfloat16, shape: torch.Size([1280])
[36m(LLMRayActor pid=1179180, ip=10.136.148.90)[0m update weight: model.embed_tokens.weight, dtype: torch.bfloat16, shape: torch.Size([152064, 3584])
[36m(LLMRayActor pid=1212155)[0m update weight: model.layers.1.mlp.down_proj.weight, dtype: torch.bfloat16, shape: torch.Size([3584, 18944])[32m [repeated 6153x across cluster][0m
[36m(LLMRayActor pid=1212158)[0m update weight: visual.merger.ln_q.weight, dtype: torch.bfloat16, shape: torch.Size([1280])[32m [repeated 15x across cluster][0m
[36m(LLMRayActor pid=1212158)[0m update weight: model.embed_tokens.weight, dtype: torch.bfloat16, shape: torch.Size([152064, 3584])[32m [repeated 15x across cluster][0m
[36m(LLMRayActor pid=1212155)[0m update weight: model.layers.4.self_attn.k_proj.weight, dtype: torch.bfloat16, shape: torch.Size([512, 3584])[32m [repeated 584x across cluster][0m
[36m(LLMRayActor pid=1179180, ip=10.136.148.90)[0m update weight: model.layers.26.input_layernorm.weight, dtype: torch.bfloat16, shape: torch.Size([3584])[32m [repeated 4353x across cluster][0m
[36m(LLMRayActor pid=1179175, ip=10.136.148.90)[0m update weight: model.norm.weight, dtype: torch.bfloat16, shape: torch.Size([3584])
[36m(LLMRayActor pid=1179180, ip=10.136.148.90)[0m update weight: lm_head.weight, dtype: torch.bfloat16, shape: torch.Size([152064, 3584])
[36m(LLMRayActor pid=1179174, ip=10.136.148.90)[0m INFO 06-20 01:34:09 worker.py:133] Sleep mode freed 46.46 GiB memory, 22.76 GiB memory is still in use.
[36m(LLMRayActor pid=1179174, ip=10.136.148.90)[0m INFO 06-20 01:34:09 executor_base.py:208] It took 1.533544 seconds to fall asleep.
[36m(LLMRayActor pid=1179180, ip=10.136.148.90)[0m INFO 06-20 01:34:11 executor_base.py:219] It took 1.557426 seconds to wake up.
[36m(LLMRayActor pid=1179179, ip=10.136.148.90)[0m update weight: model.layers.27.post_attention_layernorm.weight, dtype: torch.bfloat16, shape: torch.Size([3584])[32m [repeated 222x across cluster][0m
[36m(LLMRayActor pid=1179179, ip=10.136.148.90)[0m update weight: model.norm.weight, dtype: torch.bfloat16, shape: torch.Size([3584])[32m [repeated 15x across cluster][0m
[36m(LLMRayActor pid=1212158)[0m update weight: lm_head.weight, dtype: torch.bfloat16, shape: torch.Size([152064, 3584])[32m [repeated 15x across cluster][0m
[36m(LLMRayActor pid=1179180, ip=10.136.148.90)[0m INFO 06-20 01:35:00 worker.py:133] Sleep mode freed 46.46 GiB memory, 25.07 GiB memory is still in use.[32m [repeated 16x across cluster][0m
[36m(LLMRayActor pid=1179180, ip=10.136.148.90)[0m INFO 06-20 01:35:00 executor_base.py:208] It took 1.735670 seconds to fall asleep.[32m [repeated 16x across cluster][0m
[36m(LLMRayActor pid=1212158)[0m INFO 06-20 01:34:14 executor_base.py:219] It took 3.758021 seconds to wake up.[32m [repeated 15x across cluster][0m
[36m(ActorModelRayActor pid=1183375, ip=10.136.148.90)[0m Filtered 28 experiences.
[36m(LLMRayActor pid=1212158)[0m INFO 06-20 01:35:00 worker.py:133] Sleep mode freed 46.35 GiB memory, 26.85 GiB memory is still in use.[32m [repeated 15x across cluster][0m
[36m(LLMRayActor pid=1212158)[0m INFO 06-20 01:35:00 executor_base.py:208] It took 2.326313 seconds to fall asleep.[32m [repeated 15x across cluster][0m
[36m(ActorModelRayActor pid=1216364)[0m Filtered 24 experiences.[32m [repeated 7x across cluster][0m
[36m(ActorModelRayActor pid=1216366)[0m Filtered 32 experiences.[32m [repeated 7x across cluster][0m
[36m(LLMRayActor pid=1179180, ip=10.136.148.90)[0m INFO 06-20 01:35:38 executor_base.py:219] It took 1.423191 seconds to wake up.
[36m(ActorModelRayActor pid=1216365)[0m Filtered 28 experiences.
[36m(LLMRayActor pid=1179174, ip=10.136.148.90)[0m INFO 06-20 01:36:37 worker.py:133] Sleep mode freed 46.46 GiB memory, 21.74 GiB memory is still in use.
[36m(LLMRayActor pid=1179174, ip=10.136.148.90)[0m INFO 06-20 01:36:37 executor_base.py:208] It took 1.559335 seconds to fall asleep.
[36m(LLMRayActor pid=1212158)[0m INFO 06-20 01:35:39 executor_base.py:219] It took 2.926714 seconds to wake up.[32m [repeated 15x across cluster][0m
[36m(ActorModelRayActor pid=1183372, ip=10.136.148.90)[0m Filtered 28 experiences.
[36m(LLMRayActor pid=1212157)[0m INFO 06-20 01:36:37 worker.py:133] Sleep mode freed 46.35 GiB memory, 26.10 GiB memory is still in use.[32m [repeated 15x across cluster][0m
[36m(LLMRayActor pid=1212157)[0m INFO 06-20 01:36:37 executor_base.py:208] It took 2.026004 seconds to fall asleep.[32m [repeated 15x across cluster][0m
[36m(ActorModelRayActor pid=1183368, ip=10.136.148.90)[0m Filtered 28 experiences.[32m [repeated 5x across cluster][0m
[36m(LLMRayActor pid=1179173, ip=10.136.148.90)[0m INFO 06-20 01:37:11 executor_base.py:219] It took 1.462464 seconds to wake up.
[36m(ActorModelRayActor pid=1216360)[0m Filtered 32 experiences.[32m [repeated 10x across cluster][0m
[33m(raylet)[0m A worker died or was killed while executing a task by an unexpected system error. To troubleshoot the problem, check the logs for the dead worker. RayTask ID: ffffffffffffffffde8613dba9b51c9cda24247003000000 Worker ID: 668d502f679d43e30ddd8d92ad043a22430f6d1edd8aa30f0b35886d Node ID: 5bc88bae137e40d6bfe591eb2fe9ee55a174696302cb0c9d2f627f69 Worker IP address: 10.136.144.146 Worker port: 11014 Worker PID: 1212159 Worker exit type: SYSTEM_ERROR Worker exit detail: Worker unexpectedly exits with a connection error code 2. End of file. There are some potential root causes. (1) The process is killed by SIGKILL by OOM killer due to high memory usage. (2) ray stop --force is called. (3) The worker is crashed unexpectedly due to SIGSEGV or other unexpected errors.
[36m(LLMRayActor pid=1212158)[0m INFO 06-20 01:37:13 executor_base.py:219] It took 3.511680 seconds to wake up.[32m [repeated 15x across cluster][0m

I've quoted this Ray error log before: The actor is dead because its worker process has died. Worker exit type: SYSTEM_ERROR Worker exit detail: Worker unexpectedly exits with a connection error code 2. End of file. There are some potential root causes. (1) The process is killed by SIGKILL by OOM killer due to high memory usage. (2) ray stop --force is called. (3) The worker is crashed unexpectedly due to SIGSEGV or other unexpected errors.

These days I have been trying hard to figure out why the CPU OOMs, but maybe the real culprit is vLLM sleep. Similar issues: https://github.com/OpenRLHF/OpenRLHF/issues/1052 (OpenRLHF) and https://github.com/modelscope/ms-swift/issues/4353 (ms-swift).

shuoyinn avatar Jun 20 '25 04:06 shuoyinn
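For context, a minimal sketch of the sleep/wake cycle driven between rollout and training, based on vLLM's offline LLM API with sleep mode (the model name is a placeholder). At sleep level 1 the weights are parked in host RAM, which is another consumer of CPU memory worth keeping in mind here:

```python
from vllm import LLM, SamplingParams

llm = LLM(model="Qwen/Qwen2.5-7B-Instruct", enable_sleep_mode=True)  # placeholder model

outputs = llm.generate(["example prompt"], SamplingParams(max_tokens=256))

llm.sleep(level=1)   # free the GPU KV cache and offload weights to CPU RAM
# ... weight update / training phase runs here ...
llm.wake_up()        # move weights back to the GPU before the next rollout
```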

I have tried three runs, and in every run this error occurred at some step between 1k and 1.5k.

shuoyinn avatar Jun 20 '25 04:06 shuoyinn

Try vLLM 0.9.1?

hijkzzz avatar Jun 20 '25 05:06 hijkzzz

Try vLLM 0.9.1?

In those issues, the vLLM versions mentioned include 0.8.5.post1 and 0.8.4. Actually, 0.9.1 is too new for my environment, considering the torch version (and some other packages) that vLLM 0.9.1 requires.

I'm afraid I can't conduct the experiment immediately. But I will keep following up.

shuoyinn avatar Jun 20 '25 06:06 shuoyinn

Try vLLM 0.9.1?

In those issues, the vLLM versions mentioned include 0.8.5.post1 and 0.8.4. Actually, 0.9.1 is too new for my environment, considering the torch version (and some other packages) that vLLM 0.9.1 requires.

I'm afraid I can't conduct the experiment immediately. But I will keep following up.

Hi, have you found a solution? I tried another dataset and the same problem showed up...

runrunliuliu avatar Jun 22 '25 08:06 runrunliuliu

Try vLLM 0.9.1?

In those issues, the vLLM versions mentioned include 0.8.5.post1 and 0.8.4. Actually, 0.9.1 is too new for my environment, considering the torch version (and some other packages) that vLLM 0.9.1 requires. I'm afraid I can't conduct the experiment immediately. But I will keep following up.

Hi, have you found a solution? I tried another dataset and the same problem showed up...

I still cannot solve this issue. For now I just compromise by resuming the training every time the error occurs...

shuoyinn avatar Jun 23 '25 02:06 shuoyinn

https://github.com/modelscope/ms-swift/pull/4770

It seems swift proposed a solution; btw, I did not try it...

runrunliuliu avatar Jul 02 '25 02:07 runrunliuliu

modelscope/ms-swift#4770

It seems swift proposed a solution; btw, I did not try it...

Well, in swift they encountered this issue with gc.collect(), and the solution proposed there is to remove it. However, I don't use this function in my own code and still encountered the issue, so I don't think that fix applies here.

It happened again today... I checked my runtime CPU memory and I cannot tell whether it is normal; maybe it is related to CPU OOM (offload).

Image

shuoyinn avatar Jul 02 '25 02:07 shuoyinn