
disk space usage problem

Open showkeyjar opened this issue 2 years ago • 11 comments

I found one problem:

When I use xgboost_ray to train multiple models on Linux, the size of the "/tmp/ray/" directory keeps growing.

If the training data is large, the system runs out of disk space quickly.

I tried to fix it by running "rm -rf /tmp/ray/", but then the training process got stuck in an endless loop, waiting for a Ray actor forever.

I guess "import xgboost_ray" may do some init for ray,

so I add "import importlib" and try to "importlib.reload('xgboost_ray')", but it not work.

please check this issue.

showkeyjar avatar Jan 13 '23 02:01 showkeyjar

cc @matthewdeng what's the best way to debug object store memory usage for xgboost on ray?

@showkeyjar I think your workload has high object store usage which triggers spilling https://docs.ray.io/en/master/ray-core/objects/object-spilling.html.

When your disk usage keeps increasing, what's the output of ray memory --stats-only?
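For context, a minimal sketch of redirecting spilled objects to a larger disk using the filesystem spilling backend described in the linked docs (the directory path below is a placeholder, not from this thread):

```python
import json

import ray

# sketch: spill objects that don't fit in the object store to a larger disk
# ("/mnt/big_disk/ray_spill" is a placeholder path)
ray.init(
    _system_config={
        "object_spilling_config": json.dumps(
            {"type": "filesystem", "params": {"directory_path": "/mnt/big_disk/ray_spill"}}
        )
    }
)
```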

rkooo567 avatar Jan 13 '23 11:01 rkooo567

@showkeyjar do you have a repro for this? How much training data are you loading and how much disk space are you seeing consumed?

matthewdeng avatar Jan 13 '23 16:01 matthewdeng

Are you using Ray Datasets? There's an issue with xgboost-ray we are working on currently that causes the data to be loaded in a suboptimal manner, causing too much object store usage.

Yard1 avatar Jan 13 '23 17:01 Yard1

Thanks for all your advice.

@rkooo567 ray memory --stats-only cannot find any running Ray instance; it fails with: ConnectionError: Could not find any running Ray instance. Please specify the one to connect to by setting the --address flag or RAY_ADDRESS environment variable.

@matthewdeng 1,395,642 training samples, 20 boosting rounds, about 15 GB of disk usage.

train code is here: https://github.com/showkeyjar/mymodel/blob/main/train_model_ray.py

@Yard1 no, I convert a pandas DataFrame into a Ray Dataset.
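A minimal sketch of that conversion (the column names here are hypothetical, not taken from the linked training script):

```python
import pandas as pd
import ray
from xgboost_ray import RayDMatrix

# hypothetical toy data standing in for the real training DataFrame
df = pd.DataFrame({"feature_a": [1.0, 2.0, 3.0], "label": [0, 1, 0]})

# convert the pandas DataFrame to a Ray Dataset, then wrap it for xgboost_ray
ds = ray.data.from_pandas(df)
dtrain = RayDMatrix(ds, label="label")
```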

showkeyjar avatar Jan 16 '23 04:01 showkeyjar

I alleviated the problem by using a shell for-loop script to call the Python training code, but I still don't know why a Python for loop causes the disk usage to increase.

And I'm sure that the disk increase happens in the /tmp/ray/ directory.
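A sketch of the same workaround expressed in Python instead of shell: each model is trained in a fresh process so Ray's temp files are released when that process exits (the --model-id argument and the range are hypothetical; adapt them to train_model_ray.py's actual CLI):

```python
import subprocess

# run each training in its own process, mirroring the shell for-loop workaround
for model_id in range(10):
    subprocess.run(
        ["python", "train_model_ray.py", "--model-id", str(model_id)],
        check=True,
    )
```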

showkeyjar avatar Jan 16 '23 05:01 showkeyjar

Ray is using a mechanism called object spilling, where objects that cannot fit into the memory object store are instead put on disk. Can you run the ray memory --stats-only command in a separate terminal window while the xgboost-ray training is in progress?

Also, are you running this on a single machine, or multiple machines?

Yard1 avatar Jan 16 '23 05:01 Yard1

@Yard1

======== Object references status: 2023-01-16 15:19:13.215008 ========
--- Aggregate object store stats across all nodes ---
Plasma memory usage 67279 MiB, 40 objects, 62.69% full, 43.41% needed
Objects consumed by Ray tasks: 67281 MiB.

showkeyjar avatar Jan 16 '23 07:01 showkeyjar

I'm disappointed that this issue has not been solved yet, but I found some new information:

  1. Ray stores its temp files in the /tmp/ray/session_{datetime}_XXXX_XXXX/ directory, so if we can get the Ray session dir, we can remove the temp files when xgboost_ray training finishes.
  2. Ray can take a specific _temp_dir at init time, but that still has a bug; once the bug is fixed, we can point training at another temp dir (see the sketch below).

Hope this helps.
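A minimal sketch of both options on a single machine (the _temp_dir path is a placeholder, and removing the session directory is only safe after ray.shutdown()):

```python
import os
import shutil

import ray

# Option 2: point Ray's temp/session files at a larger disk (placeholder path)
# ray.init(_temp_dir="/mnt/big_disk/ray_tmp")

# Option 1: use the default temp dir and clean it up after training
ray.init()
# ... run xgboost_ray training here ...

# /tmp/ray/session_latest is a symlink to the real session_{datetime}_XXXX_XXXX dir,
# so resolve it first, shut Ray down, then remove the finished session directory
session_dir = os.path.realpath("/tmp/ray/session_latest")
ray.shutdown()
shutil.rmtree(session_dir, ignore_errors=True)
```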

showkeyjar avatar Mar 30 '23 06:03 showkeyjar

Based on your output ^, it looks like spilling doesn't actually happen. I guess most of the disk usage is from Ray logs?

rkooo567 avatar Mar 30 '23 15:03 rkooo567

Is it correct the disk usage is mostly from /tmp/ray/session_latest/logs/?

rkooo567 avatar Mar 30 '23 15:03 rkooo567

> Is it correct the disk usage is mostly from /tmp/ray/session_latest/logs/?

Yes, it creates a link from /tmp/ray/session_latest/ to /tmp/ray/session_{datetime}_XXXX_XXXX/.

showkeyjar avatar Mar 31 '23 07:03 showkeyjar