
My deepspeed code is very slow

zhaowei-wang-nlp opened this issue 2 years ago · 21 comments

2 pytorch allocator cache flushes since last step. this happens when there is high memory pressure and is detrimental to performance. if this is happening frequently consider adjusting settings to reduce memory consumption. If you are unable to make the cache flushes go away consider adding torch.cuda.empty_cache() calls in your training loop to ensure that all ranks flush their caches at the same time

Hi everyone, I am using ZeRO stage 3 and I see the above message at every step. The training process is very slow. How should I change my config to speed it up? My config:

```json
{
  "fp16": {
    "enabled": "auto",
    "loss_scale": 0,
    "loss_scale_window": 1000,
    "initial_scale_power": 16,
    "hysteresis": 2,
    "min_loss_scale": 1
  },
  "zero_optimization": {
    "stage": 3,
    "overlap_comm": true,
    "contiguous_gradients": true,
    "sub_group_size": 5e8,
    "reduce_bucket_size": "auto",
    "stage3_prefetch_bucket_size": "auto",
    "stage3_param_persistence_threshold": "auto",
    "stage3_max_live_parameters": 5e8,
    "stage3_max_reuse_distance": 5e8,
    "stage3_gather_fp16_weights_on_model_save": true
  },
  "gradient_accumulation_steps": "auto",
  "gradient_clipping": "auto",
  "steps_per_print": 2000,
  "train_batch_size": "auto",
  "train_micro_batch_size_per_gpu": "auto",
  "wall_clock_breakdown": false
}
```
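A minimal sketch of the workaround the warning itself suggests: calling `torch.cuda.empty_cache()` once per training step so that all ranks flush their allocator caches at the same point. The `engine` and `train_loader` names below are placeholders, assuming a model already wrapped by `deepspeed.initialize`.

```python
import torch

# Hypothetical training loop: `engine` is the model engine returned by
# deepspeed.initialize() and `train_loader` is an ordinary DataLoader.
for step, batch in enumerate(train_loader):
    loss = engine(**batch).loss   # forward
    engine.backward(loss)         # backward, handled by the DeepSpeed engine
    engine.step()                 # optimizer step / ZeRO partitioned update

    # Explicitly flush the CUDA caching allocator on every rank at the same
    # point, as the warning suggests, instead of letting each rank flush
    # implicitly (and at different times) under memory pressure.
    torch.cuda.empty_cache()
```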

zhaowei-wang-nlp avatar Mar 27 '22 08:03 zhaowei-wang-nlp

same problem

CaralHsi avatar Apr 13 '23 08:04 CaralHsi

PyTorch allocator cache flushes are very expensive, and they indicate severe memory pressure. Can you try reducing the batch size?
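As an illustration of that suggestion (the values here are only examples, not a tested recipe): pin `train_micro_batch_size_per_gpu` to a small explicit number instead of `"auto"`, and recover the effective batch size through gradient accumulation.

```python
import deepspeed

# Illustrative config fragment: a small explicit micro batch size plus
# gradient accumulation; the values that actually fit depend on model size
# and GPU memory, so treat these numbers as placeholders.
ds_config = {
    "train_micro_batch_size_per_gpu": 1,
    "gradient_accumulation_steps": 16,
    "fp16": {"enabled": True},
    "zero_optimization": {"stage": 3, "overlap_comm": True, "contiguous_gradients": True},
}

# `model` is a placeholder for the user's model.
engine, optimizer, _, _ = deepspeed.initialize(
    model=model,
    model_parameters=model.parameters(),
    config=ds_config,
)
```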

tjruwase avatar Apr 13 '23 11:04 tjruwase

same problem

YiAthena avatar Apr 18 '23 02:04 YiAthena

same problem

zhangyanbo2007 avatar May 08 '23 14:05 zhangyanbo2007

same problem

zhangyanbo2007 avatar May 08 '23 15:05 zhangyanbo2007

Same issue here, any updates?

joanrod avatar Jun 26 '23 14:06 joanrod

same problem

lusongshuo-mt avatar Jul 05 '23 11:07 lusongshuo-mt

👀

iamlockelightning avatar Jul 17 '23 11:07 iamlockelightning

same

teaguexiao avatar Aug 04 '23 06:08 teaguexiao

Any update on this issue? I am using PyTorch Lightning. Originally I thought it was because I was passing too many things at each step, but after changing that, the problem is still there.

I have tried reducing the batch size and also setting pin_memory to False per https://discuss.pytorch.org/t/when-to-set-pin-memory-to-true/19723 (some PyTorch versions have that issue), but with no luck.
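For anyone trying the same workaround, a minimal sketch of the `pin_memory` change on a plain PyTorch `DataLoader` (the dataset and batch size are placeholders):

```python
from torch.utils.data import DataLoader

# `train_dataset` is a placeholder for whatever Dataset is being trained on.
# pin_memory=False skips page-locked host staging buffers, which is the
# workaround discussed in the linked PyTorch forum thread.
train_loader = DataLoader(
    train_dataset,
    batch_size=4,
    shuffle=True,
    num_workers=2,
    pin_memory=False,
)
```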

dnaihao avatar Aug 10 '23 15:08 dnaihao

I used 8xA100 with the same settings and the message went away.

teaguexiao avatar Aug 11 '23 02:08 teaguexiao

Thanks @teaguexiao, I will try using more GPUs (though ours are A40s with 48 GB of memory each) to see if that helps. Thanks for sharing!

dnaihao avatar Aug 11 '23 02:08 dnaihao

I use 8x A100 40G. After reducing the batch size and waiting about 20 minutes, the message no longer appears.

bingwork avatar Dec 06 '23 08:12 bingwork

same problem

wulaoshi avatar Jan 09 '24 07:01 wulaoshi

same problem here

  • torch==2.2.1
  • transformers==4.38.2
  • tokenizers==0.15.2
  • huggingface-hub==0.21.3
  • bitsandbytes==0.42.0
  • cloudpickle==3.0.0
  • accelerate==0.27.1
  • flash-attn==2.5.6
  • deepspeed==0.13.4
  • datasets==2.17.0
  • loralib==0.1.2
  • einops==0.7.0
  • peft==0.9.0
  • trl==0.7.10

Using DeepspeedTorchDistributor in Databricks, loading the model with flash-attn 2.

achangtv avatar Mar 08 '24 21:03 achangtv

Same issue here. I am running on 8 AMD MI250X GPUs with 128 GB VRAM.

ed-00 avatar Mar 25 '24 12:03 ed-00

same problem.

Sander-houqi avatar Apr 25 '24 07:04 Sander-houqi

Same problem using 8 V100 GPUs.

zimenglan-sysu-512 avatar Jul 15 '24 11:07 zimenglan-sysu-512

Same problem. I would like to know whether this issue degrades the model's quality or only affects training efficiency.

heya5 avatar Jul 18 '24 05:07 heya5

Same issue here: torch==2.1.0.dev20230424+cu117, deepspeed==0.11.0.

absorbguo avatar Aug 03 '24 11:08 absorbguo