DeepSpeedExamples My deepspeed code is very slow

2 pytorch allocator cache flushes since last step. this happens when there is high memory pressure and is detrimental to performance. if this is happening frequently consider adjusting settings to reduce memory consumption. If you are unable to make the cache flushes go away consider adding torch.cuda.empty_cache() calls in your training loop to ensure that all ranks flush their caches at the same time

Hi everyone, I am using Zero 3-stage. I can see the above message every step. The training process is very slow. How to change my config to speed up? My config: { "fp16": { "enabled": "auto", "loss_scale": 0, "loss_scale_window": 1000, "initial_scale_power": 16, "hysteresis": 2, "min_loss_scale": 1 }, "zero_optimization": { "stage": 3, "overlap_comm": true, "contiguous_gradients": true, "sub_group_size": 5e8, "reduce_bucket_size": "auto", "stage3_prefetch_bucket_size": "auto", "stage3_param_persistence_threshold": "auto", "stage3_max_live_parameters": 5e8, "stage3_max_reuse_distance": 5e8, "stage3_gather_fp16_weights_on_model_save": true }, "gradient_accumulation_steps": "auto", "gradient_clipping": "auto", "steps_per_print": 2000, "train_batch_size": "auto", "train_micro_batch_size_per_gpu": "auto", "wall_clock_breakdown": false }

Mar 27 '22 08:03 zhaowei-wang-nlp

same problem

Apr 13 '23 08:04 CaralHsi

Pytorch allocator cache flushes are very expensive but indicate severe memory pressure. Can you try reducing batch size?

Apr 13 '23 11:04 tjruwase

same problem

Apr 18 '23 02:04 YiAthena

same problem

May 08 '23 14:05 zhangyanbo2007

same problem

May 08 '23 15:05 zhangyanbo2007

Same issue here, any updates?

Jun 26 '23 14:06 joanrod

same problem

Jul 05 '23 11:07 lusongshuo-mt

👀

Jul 17 '23 11:07 iamlockelightning

same

Aug 04 '23 06:08 teaguexiao

Any update on this issue? I am using Pytorch Lightning, originally I thought it is because I am passing too many things for each step, but after I change those, the problem is still there.

I have tried reducing the batch size, and also changing the pin_memory to False according to https://discuss.pytorch.org/t/when-to-set-pin-memory-to-true/19723 (some pytorch version has that issue), but with no luck.