sujithjoseph

28 comments by sujithjoseph

Does it help if I increase gradient accumulation steps from 1 to 4? Will it improve model accuracy, since I may be able to fit a larger effective batch size?
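For context, a minimal sketch (all numbers hypothetical) of how gradient accumulation grows the effective batch size without increasing per-micro-batch activation memory:

```
# Hypothetical numbers: gradient accumulation multiplies the effective batch
# size seen by the optimizer without growing per-micro-batch activation memory.
per_device_batch = 8     # micro-batch that fits on one GPU
grad_accum_steps = 4     # raised from 1 to 4
num_gpus = 2

effective_batch = per_device_batch * grad_accum_steps * num_gpus
print(effective_batch)   # 64 samples contribute to each optimizer step
```

Any accuracy effect comes from this larger effective batch (and the schedule seeing fewer optimizer steps), not from accumulation itself.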

The error also appears with fp16 instead of bf16 in the DeepSpeed config, and likewise with zero3_init_flag: false in the accelerate config with DeepSpeed.

With Stage 2 and no offloads, I get a different error:

```
/opt/conda/lib/python3.7/site-packages/deepspeed/runtime/zero/stage_1_and_2.py:323 in __init__

  320 │   self.flatten_dense_tensors_aligned(
  321 ...
```

I was able to sort it out using the accelerate + DeepSpeed config below. Now I'm dealing with an OOM issue, but I'm not sure why the previous DeepSpeed config didn't work ```...

How can we estimate the number of GPUs needed (each with 40 GB) for flan-t5-11b with CPU param/optimizer offloading? 0.49GB | offload_param=cpu, offload_optimizer=cpu, zero_init=1...
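The quoted figures look like a row from DeepSpeed's ZeRO-3 memory estimator; a sketch of how it is typically invoked (the checkpoint name and GPU counts here are assumptions):

```
from transformers import AutoModelForSeq2SeqLM
from deepspeed.runtime.zero.stage3 import estimate_zero3_model_states_mem_needs_all_live

# flan-t5-xxl is the ~11B checkpoint; adjust num_gpus_per_node / num_nodes
# to your cluster. The estimator prints per-GPU and CPU memory for each
# offload_param / offload_optimizer / zero_init combination.
model = AutoModelForSeq2SeqLM.from_pretrained("google/flan-t5-xxl")
estimate_zero3_model_states_mem_needs_all_live(model, num_gpus_per_node=4, num_nodes=1)
```

Note the estimator covers only model and optimizer states, not activations or temporary buffers, so it gives a lower bound.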

With batch size 1, it works without OOM. How can we estimate the number of GPUs needed for a batch size of 4 or 8 without trial and error? With batch...
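One rough way to avoid trial and error, sketched below with hypothetical numbers: treat model/optimizer state as a fixed cost (per the estimator) and scale only the activation term, which grows roughly linearly with batch size:

```
# All values hypothetical: measure the batch-size-1 footprint once, then
# extrapolate the activation term linearly with batch size.
gpu_total_gb = 40.0
model_state_gb = 30.0           # fixed cost observed at batch size 1
activation_gb_per_sample = 5.0  # extra memory the first sample consumed

for bs in (1, 2, 4, 8):
    needed = model_state_gb + activation_gb_per_sample * bs
    print(f"batch={bs}: ~{needed:.0f} GB -> {'fits' if needed <= gpu_total_gb else 'OOM'}")
```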

With the following DeepSpeed config

```
deepspeed_config:
  gradient_accumulation_steps: 2
  gradient_clipping: 1.0
  offload_optimizer_device: cpu
  offload_param_device: cpu
  zero3_init_flag: true
  zero3_save_16bit_model: true
  zero_stage: 3
bf16:
  enabled: true
```

and torch.backends.cuda.matmul.allow_tf32 = True, torch.backends.cudnn.allow_tf32 =...
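For reference, since the snippet above is truncated: TF32 is usually toggled with the two backend flags together (a sketch, not necessarily the author's exact code):

```
import torch

# Enable TensorFloat-32 kernels on Ampere-class GPUs (e.g. A100) for
# matrix multiplications and cuDNN convolutions.
torch.backends.cuda.matmul.allow_tf32 = True
torch.backends.cudnn.allow_tf32 = True
```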

@tjruwase, it did work with bf16. The only question I have is: can I use max_memory to restrict the memory used by the model during fine-tuning, like the...
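For reference, max_memory is the per-device cap used by the transformers/accelerate big-model loading path; a sketch of the familiar usage (the checkpoint name and limits are assumptions, and whether this applies under DeepSpeed fine-tuning is exactly the open question):

```
from transformers import AutoModelForSeq2SeqLM

# Cap what accelerate may place on GPU 0 vs. the CPU when dispatching the model.
model = AutoModelForSeq2SeqLM.from_pretrained(
    "google/flan-t5-xxl",
    device_map="auto",
    max_memory={0: "30GiB", "cpu": "100GiB"},
)
```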