Ditto P S


I have tried ZeRO-3, but the RAM is getting overloaded because of the parameter offloading. I currently have 420 GB of RAM and 4× A100 80 GB GPUs.

Any thoughts on how many GPUs might be needed? Also, I'm not seeing the current 4 GPUs getting filled with ZeRO-3; instead it's consuming host RAM.
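
In case it's useful, here is a minimal sketch of the ZeRO-3 settings I'm experimenting with to keep parameters on the GPUs instead of host RAM. The batch sizes, dtype, and file name are placeholders, not my exact config:

```python
# Minimal ZeRO-3 config sketch: setting offload_param / offload_optimizer
# to "none" keeps parameters and optimizer state in GPU memory instead of
# offloading them to host RAM. Values below are placeholders.
import json

ds_config = {
    "train_micro_batch_size_per_gpu": 2,
    "gradient_accumulation_steps": 1,
    "bf16": {"enabled": True},
    "zero_optimization": {
        "stage": 3,
        "offload_param": {"device": "none"},      # "cpu" is what fills up RAM
        "offload_optimizer": {"device": "none"},
    },
}

# Written out as JSON, this is the file the launcher/Trainer would consume.
with open("ds_config_zero3.json", "w") as f:
    json.dump(ds_config, f, indent=2)
```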

I'm not calling that function in my script. I was following the example here to enable flash attention: https://github.com/huggingface/optimum-habana/blob/main/examples/language-modeling/run_lora_clm.py

Here is my train script:

```
import pickle
import os
from...
```
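
For context, the model-loading step looks roughly like this. `attn_implementation="flash_attention_2"` is the standard transformers flag for requesting flash attention; whether the Gaudi backend maps it to its fused kernel is my assumption here:

```python
# Hedged sketch of the loading step (not my full script).
# Assumes the Gaudi stack honours the standard transformers flag.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model = AutoModelForCausalLM.from_pretrained(
    "budecosystem/boomer-1b",
    torch_dtype=torch.bfloat16,
    attn_implementation="flash_attention_2",  # request flash attention
)
tokenizer = AutoTokenizer.from_pretrained("budecosystem/boomer-1b")
```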

Here is the command:

```
deepspeed train-gaudi.py \
  --base_model budecosystem/boomer-1b \
  --output_dir output/boomer \
  --data_path roneneldan/TinyStories \
  --learning_rate 1e-3 \
  --num_train_epochs 1 \
  --per_device_train_batch_size 2 \
  --gradient_accumulation_steps 1 \
  --lr_scheduler_type cosine \
  --warmup_ratio 0.1 \
  --report_to wandb \
  --logging_steps 10 \
  --save_strategy...
```

Thanks for your support. For some reason, I'm able to run the script without any issues now. I have another question: does this flash attention have the same effect as...

I tried with the latest code from the main branch, but I'm still getting the same issue.

@ArthurZucker I have the Meta weights and tokenizer; the issue I shared is with those. For sentencepiece, is there a specific version to be used?
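
For reference, this is the minimal check I'm running on my side; the tokenizer path below is just a placeholder for wherever the Meta `tokenizer.model` file lives:

```python
# Minimal repro sketch: load the Meta tokenizer.model directly with
# sentencepiece, to separate sentencepiece problems from transformers ones.
# The path is a placeholder, not my actual layout.
import sentencepiece as spm

print("sentencepiece version:", spm.__version__)

sp = spm.SentencePieceProcessor(model_file="llama/tokenizer.model")
print(sp.encode("Hello world", out_type=str))
```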