abpani
Even though I tried with DeepSpeed and Accelerate, I can't do more than a batch size of 1.
Now I can do a batch size of 2, but the last GPU is almost full. It may give OOM on a specific sample...
You suggested trying DeepSpeed. I tried, but my accelerator.process_index shows only GPU 0, so the model gets loaded onto GPU 0 only. Then it raises an OOM error with batch...
This is the traceback when I do batch size 3:
Traceback (most recent call last):
  File "/home/ubuntu/abpani/FundName/llama3_8b_qlora-bnb.py", line 94, in
    trainer.train()
  File "/home/ubuntu/abpani/FundName/myenv/lib/python3.10/site-packages/trl/trainer/sft_trainer.py", line 450, in train
    output = super().train(*args,...
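For reference, here is a minimal sketch of the loading pattern I understand is recommended for QLoRA with data parallelism, where each process pins its own 4-bit copy to its GPU via `device_map={"": process_index}` (the model id and quantization settings below are placeholders, not my exact script):

```python
# Minimal sketch (not my exact script): pin each process's 4-bit model
# to its own GPU instead of letting everything land on cuda:0.
import torch
from accelerate import Accelerator
from transformers import AutoModelForCausalLM, BitsAndBytesConfig

accelerator = Accelerator()
print("process_index:", accelerator.process_index)  # should differ per GPU under `accelerate launch`

bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16,
)

model = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Meta-Llama-3.1-8B-Instruct",     # placeholder model id
    quantization_config=bnb_config,
    device_map={"": accelerator.process_index},  # one full copy per process/GPU
)
```

With `accelerate launch --num_processes 2 script.py`, process_index should print 0 and 1 on the two GPUs; if it always prints 0, the script is not actually running under the launcher.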
Now I just tried Mistral v0.3 Instruct with per_device_train_batch_size = 6, per_device_eval_batch_size = 6, gradient_accumulation_steps = 8, and it is working great.
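For completeness, those settings in `TrainingArguments` form (only the three values above are from my run; everything else is an illustrative placeholder):

```python
from transformers import TrainingArguments

training_args = TrainingArguments(
    output_dir="mistral-v03-qlora",     # placeholder
    per_device_train_batch_size=6,
    per_device_eval_batch_size=6,
    gradient_accumulation_steps=8,
    bf16=True,                          # assumption, not stated above
)
```

That works out to an effective batch size of 6 × 8 = 48 per GPU.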
@qgallouedec I am trying with an 8192 context length. I tried installing a lot of package combinations, but it is still the same issue with models that have a large vocab size.
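A rough sketch of why the vocab size seems to matter here (the vocab sizes are approximate and only for illustration): the logits tensor alone scales with batch × sequence length × vocab size.

```python
def logits_gib(batch, seq_len, vocab_size, bytes_per_elem=4):
    """Size of one fp32 logits tensor of shape (batch, seq_len, vocab_size), in GiB."""
    return batch * seq_len * vocab_size * bytes_per_elem / 1024**3

# Approximate vocab sizes, for illustration only: Llama 3.1 ~128k, Mistral v0.3 ~33k
print(logits_gib(2, 8192, 128256))   # ~7.8 GiB just for the logits
print(logits_gib(2, 8192, 32768))    # 2.0 GiB
```

The loss computation usually materializes at least one more copy of the logits, so a ~128k-vocab model at 8192 tokens is far heavier per sample than a ~33k-vocab one.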
@qgallouedec The issue still persists with the Qwen2.5 3B model. I can't even go beyond a batch size of 1 with bitsandbytes 4-bit and context_length = 5192. I am unable to think why you...
Still the same issue. It shows different errors, like tensors loaded on different devices (cuda:0 and cuda:1). device_map = {'model.embed_tokens': 0, 'model.layers.0': 0, 'model.layers.1': 0, 'model.layers.2': 0, 'model.layers.3': 0,...
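For what it's worth, this is the kind of check I mean when looking at where the layers end up (the model id is a placeholder; `hf_device_map` is set when a model is loaded with a `device_map`):

```python
from collections import Counter
from transformers import AutoModelForCausalLM

model = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Meta-Llama-3.1-8B-Instruct",    # placeholder model id
    device_map="auto",                          # lets accelerate spread layers over the GPUs
)
print(model.hf_device_map)                      # a dict like the one above
print(Counter(model.hf_device_map.values()))    # how many modules landed on each device
```

With `device_map="auto"` the layers are sharded across the GPUs (naive model parallelism), so activations move between cuda:0 and cuda:1 during the forward pass; combining that with per-process data-parallel training is what tends to produce the "tensors on different devices" errors, whereas `device_map={"": accelerator.process_index}` keeps one full copy per GPU.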
@LysandreJik You can find the details about the device map here: https://huggingface.co/meta-llama/Meta-Llama-3.1-8B-Instruct/blob/main/model.safetensors.index.json
@LysandreJik I tried as you suggested; still the same issue in a multi-GPU environment.