alpaca-lora
Report hardware specs and parameters with which training works
It might be a good idea to share the hardware spec and parameters with which you got fine-tuning to work, to give a sense of the hardware requirements.
RTX 4080 16GB
System
- RTX 4080 16GB
- Intel i7 13700
- 64GB RAM
- Ubuntu 22.04.2 LTS
Parameters
Model size: 7B
MICRO_BATCH_SIZE = 3
EPOCHS = 2
Peak VRAM usage is about 15.8GB, which is nearly out of memory. It may be more comfortable to set MICRO_BATCH_SIZE to 2.
The training completed in 5.5 hours, and the resulting model behaves similarly to the published model.
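If you want to check how much headroom you have on your own card, here is a minimal sketch for reading peak VRAM with torch.cuda after a run (the helper name is just illustrative, it is not part of finetune.py):

import torch

def report_peak_vram(device: int = 0) -> None:
    # peak statistics since the last reset_peak_memory_stats() call
    peak_alloc = torch.cuda.max_memory_allocated(device) / 1024**3
    peak_reserved = torch.cuda.max_memory_reserved(device) / 1024**3
    total = torch.cuda.get_device_properties(device).total_memory / 1024**3
    print(f"peak allocated: {peak_alloc:.1f} GiB")
    print(f"peak reserved:  {peak_reserved:.1f} GiB")
    print(f"device total:   {total:.1f} GiB")

# call torch.cuda.reset_peak_memory_stats() before training starts,
# then report_peak_vram() once training finishes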
Hm, no, that's not normal. My progress bar is moving. It should also show you some additional training information periodically.
I am thinking of getting a GeForce RTX 3060. Does anybody know, or can anyone guess, whether it can be used for fine-tuning the model? Can the training time be estimated?
GeForce RTX 3060: likely not enough memory. 3060s come with 8 or 12 GB.
What are the GPU requirements for fine-tuning the smallest model?
RTX 4070 Ti
Training for GPUs w/ < 16GB VRAM
System
- RTX 4070Ti 12GB
- Ryzen 9 7900x
- 32 GB RAM
- Linux Mint 21.1
Parameters
MICRO_BATCH_SIZE = 1
BATCH_SIZE = 32
GRADIENT_ACCUMULATION_STEPS = BATCH_SIZE // MICRO_BATCH_SIZE
EPOCHS = 2
LEARNING_RATE = 3e-4
CUTOFF_LEN = 256
LORA_R = 8
LORA_ALPHA = 16
LORA_DROPOUT = 0.05
VAL_SET_SIZE = 2000
You may be able to set MICRO_BATCH_SIZE = 2, but I have not tested that yet.
Total training time was nearly 9.5 hours (not ideal).
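For anyone wondering how those values interact: the effective batch size is MICRO_BATCH_SIZE * GRADIENT_ACCUMULATION_STEPS, so lowering MICRO_BATCH_SIZE only changes how much fits in VRAM at once, not the optimization itself. A minimal sketch of how they map onto the standard Hugging Face TrainingArguments (the exact wiring in your copy of finetune.py may differ):

from transformers import TrainingArguments

MICRO_BATCH_SIZE = 1                   # what fits in 12 GB of VRAM per step
BATCH_SIZE = 32                        # effective batch size after accumulation
GRADIENT_ACCUMULATION_STEPS = BATCH_SIZE // MICRO_BATCH_SIZE  # 32 micro-batches per optimizer step

training_args = TrainingArguments(
    per_device_train_batch_size=MICRO_BATCH_SIZE,
    gradient_accumulation_steps=GRADIENT_ACCUMULATION_STEPS,
    num_train_epochs=2,
    learning_rate=3e-4,
    fp16=True,
    output_dir="./lora-alpaca",
)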
2 * RTX 3090 Ti takes about 4.5 hours for 3 epochs with the default parameters. VRAM usage is about 20 GB.
@UranusSeven, I am also running 2 * 3090. How did you force-distribute memory across both cards for fine-tuning? I tried setting max-memory and updated to the new finetune.py from git, but no luck... it still maxes out on card 0. Thanks.
Here's what I did:
WORLD_SIZE=2 CUDA_VISIBLE_DEVICES=0,1 torchrun \
--nproc_per_node=2 \
--master_port=1234 \
finetune.py \
--base_model 'decapoda-research/llama-7b-hf' \
--data_path '/path/to/alpaca_data.json' \
--output_dir './lora-alpaca'
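For anyone else hitting the same thing: with torchrun, each process loads the model onto its own GPU rather than letting a single copy be spread across both cards. A rough sketch of the usual DDP handling (paraphrased, so check your copy of finetune.py for the exact code):

import os

# Under torchrun, WORLD_SIZE > 1 and each process gets its own LOCAL_RANK,
# so the whole model is placed on that rank's GPU rather than device_map="auto".
world_size = int(os.environ.get("WORLD_SIZE", 1))
ddp = world_size != 1
device_map = {"": int(os.environ.get("LOCAL_RANK", 0))} if ddp else "auto"

# device_map is then passed to from_pretrained(...); gradient accumulation is
# typically divided by world_size so the effective batch size stays the same.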
@pyamin1878 I use the same hardware as you; however, it is hard for me to train even when using your parameters. Would you please share your full parameters or give me some advice?
outputs = torch.empty_like(tensor)  # note: not using .index_copy because it was slower on cuda
torch.cuda.OutOfMemoryError: CUDA out of memory. Tried to allocate 44.00 MiB (GPU 0; 11.72 GiB total capacity; 9.35 GiB already allocated; 38.69 MiB free; 10.71 GiB reserved in total by PyTorch) If reserved memory is >> allocated memory try setting max_split_size_mb to avoid fragmentation. See documentation for Memory Management and PYTORCH_CUDA_ALLOC_CONF
6%|███▉ | 200/3110 [35:11<8:32:00, 10.56s/it]
Can you show me your parameters? It looks like you got an OOM error during training, is that correct?
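The traceback also suggests trying max_split_size_mb. If you still hit the OOM after lowering the micro batch size, here is a minimal sketch of setting it via PYTORCH_CUDA_ALLOC_CONF (128 is just an example value; it has to be set before torch touches CUDA):

import os

# must be set before the first CUDA allocation, e.g. at the top of finetune.py
# or exported in the shell before launching
os.environ["PYTORCH_CUDA_ALLOC_CONF"] = "max_split_size_mb:128"

import torch  # imported after setting the env var on purpose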
@EveningLin Also, you may need to make sure you are using bitsandbytes==0.37.2.
check the issue here: https://github.com/TimDettmers/bitsandbytes/issues/324#issue-1670782993
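If you are not sure which version you ended up with after pip installing, a quick check (assumes bitsandbytes is installed in the same environment):

from importlib.metadata import version

# the OOM-on-save issue linked above was worked around by pinning 0.37.2
print("bitsandbytes", version("bitsandbytes"))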
@pyamin1878 Thank you very much. I had been puzzled for several hours trying to figure out why I was getting an OOM error when saving the model. I downgraded to 0.37.2 and it worked with no problem. For reference, I am using an RTX 3060 12 GB and the decapoda-research/llama-7b-hf model.