
The OOM problem caused by the Transformers version

Open kiseliu opened this issue 2 years ago • 2 comments

A month ago, I trained Alpaca with 4 A100 GPUs (80 GB each) and per_device_train_batch_size=4, using transformers==4.28.1.

Today I retrained Alpaca with the same hardware and the same code, but I hit an OOM error; training only works with per_device_train_batch_size=1. Through wandb, I found that the transformers version in my virtual environment was transformers==4.31.dev0. After changing the transformers version back to 4.28.1, I can train Alpaca with per_device_train_batch_size=4 again.
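Since the regression comes from the installed library version rather than the code, a runtime guard can catch the mismatch before a long training run starts. Below is a minimal sketch (not part of the original training script) that checks the installed version against the one known to work; the package name and version come from this thread, and the helper name is hypothetical:

```python
from importlib.metadata import version, PackageNotFoundError

def check_version(package: str, expected: str) -> bool:
    """Return True if `package` is installed at exactly `expected`."""
    try:
        return version(package) == expected
    except PackageNotFoundError:
        # Package is not installed at all.
        return False

# Guard the training script against the version that triggered the OOM.
# "transformers" / "4.28.1" are taken from the thread above.
if not check_version("transformers", "4.28.1"):
    print("Warning: transformers != 4.28.1; "
          "per_device_train_batch_size=4 may OOM")
```

Pinning `transformers==4.28.1` in the environment (e.g. in requirements) avoids silently picking up a dev version like 4.31.dev0.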

Does anyone have an idea why?

kiseliu avatar Jun 12 '23 13:06 kiseliu

Are you using decapoda-research/llama-7b-hf, and the exact same training command as in the README?

yxchng avatar Jun 25 '23 02:06 yxchng

> Are you using decapoda-research/llama-7b-hf, and the exact same training command as in the README?

Yes, the exact same training command as in the README.

kiseliu avatar Jun 25 '23 13:06 kiseliu