llm-foundry
GPU OOM while fine-tuning MPT-7B
Hi,
I want to fine-tune MPT-7B and I get an OOM error. This is what I run:
python ./llm-foundry/scripts/train/train.py ./llm-foundry/scripts/train/yamls/finetune/mpt-7b_dolly_sft.yaml train_loader.dataset.split=train eval_loader.dataset.split=test
I have changed the sequence length (seq_length) to 512, and otherwise I have not changed anything in the yaml file since I want to fine-tune it on the default dataset first. I use a g4dn.12xlarge instance. Appreciate any input.
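To be concrete, by changing seq_length I mean capping the maximum sequence length at 512; assuming the relevant key in the stock finetune yaml is max_seq_len, an equivalent CLI override (no yaml edit needed) would look like this:

# Same command as above, with the sequence length passed as an override
# (max_seq_len is an assumption about the yaml key name).
python ./llm-foundry/scripts/train/train.py \
    ./llm-foundry/scripts/train/yamls/finetune/mpt-7b_dolly_sft.yaml \
    max_seq_len=512 \
    train_loader.dataset.split=train \
    eval_loader.dataset.split=test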
I believe that AWS instance has 4x T4 ≈ 64 GB of VRAM. You want at least twice that. Also, this stack is mostly tested on A100s, and there have been reports in the Triton repo that Triton does not work (or does not work well) on T4s.
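For rough numbers, here is a back-of-the-envelope sketch assuming full fine-tuning with bf16 weights/gradients and fp32 Adam optimizer states (a ballpark, not an official requirement):

# Ballpark VRAM for fully fine-tuning a 7B-parameter model, before activations:
python -c "
params = 7e9
weights_gb = params * 2 / 1e9  # bf16 weights
grads_gb   = params * 2 / 1e9  # bf16 gradients
adam_gb    = params * 8 / 1e9  # fp32 Adam moments (m and v)
print(f'~{weights_gb + grads_gb + adam_gb:.0f} GB before activations')
"

That is roughly 84 GB before activations and framework overhead, so 64 GB of pooled T4 memory will not fit even with perfect sharding.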
@samhavens thanks for your comment. I cannot run it on a g5.48xlarge with 768 GB of memory either; each GPU has 24 GB of RAM.
After running nvidia-smi I see 8 GPUs with these specifications:
0 NVIDIA A10G Off | 00000000:00:16.0 Off | 0 | | 0% 30C P0 59W / 300W | 0MiB / 22731MiB | 0% Default
Do you get an OOM with the A10s as well, or a different error?
Yes, I get an OOM with the A10s as well @samhavens. This is the whole error:
/llm-foundry/scripts/train/train.py:268 in <module>

    265       yaml_cfg = om.load(f)
    266   cli_cfg = om.from_cli(args_list)
    267   cfg = om.merge(yaml_cfg, cli_cfg)
 ❱  268   main(cfg)
    269

/llm-foundry/scripts/train/train.py:210 in main

    207
    208   # Build the Trainer
    209   print('Building trainer...')
 ❱  210   trainer = Trainer(
    211       run_name=cfg.run_name,
    212       seed=cfg.seed,
    213       model=model,

/local_disk0/.ephemeral_nfs/envs/pythonEnv-51886180-d57b-46eb-82e9-4b389eeb2c8b/lib/python3.10/site-packages/composer/trainer/trainer.py:957 in __init__

    954           prepare_fsdp_module(model, optimizers, self.fsdp_config,
    955
    956       # Reproducibility
 ❱  957       rank_zero_seed, seed = _distribute_and_get_random_seed(seed,
    958       # If hparams is used to create the Trainer this function is c
    959       # which is okay because all runs with the hparams codepath wi
    960       reproducibility.seed_all(seed)

/local_disk0/.ephemeral_nfs/envs/pythonEnv-51886180-d57b-46eb-82e9-4b389eeb2c8b/lib/python3.10/site-packages/composer/trainer/trainer.py:288 in _distribute_and_get_random_seed

    285       raise ValueError(f'Invalid seed: {seed}. It must be on [0; 2*
    286
    287   # using int64 to prevent overflow
 ❱  288   rank_zero_seed = device.tensor_to_device(torch.tensor([seed], dty
    289   if dist.get_world_size() > 1:
    290       dist.broadcast(rank_zero_seed, src=0)
    291   rank_zero_seed = rank_zero_seed.item()

/local_disk0/.ephemeral_nfs/envs/pythonEnv-51886180-d57b-46eb-82e9-4b389eeb2c8b/lib/python3.10/site-packages/composer/devices/device_gpu.py:59 in tensor_to_device

    56       return module.to(self._device)
    57
    58   def tensor_to_device(self, tensor: torch.Tensor) -> torch.Tensor:
 ❱  59       return tensor.to(self._device, non_blocking=True)
    60
    61   def state_dict(self) -> Dict[str, Any]:
    62       return {
Also, this is at the beginning, if that helps:
%sh python /llm-foundry/scripts/train/train.py /llm-foundry/scripts/train/yamls/finetune/mpt-7b_dolly_sft.yaml train_loader.dataset.split=train eval_loader.dataset.split=test
2023-06-15 02:47:54.630018: I tensorflow/core/platform/cpu_feature_guard.cc:193] This TensorFlow binary is optimized with oneAPI Deep Neural Network Library (oneDNN) to use the following CPU instructions in performance-critical operations: AVX2 FMA
To enable them in other operations, rebuild TensorFlow with the appropriate compiler flags.
/llm-foundry/scripts/train/train.py:131: UserWarning: FSDP is not applicable for single-GPU training. Reverting to DDP.
  warnings.warn(
Running on device: cuda
Appreciate any input!
Hello @benam2, this may be a clue:
UserWarning: FSDP is not applicable for single-GPU training
Are you using a launcher to create multiple processes, one for each GPU? If not, you may only be using 1 of the A10s in your node.
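Launching with plain python starts a single process, which only uses one GPU and matches the warning above. The composer launcher that ships with Composer starts one process per GPU; roughly (adjust paths and the process count to your setup):

# Start one training process per GPU (8x A10G on a g5.48xlarge);
# -n sets the number of processes, and the arguments mirror your original command.
composer -n 8 /llm-foundry/scripts/train/train.py \
    /llm-foundry/scripts/train/yamls/finetune/mpt-7b_dolly_sft.yaml \
    train_loader.dataset.split=train \
    eval_loader.dataset.split=test

With multiple ranks, FSDP no longer falls back to DDP and the model can be sharded across the eight 24 GB cards.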
Closing as stale. Please open a new issue if you are still encountering problems.