llm-foundry
GPU OOM while fine-tuning MPT-7B
Hi,
I want to fine-tune MPT-7B and I get an OOM error. This is what I run:
python ./llm-foundry/scripts/train/train.py ./llm-foundry/scripts/train/yamls/finetune/mpt-7b_dolly_sft.yaml train_loader.dataset.split=train eval_loader.dataset.split=test
I have changed the sequence length (seq_length) to 512, and otherwise I have not changed anything in the yaml file since I want to fine-tune it on the default dataset first. I use a g4dn.12xlarge instance. Appreciate any input.
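To be concrete, by changing seq_length I mean capping the maximum sequence length at 512; assuming the relevant key in the stock finetune yaml is max_seq_len, an equivalent CLI override (no yaml edit needed) would look like this:

# Same command as above, with the sequence length passed as an override
# (max_seq_len is an assumption about the yaml key name).
python ./llm-foundry/scripts/train/train.py \
    ./llm-foundry/scripts/train/yamls/finetune/mpt-7b_dolly_sft.yaml \
    max_seq_len=512 \
    train_loader.dataset.split=train \
    eval_loader.dataset.split=test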
I believe that AWS instance has 4x T4 ≈ 64 GB of VRAM. You want at least twice that. Also, this stack is mostly tested on A100s, and there have been reports in the Triton repo that Triton does not work (or does not work well) on T4s.
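For rough numbers, here is a back-of-the-envelope sketch assuming full fine-tuning with bf16 weights/gradients and fp32 Adam optimizer states (a ballpark, not an official requirement):

# Ballpark VRAM for fully fine-tuning a 7B-parameter model, before activations:
python -c "
params = 7e9
weights_gb = params * 2 / 1e9  # bf16 weights
grads_gb   = params * 2 / 1e9  # bf16 gradients
adam_gb    = params * 8 / 1e9  # fp32 Adam moments (m and v)
print(f'~{weights_gb + grads_gb + adam_gb:.0f} GB before activations')
"

That is roughly 84 GB before activations and framework overhead, so 64 GB of pooled T4 memory will not fit even with perfect sharding.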
@samhavens thanks for your comment. I cannot run it on a g5.48xlarge with 768 GB of memory either; each GPU has 24 GB of RAM.
After running nvidia-smi I see 8 GPUs with these specifications:
0 NVIDIA A10G Off | 00000000:00:16.0 Off | 0 | | 0% 30C P0 59W / 300W | 0MiB / 22731MiB | 0% Default
Do you get an OOM with the A10s as well, or a different error?
Yes, I get an OOM with the A10s as well @samhavens. This is the whole error:
/llm-foundry/scripts/train/train.py:268 in <module>

    265       yaml_cfg = om.load(f)
    266   cli_cfg = om.from_cli(args_list)
    267   cfg = om.merge(yaml_cfg, cli_cfg)
 ❱  268   main(cfg)
    269

/llm-foundry/scripts/train/train.py:210 in main

    207
    208   # Build the Trainer
    209   print('Building trainer...')
 ❱  210   trainer = Trainer(
    211       run_name=cfg.run_name,
    212       seed=cfg.seed,
    213       model=model,

/local_disk0/.ephemeral_nfs/envs/pythonEnv-51886180-d57b-46eb-82e9-4b389eeb2c8b/lib/python3.10/site-packages/composer/trainer/trainer.py:957 in __init__

    954           prepare_fsdp_module(model, optimizers, self.fsdp_config,
    955
    956       # Reproducibility
 ❱  957       rank_zero_seed, seed = _distribute_and_get_random_seed(seed,
    958       # If hparams is used to create the Trainer this function is c
    959       # which is okay because all runs with the hparams codepath wi
    960       reproducibility.seed_all(seed)

/local_disk0/.ephemeral_nfs/envs/pythonEnv-51886180-d57b-46eb-82e9-4b389eeb2c8b/lib/python3.10/site-packages/composer/trainer/trainer.py:288 in _distribute_and_get_random_seed

    285       raise ValueError(f'Invalid seed: {seed}. It must be on [0; 2*
    286
    287   # using int64 to prevent overflow
 ❱  288   rank_zero_seed = device.tensor_to_device(torch.tensor([seed], dty
    289   if dist.get_world_size() > 1:
    290       dist.broadcast(rank_zero_seed, src=0)
    291   rank_zero_seed = rank_zero_seed.item()

/local_disk0/.ephemeral_nfs/envs/pythonEnv-51886180-d57b-46eb-82e9-4b389eeb2c8b/lib/python3.10/site-packages/composer/devices/device_gpu.py:59 in tensor_to_device

    56       return module.to(self._device)
    57
    58   def tensor_to_device(self, tensor: torch.Tensor) -> torch.Tensor:
 ❱  59       return tensor.to(self._device, non_blocking=True)
    60
    61   def state_dict(self) -> Dict[str, Any]:
    62       return {
Also, this is at the beginning, if that helps:
%sh python /llm-foundry/scripts/train/train.py /llm-foundry/scripts/train/yamls/finetune/mpt-7b_dolly_sft.yaml train_loader.dataset.split=train eval_loader.dataset.split=test
2023-06-15 02:47:54.630018: I tensorflow/core/platform/cpu_feature_guard.cc:193] This TensorFlow binary is optimized with oneAPI Deep Neural Network Library (oneDNN) to use the following CPU instructions in performance-critical operations: AVX2 FMA
To enable them in other operations, rebuild TensorFlow with the appropriate compiler flags.
/llm-foundry/scripts/train/train.py:131: UserWarning: FSDP is not applicable for single-GPU training. Reverting to DDP.
  warnings.warn(
Running on device: cuda
Appreciate any input!
Hello @benam2, this may be a clue:
UserWarning: FSDP is not applicable for single-GPU training
Are you using a launcher to create multiple processes, one for each GPU? If not, you may only be using 1 of the A10s in your node.
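Launching with plain python starts a single process, which only uses one GPU and matches the warning above. The composer launcher that ships with Composer starts one process per GPU; roughly (adjust paths and the process count to your setup):

# Start one training process per GPU (8x A10G on a g5.48xlarge);
# -n sets the number of processes, and the arguments mirror your original command.
composer -n 8 /llm-foundry/scripts/train/train.py \
    /llm-foundry/scripts/train/yamls/finetune/mpt-7b_dolly_sft.yaml \
    train_loader.dataset.split=train \
    eval_loader.dataset.split=test

With multiple ranks, FSDP no longer falls back to DDP and the model can be sharded across the eight 24 GB cards.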
Closing as stale. Please open a new issue if you are still encountering problems.