OOM issue when finetune with V100

Open bingjie3216 opened this issue 2 years ago • 11 comments

I tried to run the training.trainer script with batch size == 1 (originally it is 8), but hit an OOM issue with a V100.

Has anyone tried to finetune it on a V100-32G, or on any machine that does not have 80 GB like the A100?

Here is my training script, based on the instructions:

    deepspeed --num_gpus=8 --module training.trainer --deepspeed ./dolly/config/ds_z3_bf16_config.json --epochs 1 --local-output-dir ./dolly/local_output_dir --dbfs-output-dir ./dolly/dbfs_output_dir --per-device-train-batch-size 1 --per-device-eval-batch-size 1 --lr 1e-5

bingjie3216 avatar Mar 25 '23 15:03 bingjie3216

I have not, but have a few ideas for you if you want to experiment:

V100s do not support bf16 like Ampere GPUs do. To be sure you are at least using fp16, add --fp16 to this command. You may need to add this to the deepspeed config to ensure it respects that:

    "fp16": {
      "enabled": "auto"
    },

This may affect the quality of the resulting model.

Try adding --gradient_checkpointing

Try deleting the "optimizer" section of the deepspeed config, and passing --optim adafactor, which is a more memory-efficient optimizer.

There is more you can do with deepspeed. These are just some ideas that helped me in a different large tuning problem.
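
Putting those together, the command from the top of this issue might look roughly like this. This is only a sketch; I have not run it, and you should check that the trainer actually accepts these flags as written:

    # Sketch only: the original command plus the ideas above (fp16, gradient checkpointing, adafactor).
    # Verify that training.trainer forwards these options before relying on them.
    deepspeed --num_gpus=8 --module training.trainer \
      --deepspeed ./dolly/config/ds_z3_bf16_config.json \
      --epochs 1 \
      --local-output-dir ./dolly/local_output_dir \
      --dbfs-output-dir ./dolly/dbfs_output_dir \
      --per-device-train-batch-size 1 \
      --per-device-eval-batch-size 1 \
      --lr 1e-5 \
      --fp16 \
      --gradient_checkpointing \
      --optim adafactor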

srowen avatar Mar 25 '23 17:03 srowen

Thanks a lot for sharing the knowledge. I have tried a bunch of things; let's see how it goes:

I lowered the batch size to 1 and also modified the ds config as follows (allowing offload to CPU):

{
    "fp16": {
        "enabled": "auto",
        "loss_scale": 0,
        "loss_scale_window": 1000,
        "initial_scale_power": 16,
        "hysteresis": 2,
        "min_loss_scale": 1
    },
    "optimizer": {
      "type": "AdamW",
      "params": {
        "lr": "auto",
        "betas": "auto",
        "eps": "auto",
        "weight_decay": "auto"
      }
    },
    "scheduler": {
      "type": "WarmupLR",
      "params": {
        "warmup_min_lr": "auto",
        "warmup_max_lr": "auto",
        "warmup_num_steps": "auto"
      }
    },
    "zero_optimization": {
      "stage": 3,
      "offload_param": {
            "device": "cpu",
            "pin_memory": true
        },
      "overlap_comm": true,
      "contiguous_gradients": true,
      "sub_group_size": 1e9,
      "reduce_bucket_size": "auto",
      "stage3_prefetch_bucket_size": "auto",
      "stage3_param_persistence_threshold": "auto",
      "stage3_max_live_parameters": 1e9,
      "stage3_max_reuse_distance": 1e9,
      "stage3_gather_16bit_weights_on_model_save": true
    },
    "gradient_accumulation_steps": "auto",
    "gradient_clipping": "auto",
    "steps_per_print": 2000,
    "train_batch_size": "auto",
    "train_micro_batch_size_per_gpu": "auto",
    "wall_clock_breakdown": false
  }
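
If this still runs out of GPU memory, one further option (untried here, and it costs extra host RAM) would be offloading the optimizer state to CPU as well, by adding something like this inside the existing "zero_optimization" block next to "offload_param":

    "offload_optimizer": {
        "device": "cpu",
        "pin_memory": true
    },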

bingjie3216 avatar Mar 25 '23 18:03 bingjie3216

Does this demo require A100-40G or A100-80G?

youkaichao avatar Mar 26 '23 04:03 youkaichao

Does this demo require A100-40G or A100-80G?

It has trained successfully on 8 A100 40 GB (e.g. Standard_ND96asr_v4).

matthayes avatar Mar 26 '23 05:03 matthayes

Some updates with the V100: I am running it on V100-32G GPUs, 8 of them on one node.

The last step shows:

    {'eval_loss': 1.3447265625, 'eval_runtime': 25.5031, 'eval_samples_per_second': 39.211, 'eval_steps_per_second': 2.47, 'epoch': 1.0}
    Training completed. Do not forget to share your model on huggingface.co/models =)
    ..
    [2023-03-26 11:05:19,559] [INFO] [engine.py:2963:_load_zero_checkpoint] loading 8 zero partition checkpoints for rank 0
    {'train_runtime': 16393.4453, 'train_samples_per_second': 3.111, 'train_steps_per_second': 0.097, 'train_loss': 1.3751295118439602, 'epoch': 1.0}

However, the code seems to fail in the last step:

    Deleting older checkpoint [/home/azureuser/cloudfiles/code/Users/xxx/code/dolly/local_output_dir_0325/checkpoint-1400] due to args.save_total_limit
    2023-03-26 11:05:19 ERROR [main] main failed
    ...

      File "/mnt/batch/tasks/shared/LS_root/mounts/clusters/xxx-gpu-dolly/code/Users/jbing/code/dolly/training/trainer.py", line 207, in train
        trainer.train()
      File "/anaconda/envs/dolly/lib/python3.10/site-packages/transformers/trainer.py", line 1527, in train
        return inner_training_loop(
      File "/anaconda/envs/dolly/lib/python3.10/site-packages/transformers/trainer.py", line 1920, in _inner_training_loop
        shutil.rmtree(checkpoint)
      File "/anaconda/envs/dolly/lib/python3.10/shutil.py", line 725, in rmtree
        _rmtree_safe_fd(fd, path, onerror)
      File "/anaconda/envs/dolly/lib/python3.10/shutil.py", line 681, in _rmtree_safe_fd
        onerror(os.unlink, fullname, sys.exc_info())
      File "/anaconda/envs/dolly/lib/python3.10/shutil.py", line 679, in _rmtree_safe_fd
        os.unlink(entry.name, dir_fd=topfd)
    FileNotFoundError: [Errno 2] No such file or directory: 'added_tokens.json'

      File "/mnt/batch/tasks/shared/LS_root/mounts/clusters/xxx-gpu-dolly/code/Users/jbing/code/dolly/training/trainer.py", line 207, in train
        trainer.train()
      File "/anaconda/envs/dolly/lib/python3.10/site-packages/transformers/trainer.py", line 1527, in train
        return inner_training_loop(
      File "/anaconda/envs/dolly/lib/python3.10/site-packages/transformers/trainer.py", line 1920, in _inner_training_loop
        shutil.rmtree(checkpoint)
      File "/anaconda/envs/dolly/lib/python3.10/shutil.py", line 725, in rmtree
        _rmtree_safe_fd(fd, path, onerror)
      File "/anaconda/envs/dolly/lib/python3.10/shutil.py", line 658, in _rmtree_safe_fd
        _rmtree_safe_fd(dirfd, fullname, onerror)
      File "/anaconda/envs/dolly/lib/python3.10/shutil.py", line 681, in _rmtree_safe_fd
        onerror(os.unlink, fullname, sys.exc_info())
      File "/anaconda/envs/dolly/lib/python3.10/shutil.py", line 679, in _rmtree_safe_fd
        os.unlink(entry.name, dir_fd=topfd)
    FileNotFoundError: [Errno 2] No such file or directory: 'zero_pp_rank_0_mp_rank_00_optim_states.pt'

The good news is that I have checkpoint-1400 to use; the bad news is that I don't have the latest one.

BTW, what was your final training loss, and at what step count? Mine is 1.37.

bingjie3216 avatar Mar 26 '23 15:03 bingjie3216

I am not sure whether it is a bug in the code:

    Deleting older checkpoint [/home/azureuser/cloudfiles/code/Users/jbing/code/dolly/local_output_dir_0325/checkpoint-1400] due to args.save_total_limit

My latest checkpoint is checkpoint-1400, and the one that should be deleted is actually checkpoint-1200, so it seems strange to see such a log message.

bingjie3216 avatar Mar 26 '23 15:03 bingjie3216

I am not sure whether it is a bug in the code:

    Deleting older checkpoint [/home/azureuser/cloudfiles/code/Users/jbing/code/dolly/local_output_dir_0325/checkpoint-1400] due to args.save_total_limit

My latest checkpoint is checkpoint-1400, and the one that should be deleted is actually checkpoint-1200, so it seems strange to see such a log message.

Hello, I hit the same issue. What did you modify to configure the V100 cluster?

yinwangsong avatar Mar 26 '23 15:03 yinwangsong

I am retrying by changing the following parameters in the trainer code: save_total_limit=3, load_best_model_at_end=False,

I think it might be a bug in transformers.

@yinwangsong Do you mean you hit the same error in the last step? As for "What did you modify to configure the V100 cluster?": I am trying the changes above; let's see how it goes.

bingjie3216 avatar Mar 26 '23 15:03 bingjie3216

I am retrying by changing the following parameters in the trainer code: save_total_limit=3, load_best_model_at_end=False,

I think it might be a bug in transformers.

@yinwangsong Do you mean you hit the same error in the last step? As for "What did you modify to configure the V100 cluster?": I am trying the changes above; let's see how it goes.

No. I ran the notebook on 8 V100 GPUs, but an error occurred:

  File "<string>", line 105, in __init__
  File "/databricks/python/lib/python3.9/site-packages/transformers/training_args.py", line 1098, in __post_init__
    raise ValueError(
ValueError: Your setup doesn't support bf16/gpu. You need torch>=1.10, using Ampere GPU with cuda>=11.0

I changed "bf16" to "fp16" in ds_z3_bf16_config.json, but nothing happen...

yinwangsong avatar Mar 26 '23 16:03 yinwangsong

To be totally explicit, you can enable fp16 in the deepspeed config and disable bf16, rather than leaving them at "auto" or whatever you have set. That said, setting them to "auto" and passing --fp16 should be sufficient. I'm not sure how you are running this.
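
Concretely, the top of the deepspeed config would then contain something like this (a sketch; leave the rest of the file as you have it):

    "bf16": {
      "enabled": false
    },
    "fp16": {
      "enabled": true
    },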

srowen avatar Mar 26 '23 17:03 srowen

@bingjie3216 yeah that looks like a problem with Trainer or some issue with the files on your local FS. It can't delete something it wants to delete. Permissions?

srowen avatar Mar 26 '23 17:03 srowen

@srowen it is not about permissions; the failure is that it could not find the file it wants to delete. I believe it is a transformers bug.

I have done the following two things:

  1. changed some of the parameters mentioned above, e.g. load_best_model_at_end=False (see the sketch below);
  2. tried to finetune with Alpaca's trainer code.
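
For anyone following along, the change in 1 amounts to something like this. It is a rough sketch only: the real trainer.py builds many more TrainingArguments from its CLI options, and the values here are just the ones I am testing.

    from transformers import TrainingArguments

    # Sketch only: the actual dolly trainer constructs these arguments from its CLI options.
    training_args = TrainingArguments(
        output_dir="local_output_dir",   # placeholder path
        per_device_train_batch_size=1,
        per_device_eval_batch_size=1,
        save_total_limit=3,              # keep a few recent checkpoints around
        load_best_model_at_end=False,    # skip the end-of-training checkpoint reload/cleanup that failed above
    )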

bingjie3216 avatar Mar 26 '23 19:03 bingjie3216

Update: both 1 and 2 solved the problem I hit earlier.

bingjie3216 avatar Mar 26 '23 20:03 bingjie3216

@bingjie3216 any plans to share your model on huggingface.co/models?

slavakurilyak avatar Mar 27 '23 01:03 slavakurilyak

If anyone is curious, this is how it looks without changing any default parameters on 4 GPUs (A100 80G): [Screenshot 2023-03-27 at 16 05 37]

I guess you can go from batch size of 8 to 4 or 2 to use less memory.

maziyarpanahi avatar Mar 27 '23 14:03 maziyarpanahi

@bingjie3216 Out of curiosity, how much RAM did you use when training Dolly on 8 V100s? I am currently trying to reproduce this on 8 V100s too, but I only have 128 GB of RAM; training takes all of it and quits with code 9. It looks like an OOM, but there is no log to troubleshoot with.

Metal-joker avatar Mar 28 '23 10:03 Metal-joker

Here are some notes on getting training working on A10 and V100 GPUs: https://github.com/databrickslabs/dolly/pull/30/files

srowen avatar Mar 28 '23 15:03 srowen

@bingjie3216 I'm also trying to train the model on a single V100 GPU with 16 GB memory. Can you please share your environment details? I'm facing issues and am not able to train the model.

chintan-donda avatar May 17 '23 11:05 chintan-donda