
Can I use a single gpu to train this model?

Open ElegantLin opened this issue 1 year ago • 9 comments

  • [x] A descriptive title: Can I use a single gpu to train this model?
  • [x] A detailed description: Thanks for your nice work. I want to train the Otter model. May I use a single GPU to train the model? Could you please share your accelerate config? Thanks!

ElegantLin avatar Jun 30 '23 06:06 ElegantLin

We put our configs here.

You can use 1 GPU with at least 33-34 GB of memory to start training. If you load the model in bf16 or fp16, I think you can lower the GPU requirement to 16-20 GB. But we didn't try training a model purely in bf16 or fp16 (we actually use mixed precision).
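As a rough sanity check on those numbers, here is a minimal sketch that estimates the memory taken by the weights alone. The 9e9 parameter count is an assumption read off the "9B" in the checkpoint name, not a measured value, and `weight_memory_gib` is a hypothetical helper, not part of the Otter code.

```python
# Weight-only estimate: parameters times bytes per parameter.
# Assumption: roughly 9e9 parameters, inferred from the "9B" in the model name.
BYTES_PER_PARAM = {"fp32": 4, "fp16": 2, "bf16": 2}


def weight_memory_gib(num_params: float, precision: str) -> float:
    """Return the memory needed to hold the weights alone, in GiB."""
    return num_params * BYTES_PER_PARAM[precision] / 1024**3


if __name__ == "__main__":
    for precision in ("fp32", "fp16", "bf16"):
        print(f"{precision}: {weight_memory_gib(9e9, precision):.1f} GiB")
```

Under that assumption the weights come to about 33.5 GiB in fp32 and about 16.8 GiB in fp16 or bf16, which is in line with the ranges quoted above. Training needs more than this, since gradients, optimizer state and activations also live on the GPU.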

You can refer to the following code to load a model at a given precision.

import argparse

import torch

# Relative import: run this as a module from inside the package that contains modeling_flamingo.py.
from .modeling_flamingo import FlamingoForConditionalGeneration

parser = argparse.ArgumentParser(description="Load model with precision")
parser.add_argument("--load_bit", type=str, choices=["fp16", "bf16"], required=True, help="Choose either 'fp16' or 'bf16'")
args = parser.parse_args()

load_bit = args.load_bit

# Map the command-line choice to the torch dtype used when loading the weights.
if load_bit == "fp16":
    precision = {"torch_dtype": torch.float16}
elif load_bit == "bf16":
    precision = {"torch_dtype": torch.bfloat16}

checkpoint_path = "luodian/OTTER-9B-INIT"
model = FlamingoForConditionalGeneration.from_pretrained(checkpoint_path, device_map="auto", **precision)

# Save to a local directory named after the checkpoint and precision, e.g. luodian/OTTER-9B-INIT_fp16.
checkpoint_path = checkpoint_path + f"_{load_bit}"
model.save_pretrained(checkpoint_path)

Luodian avatar Jul 02 '23 03:07 Luodian

Got it. Thanks a lot!

ElegantLin avatar Jul 02 '23 21:07 ElegantLin

Hi Bo, I tried the following script but got an error. I want to train the model on the LA dataset, and I added image_processor and train_num_samples back to the args. The command I used is:

accelerate launch --config_file=./pipeline/accelerate_configs/accelerate_config_fsdp.yaml \
pipeline/train/instruction_following.py \
--pretrained_model_name_or_path=/path/to/OTTER-9B-INIT \
--mimicit_path="/path/to/Otter/OneDrive_1_6-27-2023/LA/LACR_I2I_instructions.json,/path/to/Otter/OneDrive_1_6-27-2023/LA/LACR_T2T_instructions.json,/path/to/Otter/OneDrive_1_6-27-2023/LA/LACONV_instructions.json,/path/to/Otter/OneDrive_1_6-27-2023/LA/LADD_instructions.json" \
--images_path="/path/to/Otter/mimic-it/convert-it/output/LA.json,/path/to/Otter/mimic-it/convert-it/output/LA.json,/path/to/Otter/mimic-it/convert-it/output/LA.json,/path/to/Otter/mimic-it/convert-it/output/LA.json" \
--train_config_path="/path/to/Otter/OneDrive_1_6-27-2023/LA/LACR_I2I_train.json,/path/to/Otter/OneDrive_1_6-27-2023/LA/LACR_T2T_train.json,/path/to/Otter/OneDrive_1_6-27-2023/LA/LACONV_train.json,/path/to/Otter/OneDrive_1_6-27-2023/LA/LADD_train.json" \
--batch_size=16 \
--num_epoch=6 \
--external_save_dir=./checkpoints \
--save_hf_model \
--workers=8 \
--cross_attn_every_n_layers=4 \
--lr_schedule=cosine \
--learning_rate=1e-5 \
--warmup_steps_ratio=0.01
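Long comma-separated arguments like these are easy to mistype by hand. A minimal sketch of building them programmatically, assuming the same placeholder root `/path/to/Otter` and the four LA subset names that appear in the command; `joined` is a hypothetical helper, not part of the Otter pipeline.

```python
from pathlib import PurePosixPath

ROOT = PurePosixPath("/path/to/Otter")  # placeholder root copied from the command
LA_DIR = ROOT / "OneDrive_1_6-27-2023" / "LA"
SUBSETS = ["LACR_I2I", "LACR_T2T", "LACONV", "LADD"]


def joined(suffix: str) -> str:
    """Build one comma-separated list of <subset>_<suffix>.json paths."""
    return ",".join(str(LA_DIR / f"{name}_{suffix}.json") for name in SUBSETS)


mimicit_path = joined("instructions")
train_config_path = joined("train")
# The same image file is repeated once per subset, as in the command.
images_path = ",".join([str(ROOT / "mimic-it/convert-it/output/LA.json")] * len(SUBSETS))

print(f'--mimicit_path="{mimicit_path}"')
print(f'--images_path="{images_path}"')
print(f'--train_config_path="{train_config_path}"')
```

The printed strings can be pasted into the launch command, so each of the three lists always has the same number of entries.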

The error I got is

Traceback (most recent call last):
  File "/path/to/Otter/pipeline/train/instruction_following.py", line 498, in <module>
    main()
  File "/path/to/Otter/pipeline/train/instruction_following.py", line 451, in main
    model, optimizer, lr_scheduler, mimicit_loaders = accelerator.prepare(model, optimizer, lr_scheduler, mimicit_loaders)
  File "/path/to/miniconda/envs/otter/lib/python3.9/site-packages/accelerate/accelerator.py", line 1182, in prepare
    result = tuple(
  File "/path/to/miniconda/envs/otter/lib/python3.9/site-packages/accelerate/accelerator.py", line 1183, in <genexpr>
    self._prepare_one(obj, first_pass=True, device_placement=d) for obj, d in zip(args, device_placement)
  File "/path/to/miniconda/envs/otter/lib/python3.9/site-packages/accelerate/accelerator.py", line 1022, in _prepare_one
    return self.prepare_model(obj, device_placement=device_placement)
  File "/path/to/miniconda/envs/otter/lib/python3.9/site-packages/accelerate/accelerator.py", line 1311, in prepare_model
    torch.autocast(device_type=self.device.type, dtype=torch.bfloat16)(model.forward.__func__), model
AttributeError: 'function' object has no attribute '__func__'
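The last frame fails because `model.forward` is a plain function at that point, and in Python only bound methods carry a `__func__` attribute. One way this can happen (an assumption about this setup, not something confirmed in the thread) is that a hook or wrapper has replaced `forward` on the model instance. The sketch below reproduces the same error with a toy class; `TinyModel`, `wrap` and `underlying` are hypothetical names used only for illustration.

```python
import functools


class TinyModel:
    """Hypothetical stand-in for a model; only `forward` matters here."""

    def forward(self, x):
        return x + 1


def wrap(fn):
    """Hypothetical wrapper, standing in for any hook that replaces `forward`."""

    @functools.wraps(fn)
    def wrapped(*args, **kwargs):
        return fn(*args, **kwargs)

    return wrapped


def underlying(fn):
    """Return fn.__func__ for bound methods, or fn itself for plain functions."""
    return getattr(fn, "__func__", fn)


plain = TinyModel()
hooked = TinyModel()
hooked.forward = wrap(hooked.forward)  # instance attribute, now a plain function

print(hasattr(plain.forward, "__func__"))   # bound method
print(hasattr(hooked.forward, "__func__"))  # plain function

try:
    hooked.forward.__func__
except AttributeError as err:
    print(err)
```

The `underlying` helper shows one defensive way to handle both cases. It is only an illustration of the failure mode, not a description of how accelerate handles it.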

Could you please give me some suggestions? I am very new to the accelerate package.

The GPU I used is an A100, so I don't think it is a GPU issue.

Thanks!

ElegantLin avatar Jul 02 '23 23:07 ElegantLin

Hi, may I know whether the packages are installed properly? Which versions of accelerate, transformers and torch are you using?

If the issue still exists, you could switch to the yhzhang/dev_laion400m branch and run the same script there. This branch has our latest code changes.

Luodian avatar Jul 03 '23 00:07 Luodian

Yes. My software versions are as follows.

transformers: 4.30.2
accelerate: 0.20.3
torch: 2.0.1
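For anyone else reporting versions in this thread, a small sketch that collects them in one go using only the standard library. `installed_versions` is a hypothetical helper name; `importlib.metadata` requires Python 3.8 or newer.

```python
from importlib.metadata import PackageNotFoundError, version


def installed_versions(packages):
    """Map each package name to its installed version, or None if missing."""
    found = {}
    for name in packages:
        try:
            found[name] = version(name)
        except PackageNotFoundError:
            found[name] = None
    return found


if __name__ == "__main__":
    for name, ver in installed_versions(["transformers", "accelerate", "torch"]).items():
        print(f"{name}: {ver or 'not installed'}")
```

Run it inside the same environment used for training, so the reported versions match what the training script actually imports.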

Sure. I can try the latest branch.

Thanks!

ElegantLin avatar Jul 03 '23 05:07 ElegantLin

I also encountered this problem, and updating the accelerate package to the GitHub version fixed it for me (see this PR: https://github.com/huggingface/accelerate/pull/1637).

zhoudw-zdw avatar Jul 09 '23 13:07 zhoudw-zdw

I also ran into this problem. Could anyone offer me a hand?

Hongbin98 avatar Jul 13 '23 07:07 Hongbin98

Please see the post above first.

ZhangYuanhan-AI avatar Jul 13 '23 07:07 ZhangYuanhan-AI

Thanks very much! I implemented it and trained in fp16, also casting the data to torch.float16, but I hit the error below: [image] Has anyone figured it out?

5RJ avatar Sep 01 '23 04:09 5RJ