Otter
Can I use a single gpu to train this model?
Before you open an issue, please check if a similar issue already exists or has been closed before.
When you open an issue, please be sure to include the following
- [x] A descriptive title: Can I use a single GPU to train this model?
- [x] A detailed description: Thanks for your nice work. I want to train the Otter model. Can I train it on a single GPU? Could you please share your accelerate config? Thanks!
- [ ] Assign an issue type tag (label):
  - dataset (mimic-it download, usage, etc.)
  - demo (online demo)
  - doc (readme, wiki, paper, video etc.)
  - evaluation (evaluation result, performance of Otter etc.)
  - model (model configuration, components, etc.)
  - train (training configuration, process, code, etc.)
Thank you for your contributions!
We put our configs here.
You can start training on a single GPU with at least 33-34 GB of memory. If you load the model in bf16 or fp16, the requirement should drop to roughly 16-20 GB, I think. We haven't tried training a model purely in bf16 or fp16, though (we actually use mixed precision).
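As a rough sanity check on those numbers, here is a minimal sketch of the arithmetic for the weights alone. It assumes about 9e9 parameters, which is only inferred from the "9B" in the checkpoint name, and it ignores gradients, optimizer state and activations, so treat the result as a lower bound rather than a measured requirement.

```python
# Back-of-the-envelope estimate of the memory needed just to hold the weights.
# ASSUMPTION: ~9e9 parameters, inferred from the "9B" in the checkpoint name.
BYTES_PER_PARAM = {"fp32": 4, "fp16": 2, "bf16": 2}


def weight_memory_gib(num_params: float, dtype: str) -> float:
    """Return the size of the weights alone, in GiB, for the given dtype."""
    return num_params * BYTES_PER_PARAM[dtype] / 2**30


if __name__ == "__main__":
    for dtype in ("fp32", "fp16", "bf16"):
        print(f"{dtype}: {weight_memory_gib(9e9, dtype):.1f} GiB")
```

Under that assumption the fp32 weights come to about 33.5 GiB and the half-precision weights to about 16.8 GiB, which is in line with the figures quoted above.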
You can refer to the following code to load the model in a given precision.
import argparse

import torch

# Relative imports: run this from inside the package that contains these modules.
from .configuration_flamingo import FlamingoConfig
from .modeling_flamingo import FlamingoForConditionalGeneration

parser = argparse.ArgumentParser(description="Load model with precision")
parser.add_argument("--load_bit", type=str, choices=["fp16", "bf16"], required=True, help="Choose either 'fp16' or 'bf16'")
args = parser.parse_args()
load_bit = args.load_bit

# Map the command-line choice to the dtype passed to from_pretrained.
if load_bit == "fp16":
    precision = {"torch_dtype": torch.float16}
elif load_bit == "bf16":
    precision = {"torch_dtype": torch.bfloat16}

checkpoint_path = "luodian/OTTER-9B-INIT"
model = FlamingoForConditionalGeneration.from_pretrained(checkpoint_path, device_map="auto", **precision)

# Save the model under a precision-specific name.
checkpoint_path = checkpoint_path + f"_{load_bit}"
model.save_pretrained(checkpoint_path)
Got it. Thanks a lot!
Hi Bo, I tried the following script, but I got an error. I want to train the model using the LA dataset, and I added image_processor and train_num_samples back to the args. The command I used is
accelerate launch --config_file=./pipeline/accelerate_configs/accelerate_config_fsdp.yaml \
pipeline/train/instruction_following.py \
--pretrained_model_name_or_path=/path/to/OTTER-9B-INIT \
--mimicit_path="/path/to/Otter/OneDrive_1_6-27-2023/LA/LACR_I2I_instructions.json,/path/to/Otter/OneDrive_1_6-27-2023/LA/LACR_T2T_instructions.json,/path/to/Otter/OneDrive_1_6-27-2023/LA/LACONV_instructions.json,/path/to/Otter/OneDrive_1_6-27-2023/LA/LADD_instructions.json" \
--images_path="/path/to/Otter/mimic-it/convert-it/output/LA.json,/path/to/Otter/mimic-it/convert-it/output/LA.json,/path/to/Otter/mimic-it/convert-it/output/LA.json,/path/to/Otter/mimic-it/convert-it/output/LA.json" \
--train_config_path="/path/to/Otter/OneDrive_1_6-27-2023/LA/LACR_I2I_train.json,/path/to/Otter/OneDrive_1_6-27-2023/LA/LACR_T2T_train.json,/path/to/Otter/OneDrive_1_6-27-2023/LA/LACONV_train.json,/path/to/Otter/OneDrive_1_6-27-2023/LA/LADD_train.json" \
--batch_size=16 \
--num_epoch=6 \
--external_save_dir=./checkpoints \
--save_hf_model \
--workers=8 \
--cross_attn_every_n_layers=4 \
--lr_schedule=cosine \
--learning_rate=1e-5 \
--warmup_steps_ratio=0.01
The error I got is
Traceback (most recent call last):
  File "/path/to/Otter/pipeline/train/instruction_following.py", line 498, in <module>
    main()
  File "/path/to/Otter/pipeline/train/instruction_following.py", line 451, in main
    model, optimizer, lr_scheduler, mimicit_loaders = accelerator.prepare(model, optimizer, lr_scheduler, mimicit_loaders)
  File "/path/to/miniconda/envs/otter/lib/python3.9/site-packages/accelerate/accelerator.py", line 1182, in prepare
    result = tuple(
  File "/path/to/miniconda/envs/otter/lib/python3.9/site-packages/accelerate/accelerator.py", line 1183, in <genexpr>
    self._prepare_one(obj, first_pass=True, device_placement=d) for obj, d in zip(args, device_placement)
  File "/path/to/miniconda/envs/otter/lib/python3.9/site-packages/accelerate/accelerator.py", line 1022, in _prepare_one
    return self.prepare_model(obj, device_placement=device_placement)
  File "/path/to/miniconda/envs/otter/lib/python3.9/site-packages/accelerate/accelerator.py", line 1311, in prepare_model
    torch.autocast(device_type=self.device.type, dtype=torch.bfloat16)(model.forward.__func__), model
AttributeError: 'function' object has no attribute '__func__'
Could you please give me some suggestions? I am very new to the accelerate package.
The GPU I used is an A100, and I think it is not a GPU issue.
Thanks!
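Before launching a command with several comma-separated path arguments like the one above, a quick pre-flight check can rule out path mistakes. The sketch below is a hypothetical helper, not part of the Otter code base. It assumes the training script pairs the entries of --mimicit_path, --images_path and --train_config_path by position, which this thread does not confirm.

```python
import os


def check_path_lists(**comma_separated: str) -> list[str]:
    """Return human-readable problems found in comma-separated path arguments.

    ASSUMPTION: the arguments are paired by position, so every list should
    have the same number of entries. This helper is illustrative only.
    """
    problems = []
    split = {
        name: [p.strip() for p in value.split(",")]
        for name, value in comma_separated.items()
    }
    lengths = {name: len(paths) for name, paths in split.items()}
    if len(set(lengths.values())) > 1:
        problems.append(f"list lengths differ: {lengths}")
    for name, paths in split.items():
        for path in paths:
            if not os.path.exists(path):
                problems.append(f"{name}: missing file {path}")
    return problems


if __name__ == "__main__":
    # Placeholder paths; substitute the real values from the launch command.
    for problem in check_path_lists(
        mimicit_path="/path/to/a_instructions.json,/path/to/b_instructions.json",
        images_path="/path/to/LA.json,/path/to/LA.json",
    ):
        print(problem)
```

An empty result only means the paths exist and the lists line up. It says nothing about the contents of the files.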
Hi, may I know whether the packages are installed properly? Which versions of accelerate, transformers and torch are you using?
If the issue still exists, you could switch to the yhzhang/dev_laion400m branch and try running the same script. This branch has the latest changes to our code.
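For anyone who needs to report these versions, a small sketch using only the standard library (`importlib.metadata`, available from Python 3.8) is shown below. The function name `installed_versions` is just an illustrative choice.

```python
from importlib import metadata


def installed_versions(packages):
    """Map each package name to its installed version, or 'not installed'."""
    versions = {}
    for name in packages:
        try:
            versions[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            versions[name] = "not installed"
    return versions


if __name__ == "__main__":
    report = installed_versions(["transformers", "accelerate", "torch"])
    for name, version in report.items():
        print(f"{name}: {version}")
```

Running it inside the same environment that launches training matters, since a different conda environment may hold different versions.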
Yes. My software versions are as follows.
transformers: 4.30.2
accelerate: 0.20.3
torch: 2.0.1
Sure. I can try the latest branch.
Thanks!
I also encountered this problem and updating the accelerate package to the GitHub version worked for me (see this PR https://github.com/huggingface/accelerate/pull/1637).
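For context on the error itself: in Python, only bound methods carry a `__func__` attribute. If `forward` has been replaced on the model instance by a plain function, `model.forward.__func__` raises exactly this AttributeError. The sketch below reproduces that with a toy class. Why Otter's `forward` ends up as a plain function is not established in this thread, and the `unwrap_forward` helper is a hypothetical defensive pattern, not necessarily what the linked PR does.

```python
class TinyModel:
    """Toy stand-in for a model; not related to the real Otter classes."""

    def forward(self, x):
        return x + 1


def wrapped_forward(x):
    """A plain function that will shadow the class-level forward method."""
    return x + 1


def unwrap_forward(forward):
    """Return the underlying function of a bound method, or the callable itself.

    HYPOTHETICAL helper: falls back to the object when __func__ is absent.
    """
    return getattr(forward, "__func__", forward)


model = TinyModel()
bound_has_func = hasattr(model.forward, "__func__")  # bound method

# Assigning to the instance stores a plain function, not a bound method.
model.forward = wrapped_forward
plain_has_func = hasattr(model.forward, "__func__")

if __name__ == "__main__":
    print("bound method has __func__:", bound_has_func)
    print("plain function has __func__:", plain_has_func)
```

Here `unwrap_forward` works for both cases, whereas direct access to `__func__` would fail on the instance-level function.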
I've run into this problem too. Could anyone give me a hand?
Please see the post above first.
Thanks very much! I implemented the loading code above and trained in fp16, and I also cast the data to torch.float16, but I hit the error below:
Has anyone figured it out?