Support GPU Training
This PR supports training Cambrian-8b on GPU with deepspeed zero2. Main modification includes
- We remove all the
.float()that satisfies TPU's precision and use bf16 unifiedly. - We fix the llama3 chat template bug and tokenizer bug which adds additional bos token
- We revert checkpoint loading and resuming impl to the default Huggingface Trainer impl
- We optimize the loading of data and model
@TideDra @flyskywalkerlby thanks so much for your contribution!
I'll look through the code and add any comments/suggestions. We will also have to fire off some TPU training runs to verify that nothing impacts our TPU training before merging.
No problem. We changed this path to verify the experimental results.
发自我的iPhone
------------------ Original ------------------ From: Ellis Brown @.> Date: Mon, Jul 29, 2024 9:29 PM To: cambrian-mllm/cambrian @.> Cc: Boyang Liu @.>, Mention @.> Subject: Re: [cambrian-mllm/cambrian] Support GPU Training (PR #64)
@ellisbrown commented on this pull request.
In inference.py:
> -model_path = os.path.expanduser("nyu-visionx/cambrian-8b") +model_path = os.path.expanduser("./checkpoints/cambrian-8b-finetune")
let's not change this to preserve the default behavior?
— Reply to this email directly, view it on GitHub, or unsubscribe. You are receiving this because you were mentioned.Message ID: @.***>
Hi @TideDra , thanks for sharing the code for GPU training. Do you also have the code for evaluating Cambrian-1? Or is it straightforward to modify the scripts from the LLaVA codebase? Thanks!
Hi @TideDra , thanks for sharing the code for GPU training. Do you also have the code for evaluating Cambrian-1? Or is it straightforward to modify the scripts from the LLaVA codebase? Thanks!
VLMEvalKit supports to evaluate cambrian.
Hi @TideDra , thanks for sharing the code for GPU training. Do you also have the code for evaluating Cambrian-1? Or is it straightforward to modify the scripts from the LLaVA codebase? Thanks!
VLMEvalKit supports to evaluate cambrian.
Got it, thanks!
Is there a reason why the checkpoint saving uses torch.save()? It seems that the full model weights are stored per rank instead of the sharded model weights, so the overall size of the checkpoints is huge
Hi @TideDra , thanks for sharing the code for GPU training. Do you also have the code for evaluating Cambrian-1? Or is it straightforward to modify the scripts from the LLaVA codebase? Thanks!
VLMEvalKit supports to evaluate cambrian.
It seems to me there are some issues with the VLMEvalKit codebase when evaluating LLaVA on TextVQA. With the released LLaVA or models trained with this cambrian+gpu code, I couldn't reproduce the results reproduced in the LLaVA v1.5 paper. Not sure what's the difference between the evaluations but we probably need to modify the evaluation code from LLaVA to reproduce exact results?
Hi @TideDra , thanks for sharing the code for GPU training. Do you also have the code for evaluating Cambrian-1? Or is it straightforward to modify the scripts from the LLaVA codebase? Thanks!
VLMEvalKit supports to evaluate cambrian.
It seems to me there are some issues with the VLMEvalKit codebase when evaluating LLaVA on TextVQA. With the released LLaVA or models trained with this cambrian+gpu code, I couldn't reproduce the results reproduced in the LLaVA v1.5 paper. Not sure what's the difference between the evaluations but we probably need to modify the evaluation code from LLaVA to reproduce exact results?
@wufeim note that we have released our eval code here https://github.com/cambrian-mllm/cambrian/tree/main/eval
@wufeim note that we have released our eval code here https://github.com/cambrian-mllm/cambrian/tree/main/eval
Oh I see it now. Thanks so much! I will check it out.
I was looking at the documentation here and thought they were not out yet. Maybe update the link in the README?
@wufeim note that we have released our eval code here https://github.com/cambrian-mllm/cambrian/tree/main/eval
Hi @ellisbrown , quick questions on the evaluation code:
- It seems that
eval/requirements.txtis missing. I guess mainly thedatasetspackage? - When I was evaluating
cambrian-8bwith the following command, all four GPUs are evaluating on the whole TextVQA instead of one of the four subparts. Is this correct? Or am I using a wrong command?bash scripts/run_benchmark.sh --benchmark textvqa --ckpt nyu-visionx/cambrian-8b --conv_mode llama_3
+1 the eval/requirements.txt is missing. It'd be nice to know if a specific version of datasets is needed
@wufeim @dfan sorry the requirements was masked by gitignore. added in #82
2. When I was evaluating
cambrian-8bwith the following command, all four GPUs are evaluating on the whole TextVQA instead of one of the four subparts. Is this correct? Or am I using a wrong command?
@wufeim have a read through run_benchmark.sh. the questions are chunked and each gpu handles one chunk.
let's move further discussion unrelated to this GPU training PR #64 to separate issues please.
Hi @TideDra,
I'm trying out the GPU training code. I see that you used zero2 for both pretraining and finetuning. Meanwhile LLaVA used zero2 for pretraining and zero3 for finetuning. I am not an expert with deepspeed but I did encounter some issues with zero3, possibly related to this. Did you have similar issues? Or how did you decide on zero 2/3?
Thanks!
Hi @TideDra,
I'm trying out the GPU training code. I see that you used zero2 for both pretraining and finetuning. Meanwhile LLaVA used zero2 for pretraining and zero3 for finetuning. I am not an expert with deepspeed but I did encounter some issues with zero3, possibly related to this. Did you have similar issues? Or how did you decide on zero 2/3?
Thanks!
In general, zero3 reduces GPU memory usage while increases training time, compared with zero2, but does not affect model perfomance theoretically. So zero2 is preferred if memory is enough. I didn't try zero3 so I didn't encounter your issue :). But actually, zero3 indeed has more bugs than zero2 in practice.
OOM when finetuning Cambrian in the 80GB A100 cluster even with a batch size of 1
Description: I am encountering an Out-of-Memory (OOM) error while attempting to fine-tune the Cambrian model using A100 GPUs. Despite setting the batch size and gradient accumulation steps to 1, the issue persists.
Environment:
Model: Cambrian GPU: A100 DeepSpeed Version: 0.14.4 CUDA Version: 12.1 PyTorch Version: 2.3.1
Python Packages: accelerate 0.32.1 aiofiles 23.2.1 aiohappyeyeballs 2.4.3 aiohttp 3.10.10 aiosignal 1.3.1 altair 5.4.1 annotated-types 0.7.0 anyio 4.6.2.post1 asttokens 2.4.1 async-timeout 4.0.3 attrs 24.2.0 bitsandbytes 0.43.1 cachetools 5.5.0 cambrian 1.0.0 certifi 2024.8.30 charset-normalizer 3.4.0 click 8.1.7 colorama 0.4.6 coloredlogs 15.0.1 contourpy 1.3.0 cos-python-sdk-v5 1.9.32 crcmod 1.7 cycler 0.12.1 decorator 5.1.1 deepspeed 0.14.4 diffusers 0.31.0 docker-pycreds 0.4.0 einops 0.8.0 einops-exts 0.0.4 exceptiongroup 1.2.2 executing 2.1.0 EzColorLog 1.0.3 fastapi 0.115.4 ffmpy 0.4.0 filelock 3.16.1 flash-attn 2.6.3 fonttools 4.54.1 frozenlist 1.5.0 fsspec 2024.10.0 ftfy 6.3.1 gcsfs 2024.10.0 gitdb 4.0.11 GitPython 3.1.43 google-api-core 2.22.0 google-auth 2.36.0 google-auth-oauthlib 1.2.1 google-cloud-core 2.4.1 google-cloud-storage 2.18.2 google-crc32c 1.6.0 google-resumable-media 2.7.2 googleapis-common-protos 1.65.0 gradio 4.16.0 gradio_client 0.8.1 h11 0.14.0 hf_transfer 0.1.8 hjson 3.1.0 httpcore 0.17.3 httpx 0.24.0 huggingface-hub 0.26.2 humanfriendly 10.0 idna 3.10 importlib_metadata 8.5.0 importlib_resources 6.4.5 ipython 8.29.0 jedi 0.19.1 Jinja2 3.1.4 joblib 1.4.2 jsonschema 4.23.0 jsonschema-specifications 2024.10.1 kiwisolver 1.4.7 latex2mathml 3.77.0 markdown-it-py 3.0.0 markdown2 2.5.1 MarkupSafe 2.1.5 matplotlib 3.9.2 matplotlib-inline 0.1.7 mdurl 0.1.2 mpmath 1.3.0 multidict 6.1.0 narwhals 1.13.2 networkx 3.4.2 ninja 1.11.1.1 numpy 1.26.4 nvidia-cublas-cu12 12.1.3.1 nvidia-cuda-cupti-cu12 12.1.105 nvidia-cuda-nvrtc-cu12 12.1.105 nvidia-cuda-runtime-cu12 12.1.105 nvidia-cudnn-cu12 8.9.2.26 nvidia-cufft-cu12 11.0.2.54 nvidia-curand-cu12 10.3.2.106 nvidia-cusolver-cu12 11.4.5.107 nvidia-cusparse-cu12 12.1.0.106 nvidia-ml-py 12.560.30 nvidia-nccl-cu12 2.20.5 nvidia-nvjitlink-cu12 12.6.77 nvidia-nvtx-cu12 12.1.105 oauthlib 3.2.2 open_clip_torch 2.29.0 orjson 3.10.11 packaging 24.1 pandas 2.2.3 parso 0.8.4 peewee 3.17.7 peft 0.11.1 pexpect 4.9.0 pillow 10.4.0 pip 24.2 platformdirs 4.3.6 prompt_toolkit 3.0.48 propcache 0.2.0 proto-plus 1.25.0 protobuf 5.28.3 psutil 6.1.0 ptyprocess 0.7.0 pure_eval 0.2.3 py-cpuinfo 9.0.0 pyasn1 0.6.1 pyasn1_modules 0.4.1 pycryptodome 3.21.0 pydantic 2.9.2 pydantic_core 2.23.4 pydub 0.25.1 Pygments 2.18.0 pynvml 11.5.3 pyparsing 3.2.0 python-dateutil 2.9.0.post0 python-multipart 0.0.17 pytz 2024.2 PyYAML 6.0.2 referencing 0.35.1 regex 2024.11.6 requests 2.32.3 requests-oauthlib 2.0.0 rich 13.9.4 rpds-py 0.21.0 rsa 4.9 ruff 0.7.2 safetensors 0.4.5 scikit-learn 1.5.1 scipy 1.14.1 semantic-version 2.10.0 sentencepiece 0.2.0 sentry-sdk 2.18.0 setproctitle 1.3.3 setuptools 75.1.0 shellingham 1.5.4 shortuuid 1.0.13 six 1.16.0 smmap 5.0.1 sniffio 1.3.1 stack-data 0.6.3 starlette 0.41.2 svgwrite 1.4.3 swanboard 0.1.4b2 swankit 0.1.1b3 swanlab 0.3.23 sympy 1.13.3 threadpoolctl 3.5.0 timm 1.0.7 tokenizers 0.19.1 tomlkit 0.12.0 torch 2.3.1 torchtext 0.18.0 torchvision 0.18.1 tqdm 4.67.0 traitlets 5.14.3 transformers 4.42.4 triton 2.3.1 typer 0.12.5 typing_extensions 4.12.2 tzdata 2024.2 ujson 5.10.0 urllib3 2.2.3 uvicorn 0.32.0 wandb 0.18.6 wavedrom 2.0.3.post3 wcwidth 0.2.13 websockets 11.0.3 wheel 0.44.0 xmltodict 0.14.2 yarl 1.17.1 zipp 3.20.2
Configuration:
DeepSpeed Config: Using ZeRO stage 2 (attach zero2.json if possible) Batch Size: 1 Gradient Accumulation Steps: 1 Mixed Precision: bf16
Steps to Reproduce:
Run the training script with the following command:
deepspeed
--num_nodes $SLURM_JOB_NUM_NODES
--num_gpus $SLURM_GPUS_PER_NODE
--master_addr localhost
--master_port 12345
--hostfile hostfile_temp
--no_ssh_check
cambrian/train/train_gpu.py
--deepspeed ./scripts/zero2.json
--model_name_or_path $ROOT_DIR/LLM/llama3-llava-next-8b
--version llama_v3
--data_path "$ROOT_DIR/cambrian/datasets--nyu-visionx--Cambrian-10M/jsons/Cambrian150K_withsystemprompt.jsonl"
--image_folder "$ROOT_DIR/cambrian/datasets--nyu-visionx--Cambrian-10M/"
--pretrain_mm_mlp_adapter "$ROOT_DIR/cambrian-models/models--nyu-visionx--cambrian-8b_projector/mm_projector.bin"
--vision_tower_aux_list '["siglip/CLIP-ViT-SO400M-14-384", "openai/clip-vit-large-patch14-336", "facebook/dinov2-giant-res378", "clip-convnext-XXL-multi-stage"]'
--vision_tower_aux_token_len_list '[576, 576, 576, 9216]'
--image_token_len 576
--num_query_group 1
--query_num_list '[576]'
--connector_depth 3
--image_position 91
--vision_hidden_size 1024
--connector_only False
--num_of_vision_sampler_layers 10
--start_of_vision_sampler_layers 0
--stride_of_vision_sampler_layers 3
--mm_projector_type sva
--unfreeze_mm_vision_tower False
--mm_vision_select_layer -2
--mm_use_im_start_end False
--mm_use_im_patch_token False
--image_aspect_ratio pad
--group_by_modality_length True
--bf16 True
--output_dir $CKPT_DIR
--num_train_epochs 1
--per_device_train_batch_size 1
--per_device_eval_batch_size 4
--gradient_accumulation_steps 1
--evaluation_strategy "no"
--save_strategy "steps"
--save_steps 500
--save_total_limit 5
--learning_rate 4e-5
--weight_decay 0.
--warmup_ratio 0.03
--lr_scheduler_type "cosine"
--logging_steps 1
--tf32 True
--model_max_length 2048
--gradient_checkpointing True
--dataloader_num_workers 4
--lazy_preprocess True
--run_name $CKPT_NAME
--report_to wandb
Behavior: The training process fails due to OOM errors, even with minimal batch size and gradient accumulation settings.
OOM when finetuning Cambrian in the 80GB A100 cluster even with a batch size of 1
Description: I am encountering an Out-of-Memory (OOM) error while attempting to fine-tune the Cambrian model using A100 GPUs. Despite setting the batch size and gradient accumulation steps to 1, the issue persists.
Environment:
Model: Cambrian GPU: A100 DeepSpeed Version: 0.14.4 CUDA Version: 12.1 PyTorch Version: 2.3.1
Python Packages: accelerate 0.32.1 aiofiles 23.2.1 aiohappyeyeballs 2.4.3 aiohttp 3.10.10 aiosignal 1.3.1 altair 5.4.1 annotated-types 0.7.0 anyio 4.6.2.post1 asttokens 2.4.1 async-timeout 4.0.3 attrs 24.2.0 bitsandbytes 0.43.1 cachetools 5.5.0 cambrian 1.0.0 certifi 2024.8.30 charset-normalizer 3.4.0 click 8.1.7 colorama 0.4.6 coloredlogs 15.0.1 contourpy 1.3.0 cos-python-sdk-v5 1.9.32 crcmod 1.7 cycler 0.12.1 decorator 5.1.1 deepspeed 0.14.4 diffusers 0.31.0 docker-pycreds 0.4.0 einops 0.8.0 einops-exts 0.0.4 exceptiongroup 1.2.2 executing 2.1.0 EzColorLog 1.0.3 fastapi 0.115.4 ffmpy 0.4.0 filelock 3.16.1 flash-attn 2.6.3 fonttools 4.54.1 frozenlist 1.5.0 fsspec 2024.10.0 ftfy 6.3.1 gcsfs 2024.10.0 gitdb 4.0.11 GitPython 3.1.43 google-api-core 2.22.0 google-auth 2.36.0 google-auth-oauthlib 1.2.1 google-cloud-core 2.4.1 google-cloud-storage 2.18.2 google-crc32c 1.6.0 google-resumable-media 2.7.2 googleapis-common-protos 1.65.0 gradio 4.16.0 gradio_client 0.8.1 h11 0.14.0 hf_transfer 0.1.8 hjson 3.1.0 httpcore 0.17.3 httpx 0.24.0 huggingface-hub 0.26.2 humanfriendly 10.0 idna 3.10 importlib_metadata 8.5.0 importlib_resources 6.4.5 ipython 8.29.0 jedi 0.19.1 Jinja2 3.1.4 joblib 1.4.2 jsonschema 4.23.0 jsonschema-specifications 2024.10.1 kiwisolver 1.4.7 latex2mathml 3.77.0 markdown-it-py 3.0.0 markdown2 2.5.1 MarkupSafe 2.1.5 matplotlib 3.9.2 matplotlib-inline 0.1.7 mdurl 0.1.2 mpmath 1.3.0 multidict 6.1.0 narwhals 1.13.2 networkx 3.4.2 ninja 1.11.1.1 numpy 1.26.4 nvidia-cublas-cu12 12.1.3.1 nvidia-cuda-cupti-cu12 12.1.105 nvidia-cuda-nvrtc-cu12 12.1.105 nvidia-cuda-runtime-cu12 12.1.105 nvidia-cudnn-cu12 8.9.2.26 nvidia-cufft-cu12 11.0.2.54 nvidia-curand-cu12 10.3.2.106 nvidia-cusolver-cu12 11.4.5.107 nvidia-cusparse-cu12 12.1.0.106 nvidia-ml-py 12.560.30 nvidia-nccl-cu12 2.20.5 nvidia-nvjitlink-cu12 12.6.77 nvidia-nvtx-cu12 12.1.105 oauthlib 3.2.2 open_clip_torch 2.29.0 orjson 3.10.11 packaging 24.1 pandas 2.2.3 parso 0.8.4 peewee 3.17.7 peft 0.11.1 pexpect 4.9.0 pillow 10.4.0 pip 24.2 platformdirs 4.3.6 prompt_toolkit 3.0.48 propcache 0.2.0 proto-plus 1.25.0 protobuf 5.28.3 psutil 6.1.0 ptyprocess 0.7.0 pure_eval 0.2.3 py-cpuinfo 9.0.0 pyasn1 0.6.1 pyasn1_modules 0.4.1 pycryptodome 3.21.0 pydantic 2.9.2 pydantic_core 2.23.4 pydub 0.25.1 Pygments 2.18.0 pynvml 11.5.3 pyparsing 3.2.0 python-dateutil 2.9.0.post0 python-multipart 0.0.17 pytz 2024.2 PyYAML 6.0.2 referencing 0.35.1 regex 2024.11.6 requests 2.32.3 requests-oauthlib 2.0.0 rich 13.9.4 rpds-py 0.21.0 rsa 4.9 ruff 0.7.2 safetensors 0.4.5 scikit-learn 1.5.1 scipy 1.14.1 semantic-version 2.10.0 sentencepiece 0.2.0 sentry-sdk 2.18.0 setproctitle 1.3.3 setuptools 75.1.0 shellingham 1.5.4 shortuuid 1.0.13 six 1.16.0 smmap 5.0.1 sniffio 1.3.1 stack-data 0.6.3 starlette 0.41.2 svgwrite 1.4.3 swanboard 0.1.4b2 swankit 0.1.1b3 swanlab 0.3.23 sympy 1.13.3 threadpoolctl 3.5.0 timm 1.0.7 tokenizers 0.19.1 tomlkit 0.12.0 torch 2.3.1 torchtext 0.18.0 torchvision 0.18.1 tqdm 4.67.0 traitlets 5.14.3 transformers 4.42.4 triton 2.3.1 typer 0.12.5 typing_extensions 4.12.2 tzdata 2024.2 ujson 5.10.0 urllib3 2.2.3 uvicorn 0.32.0 wandb 0.18.6 wavedrom 2.0.3.post3 wcwidth 0.2.13 websockets 11.0.3 wheel 0.44.0 xmltodict 0.14.2 yarl 1.17.1 zipp 3.20.2
Configuration:
DeepSpeed Config: Using ZeRO stage 2 (attach zero2.json if possible) Batch Size: 1 Gradient Accumulation Steps: 1 Mixed Precision: bf16
Steps to Reproduce:
Run the training script with the following command:
deepspeed --num_nodes $SLURM_JOB_NUM_NODES --num_gpus $SLURM_GPUS_PER_NODE --master_addr localhost --master_port 12345 --hostfile hostfile_temp --no_ssh_check cambrian/train/train_gpu.py --deepspeed ./scripts/zero2.json --model_name_or_path $ROOT_DIR/LLM/llama3-llava-next-8b --version llama_v3 --data_path "$ROOT_DIR/cambrian/datasets--nyu-visionx--Cambrian-10M/jsons/Cambrian150K_withsystemprompt.jsonl" --image_folder "$ROOT_DIR/cambrian/datasets--nyu-visionx--Cambrian-10M/" --pretrain_mm_mlp_adapter "$ROOT_DIR/cambrian-models/models--nyu-visionx--cambrian-8b_projector/mm_projector.bin" --vision_tower_aux_list '["siglip/CLIP-ViT-SO400M-14-384", "openai/clip-vit-large-patch14-336", "facebook/dinov2-giant-res378", "clip-convnext-XXL-multi-stage"]' --vision_tower_aux_token_len_list '[576, 576, 576, 9216]' --image_token_len 576 --num_query_group 1 --query_num_list '[576]' --connector_depth 3 --image_position 91 --vision_hidden_size 1024 --connector_only False --num_of_vision_sampler_layers 10 --start_of_vision_sampler_layers 0 --stride_of_vision_sampler_layers 3 --mm_projector_type sva --unfreeze_mm_vision_tower False --mm_vision_select_layer -2 --mm_use_im_start_end False --mm_use_im_patch_token False --image_aspect_ratio pad --group_by_modality_length True --bf16 True --output_dir $CKPT_DIR --num_train_epochs 1 --per_device_train_batch_size 1 --per_device_eval_batch_size 4 --gradient_accumulation_steps 1 --evaluation_strategy "no" --save_strategy "steps" --save_steps 500 --save_total_limit 5 --learning_rate 4e-5 --weight_decay 0. --warmup_ratio 0.03 --lr_scheduler_type "cosine" --logging_steps 1 --tf32 True --model_max_length 2048 --gradient_checkpointing True --dataloader_num_workers 4 --lazy_preprocess True --run_name $CKPT_NAME --report_to wandb
Behavior: The training process fails due to OOM errors, even with minimal batch size and gradient accumulation settings.
![]()
How many gpu do you use? Besides we only test with vicuna-7b.
OOM when finetuning Cambrian in the 80GB A100 cluster even with a batch size of 1 Description: I am encountering an Out-of-Memory (OOM) error while attempting to fine-tune the Cambrian model using A100 GPUs. Despite setting the batch size and gradient accumulation steps to 1, the issue persists. Environment: Model: Cambrian GPU: A100 DeepSpeed Version: 0.14.4 CUDA Version: 12.1 PyTorch Version: 2.3.1 Python Packages: accelerate 0.32.1 aiofiles 23.2.1 aiohappyeyeballs 2.4.3 aiohttp 3.10.10 aiosignal 1.3.1 altair 5.4.1 annotated-types 0.7.0 anyio 4.6.2.post1 asttokens 2.4.1 async-timeout 4.0.3 attrs 24.2.0 bitsandbytes 0.43.1 cachetools 5.5.0 cambrian 1.0.0 certifi 2024.8.30 charset-normalizer 3.4.0 click 8.1.7 colorama 0.4.6 coloredlogs 15.0.1 contourpy 1.3.0 cos-python-sdk-v5 1.9.32 crcmod 1.7 cycler 0.12.1 decorator 5.1.1 deepspeed 0.14.4 diffusers 0.31.0 docker-pycreds 0.4.0 einops 0.8.0 einops-exts 0.0.4 exceptiongroup 1.2.2 executing 2.1.0 EzColorLog 1.0.3 fastapi 0.115.4 ffmpy 0.4.0 filelock 3.16.1 flash-attn 2.6.3 fonttools 4.54.1 frozenlist 1.5.0 fsspec 2024.10.0 ftfy 6.3.1 gcsfs 2024.10.0 gitdb 4.0.11 GitPython 3.1.43 google-api-core 2.22.0 google-auth 2.36.0 google-auth-oauthlib 1.2.1 google-cloud-core 2.4.1 google-cloud-storage 2.18.2 google-crc32c 1.6.0 google-resumable-media 2.7.2 googleapis-common-protos 1.65.0 gradio 4.16.0 gradio_client 0.8.1 h11 0.14.0 hf_transfer 0.1.8 hjson 3.1.0 httpcore 0.17.3 httpx 0.24.0 huggingface-hub 0.26.2 humanfriendly 10.0 idna 3.10 importlib_metadata 8.5.0 importlib_resources 6.4.5 ipython 8.29.0 jedi 0.19.1 Jinja2 3.1.4 joblib 1.4.2 jsonschema 4.23.0 jsonschema-specifications 2024.10.1 kiwisolver 1.4.7 latex2mathml 3.77.0 markdown-it-py 3.0.0 markdown2 2.5.1 MarkupSafe 2.1.5 matplotlib 3.9.2 matplotlib-inline 0.1.7 mdurl 0.1.2 mpmath 1.3.0 multidict 6.1.0 narwhals 1.13.2 networkx 3.4.2 ninja 1.11.1.1 numpy 1.26.4 nvidia-cublas-cu12 12.1.3.1 nvidia-cuda-cupti-cu12 12.1.105 nvidia-cuda-nvrtc-cu12 12.1.105 nvidia-cuda-runtime-cu12 12.1.105 nvidia-cudnn-cu12 8.9.2.26 nvidia-cufft-cu12 11.0.2.54 nvidia-curand-cu12 10.3.2.106 nvidia-cusolver-cu12 11.4.5.107 nvidia-cusparse-cu12 12.1.0.106 nvidia-ml-py 12.560.30 nvidia-nccl-cu12 2.20.5 nvidia-nvjitlink-cu12 12.6.77 nvidia-nvtx-cu12 12.1.105 oauthlib 3.2.2 open_clip_torch 2.29.0 orjson 3.10.11 packaging 24.1 pandas 2.2.3 parso 0.8.4 peewee 3.17.7 peft 0.11.1 pexpect 4.9.0 pillow 10.4.0 pip 24.2 platformdirs 4.3.6 prompt_toolkit 3.0.48 propcache 0.2.0 proto-plus 1.25.0 protobuf 5.28.3 psutil 6.1.0 ptyprocess 0.7.0 pure_eval 0.2.3 py-cpuinfo 9.0.0 pyasn1 0.6.1 pyasn1_modules 0.4.1 pycryptodome 3.21.0 pydantic 2.9.2 pydantic_core 2.23.4 pydub 0.25.1 Pygments 2.18.0 pynvml 11.5.3 pyparsing 3.2.0 python-dateutil 2.9.0.post0 python-multipart 0.0.17 pytz 2024.2 PyYAML 6.0.2 referencing 0.35.1 regex 2024.11.6 requests 2.32.3 requests-oauthlib 2.0.0 rich 13.9.4 rpds-py 0.21.0 rsa 4.9 ruff 0.7.2 safetensors 0.4.5 scikit-learn 1.5.1 scipy 1.14.1 semantic-version 2.10.0 sentencepiece 0.2.0 sentry-sdk 2.18.0 setproctitle 1.3.3 setuptools 75.1.0 shellingham 1.5.4 shortuuid 1.0.13 six 1.16.0 smmap 5.0.1 sniffio 1.3.1 stack-data 0.6.3 starlette 0.41.2 svgwrite 1.4.3 swanboard 0.1.4b2 swankit 0.1.1b3 swanlab 0.3.23 sympy 1.13.3 threadpoolctl 3.5.0 timm 1.0.7 tokenizers 0.19.1 tomlkit 0.12.0 torch 2.3.1 torchtext 0.18.0 torchvision 0.18.1 tqdm 4.67.0 traitlets 5.14.3 transformers 4.42.4 triton 2.3.1 typer 0.12.5 typing_extensions 4.12.2 tzdata 2024.2 ujson 5.10.0 urllib3 2.2.3 uvicorn 0.32.0 wandb 0.18.6 wavedrom 2.0.3.post3 wcwidth 0.2.13 websockets 11.0.3 wheel 0.44.0 xmltodict 0.14.2 yarl 1.17.1 zipp 3.20.2 Configuration: DeepSpeed Config: Using ZeRO stage 2 (attach zero2.json if possible) Batch Size: 1 Gradient Accumulation Steps: 1 Mixed Precision: bf16 Steps to Reproduce: Run the training script with the following command: deepspeed --num_nodes $SLURM_JOB_NUM_NODES --num_gpus ROOT_DIR/LLM/llama3-llava-next-8b --version llama_v3 --data_path "$ROOT_DIR/cambrian/datasets--nyu-visionx--Cambrian-10M/jsons/Cambrian150K_withsystemprompt.jsonl" --image_folder "$ROOT_DIR/cambrian/datasets--nyu-visionx--Cambrian-10M/" --pretrain_mm_mlp_adapter "$ROOT_DIR/cambrian-models/models--nyu-visionx--cambrian-8b_projector/mm_projector.bin" --vision_tower_aux_list '["siglip/CLIP-ViT-SO400M-14-384", "openai/clip-vit-large-patch14-336", "facebook/dinov2-giant-res378", "clip-convnext-XXL-multi-stage"]' --vision_tower_aux_token_len_list '[576, 576, 576, 9216]' --image_token_len 576 --num_query_group 1 --query_num_list '[576]' --connector_depth 3 --image_position 91 --vision_hidden_size 1024 --connector_only False --num_of_vision_sampler_layers 10 --start_of_vision_sampler_layers 0 --stride_of_vision_sampler_layers 3 --mm_projector_type sva --unfreeze_mm_vision_tower False --mm_vision_select_layer -2 --mm_use_im_start_end False --mm_use_im_patch_token False --image_aspect_ratio pad --group_by_modality_length True --bf16 True --output_dir $CKPT_DIR --num_train_epochs 1 --per_device_train_batch_size 1 --per_device_eval_batch_size 4 --gradient_accumulation_steps 1 --evaluation_strategy "no" --save_strategy "steps" --save_steps 500 --save_total_limit 5 --learning_rate 4e-5 --weight_decay 0. --warmup_ratio 0.03 --lr_scheduler_type "cosine" --logging_steps 1 --tf32 True --model_max_length 2048 --gradient_checkpointing True --dataloader_num_workers 4 --lazy_preprocess True --run_name $CKPT_NAME --report_to wandb Behavior: The training process fails due to OOM errors, even with minimal batch size and gradient accumulation settings.
How many gpu do you use? Besides we only test with vicuna-7b.
2 A100 by now, for Llama3 8B.
But I only assign 1 sample per GPU. It really confused me. Could you give me any suggestions?
@nku-zhichengzhang it seems that 15.58G memories are reserved by pytorch but unallocated. You may follow the instruction given by the error, or try to clear cuda memories.
@nku-zhichengzhang it seems that 15.58G memories are reserved by pytorch but unallocated. You may follow the instruction given by the error, or try to clear cuda memories.
Loading the Llama3 model occupies 15GB of memory. BTW, how many GPUs did you use to train the Vicuna model?
@nku-zhichengzhang it seems that 15.58G memories are reserved by pytorch but unallocated. You may follow the instruction given by the error, or try to clear cuda memories.
Loading the Llama3 model occupies 15GB of memory. BTW, how many GPUs did you use to train the Vicuna model?
We use at least 8 gpus for pretraining and 32 gpus for finetuning. You may try zero3, which requires less memories?
@nku-zhichengzhang it seems that 15.58G memories are reserved by pytorch but unallocated. You may follow the instruction given by the error, or try to clear cuda memories.
Loading the Llama3 model occupies 15GB of memory. BTW, how many GPUs did you use to train the Vicuna model?
We use at least 8 gpus for pretraining and 32 gpus for finetuning. You may try zero3, which requires less memories?
Okay, thanks for the reply.
@TideDra Have you verified the performance of the model trained with your GPU setting?