cambrian Support GPU Training

This PR supports training Cambrian-8b on GPU with deepspeed zero2. Main modification includes

We remove all the .float() that satisfies TPU's precision and use bf16 unifiedly.
We fix the llama3 chat template bug and tokenizer bug which adds additional bos token
We revert checkpoint loading and resuming impl to the default Huggingface Trainer impl
We optimize the loading of data and model

Jul 26 '24 06:07 TideDra

@TideDra @flyskywalkerlby thanks so much for your contribution!

I'll look through the code and add any comments/suggestions. We will also have to fire off some TPU training runs to verify that nothing impacts our TPU training before merging.

Jul 29 '24 12:07 ellisbrown

No problem. We changed this path to verify the experimental results.

发自我的iPhone

------------------ Original ------------------ From: Ellis Brown @.> Date: Mon, Jul 29, 2024 9:29 PM To: cambrian-mllm/cambrian @.> Cc: Boyang Liu @.>, Mention @.> Subject: Re: [cambrian-mllm/cambrian] Support GPU Training (PR #64)

@ellisbrown commented on this pull request.

In inference.py: > -model_path = os.path.expanduser("nyu-visionx/cambrian-8b") +model_path = os.path.expanduser("./checkpoints/cambrian-8b-finetune")
let's not change this to preserve the default behavior?

— Reply to this email directly, view it on GitHub, or unsubscribe. You are receiving this because you were mentioned.Message ID: @.***>

Jul 29 '24 13:07 flyskywalkerlby

Hi @TideDra , thanks for sharing the code for GPU training. Do you also have the code for evaluating Cambrian-1? Or is it straightforward to modify the scripts from the LLaVA codebase? Thanks!

Oct 24 '24 09:10 wufeim

Hi @TideDra , thanks for sharing the code for GPU training. Do you also have the code for evaluating Cambrian-1? Or is it straightforward to modify the scripts from the LLaVA codebase? Thanks!

VLMEvalKit supports to evaluate cambrian.

Oct 24 '24 12:10 TideDra

Hi @TideDra , thanks for sharing the code for GPU training. Do you also have the code for evaluating Cambrian-1? Or is it straightforward to modify the scripts from the LLaVA codebase? Thanks!

VLMEvalKit supports to evaluate cambrian.

Got it, thanks!

Oct 24 '24 14:10 wufeim

Is there a reason why the checkpoint saving uses torch.save()? It seems that the full model weights are stored per rank instead of the sharded model weights, so the overall size of the checkpoints is huge

Oct 28 '24 17:10 dfan

Hi @TideDra , thanks for sharing the code for GPU training. Do you also have the code for evaluating Cambrian-1? Or is it straightforward to modify the scripts from the LLaVA codebase? Thanks!

VLMEvalKit supports to evaluate cambrian.

It seems to me there are some issues with the VLMEvalKit codebase when evaluating LLaVA on TextVQA. With the released LLaVA or models trained with this cambrian+gpu code, I couldn't reproduce the results reproduced in the LLaVA v1.5 paper. Not sure what's the difference between the evaluations but we probably need to modify the evaluation code from LLaVA to reproduce exact results?

Oct 28 '24 18:10 wufeim

Hi @TideDra , thanks for sharing the code for GPU training. Do you also have the code for evaluating Cambrian-1? Or is it straightforward to modify the scripts from the LLaVA codebase? Thanks!

VLMEvalKit supports to evaluate cambrian.

It seems to me there are some issues with the VLMEvalKit codebase when evaluating LLaVA on TextVQA. With the released LLaVA or models trained with this cambrian+gpu code, I couldn't reproduce the results reproduced in the LLaVA v1.5 paper. Not sure what's the difference between the evaluations but we probably need to modify the evaluation code from LLaVA to reproduce exact results?

@wufeim note that we have released our eval code here https://github.com/cambrian-mllm/cambrian/tree/main/eval

Oct 28 '24 18:10 ellisbrown

@wufeim note that we have released our eval code here https://github.com/cambrian-mllm/cambrian/tree/main/eval

Oh I see it now. Thanks so much! I will check it out.

I was looking at the documentation here and thought they were not out yet. Maybe update the link in the README?

Oct 28 '24 19:10 wufeim

@wufeim note that we have released our eval code here https://github.com/cambrian-mllm/cambrian/tree/main/eval

Hi @ellisbrown , quick questions on the evaluation code:

It seems that eval/requirements.txt is missing. I guess mainly the datasets package?
When I was evaluating cambrian-8b with the following command, all four GPUs are evaluating on the whole TextVQA instead of one of the four subparts. Is this correct? Or am I using a wrong command?
```
bash scripts/run_benchmark.sh --benchmark textvqa --ckpt nyu-visionx/cambrian-8b --conv_mode llama_3
```

Oct 29 '24 05:10 wufeim

+1 the eval/requirements.txt is missing. It'd be nice to know if a specific version of datasets is needed

Oct 30 '24 01:10 dfan

@wufeim @dfan sorry the requirements was masked by gitignore. added in #82

2. When I was evaluating cambrian-8b with the following command, all four GPUs are evaluating on the whole TextVQA instead of one of the four subparts. Is this correct? Or am I using a wrong command?

@wufeim have a read through run_benchmark.sh. the questions are chunked and each gpu handles one chunk.

let's move further discussion unrelated to this GPU training PR #64 to separate issues please.

Oct 30 '24 01:10 ellisbrown

Hi @TideDra,

I'm trying out the GPU training code. I see that you used zero2 for both pretraining and finetuning. Meanwhile LLaVA used zero2 for pretraining and zero3 for finetuning. I am not an expert with deepspeed but I did encounter some issues with zero3, possibly related to this. Did you have similar issues? Or how did you decide on zero 2/3?

Thanks!

Nov 01 '24 09:11 wufeim

Hi @TideDra,

I'm trying out the GPU training code. I see that you used zero2 for both pretraining and finetuning. Meanwhile LLaVA used zero2 for pretraining and zero3 for finetuning. I am not an expert with deepspeed but I did encounter some issues with zero3, possibly related to this. Did you have similar issues? Or how did you decide on zero 2/3?

Thanks!

In general, zero3 reduces GPU memory usage while increases training time, compared with zero2, but does not affect model perfomance theoretically. So zero2 is preferred if memory is enough. I didn't try zero3 so I didn't encounter your issue :). But actually, zero3 indeed has more bugs than zero2 in practice.

Nov 01 '24 10:11 TideDra

OOM when finetuning Cambrian in the 80GB A100 cluster even with a batch size of 1

Description: I am encountering an Out-of-Memory (OOM) error while attempting to fine-tune the Cambrian model using A100 GPUs. Despite setting the batch size and gradient accumulation steps to 1, the issue persists.

Environment:

Model: Cambrian GPU: A100 DeepSpeed Version: 0.14.4 CUDA Version: 12.1 PyTorch Version: 2.3.1

Python Packages: accelerate 0.32.1 aiofiles 23.2.1 aiohappyeyeballs 2.4.3 aiohttp 3.10.10 aiosignal 1.3.1 altair 5.4.1 annotated-types 0.7.0 anyio 4.6.2.post1 asttokens 2.4.1 async-timeout 4.0.3 attrs 24.2.0 bitsandbytes 0.43.1 cachetools 5.5.0 cambrian 1.0.0 certifi 2024.8.30 charset-normalizer 3.4.0 click 8.1.7 colorama 0.4.6 coloredlogs 15.0.1 contourpy 1.3.0 cos-python-sdk-v5 1.9.32 crcmod 1.7 cycler 0.12.1 decorator 5.1.1 deepspeed 0.14.4 diffusers 0.31.0 docker-pycreds 0.4.0 einops 0.8.0 einops-exts 0.0.4 exceptiongroup 1.2.2 executing 2.1.0 EzColorLog 1.0.3 fastapi 0.115.4 ffmpy 0.4.0 filelock 3.16.1 flash-attn 2.6.3 fonttools 4.54.1 frozenlist 1.5.0 fsspec 2024.10.0 ftfy 6.3.1 gcsfs 2024.10.0 gitdb 4.0.11 GitPython 3.1.43 google-api-core 2.22.0 google-auth 2.36.0 google-auth-oauthlib 1.2.1 google-cloud-core 2.4.1 google-cloud-storage 2.18.2 google-crc32c 1.6.0 google-resumable-media 2.7.2 googleapis-common-protos 1.65.0 gradio 4.16.0 gradio_client 0.8.1 h11 0.14.0 hf_transfer 0.1.8 hjson 3.1.0 httpcore 0.17.3 httpx 0.24.0 huggingface-hub 0.26.2 humanfriendly 10.0 idna 3.10 importlib_metadata 8.5.0 importlib_resources 6.4.5 ipython 8.29.0 jedi 0.19.1 Jinja2 3.1.4 joblib 1.4.2 jsonschema 4.23.0 jsonschema-specifications 2024.10.1 kiwisolver 1.4.7 latex2mathml 3.77.0 markdown-it-py 3.0.0 markdown2 2.5.1 MarkupSafe 2.1.5 matplotlib 3.9.2 matplotlib-inline 0.1.7 mdurl 0.1.2 mpmath 1.3.0 multidict 6.1.0 narwhals 1.13.2 networkx 3.4.2 ninja 1.11.1.1 numpy 1.26.4 nvidia-cublas-cu12 12.1.3.1 nvidia-cuda-cupti-cu12 12.1.105 nvidia-cuda-nvrtc-cu12 12.1.105 nvidia-cuda-runtime-cu12 12.1.105 nvidia-cudnn-cu12 8.9.2.26 nvidia-cufft-cu12 11.0.2.54 nvidia-curand-cu12 10.3.2.106 nvidia-cusolver-cu12 11.4.5.107 nvidia-cusparse-cu12 12.1.0.106 nvidia-ml-py 12.560.30 nvidia-nccl-cu12 2.20.5 nvidia-nvjitlink-cu12 12.6.77 nvidia-nvtx-cu12 12.1.105 oauthlib 3.2.2 open_clip_torch 2.29.0 orjson 3.10.11 packaging 24.1 pandas 2.2.3 parso 0.8.4 peewee 3.17.7 peft 0.11.1 pexpect 4.9.0 pillow 10.4.0 pip 24.2 platformdirs 4.3.6 prompt_toolkit 3.0.48 propcache 0.2.0 proto-plus 1.25.0 protobuf 5.28.3 psutil 6.1.0 ptyprocess 0.7.0 pure_eval 0.2.3 py-cpuinfo 9.0.0 pyasn1 0.6.1 pyasn1_modules 0.4.1 pycryptodome 3.21.0 pydantic 2.9.2 pydantic_core 2.23.4 pydub 0.25.1 Pygments 2.18.0 pynvml 11.5.3 pyparsing 3.2.0 python-dateutil 2.9.0.post0 python-multipart 0.0.17 pytz 2024.2 PyYAML 6.0.2 referencing 0.35.1 regex 2024.11.6 requests 2.32.3 requests-oauthlib 2.0.0 rich 13.9.4 rpds-py 0.21.0 rsa 4.9 ruff 0.7.2 safetensors 0.4.5 scikit-learn 1.5.1 scipy 1.14.1 semantic-version 2.10.0 sentencepiece 0.2.0 sentry-sdk 2.18.0 setproctitle 1.3.3 setuptools 75.1.0 shellingham 1.5.4 shortuuid 1.0.13 six 1.16.0 smmap 5.0.1 sniffio 1.3.1 stack-data 0.6.3 starlette 0.41.2 svgwrite 1.4.3 swanboard 0.1.4b2 swankit 0.1.1b3 swanlab 0.3.23 sympy 1.13.3 threadpoolctl 3.5.0 timm 1.0.7 tokenizers 0.19.1 tomlkit 0.12.0 torch 2.3.1 torchtext 0.18.0 torchvision 0.18.1 tqdm 4.67.0 traitlets 5.14.3 transformers 4.42.4 triton 2.3.1 typer 0.12.5 typing_extensions 4.12.2 tzdata 2024.2 ujson 5.10.0 urllib3 2.2.3 uvicorn 0.32.0 wandb 0.18.6 wavedrom 2.0.3.post3 wcwidth 0.2.13 websockets 11.0.3 wheel 0.44.0 xmltodict 0.14.2 yarl 1.17.1 zipp 3.20.2

Configuration:

DeepSpeed Config: Using ZeRO stage 2 (attach zero2.json if possible) Batch Size: 1 Gradient Accumulation Steps: 1 Mixed Precision: bf16

Steps to Reproduce:

Run the training script with the following command:

deepspeed
--num_nodes $SLURM_JOB_NUM_NODES
--num_gpus $SLURM_GPUS_PER_NODE
--master_addr localhost
--master_port 12345
--hostfile hostfile_temp
--no_ssh_check
cambrian/train/train_gpu.py
--deepspeed ./scripts/zero2.json
--model_name_or_path $ROOT_DIR/LLM/llama3-llava-next-8b
--version llama_v3
--data_path "$ROOT_DIR/cambrian/datasets--nyu-visionx--Cambrian-10M/jsons/Cambrian150K_withsystemprompt.jsonl"
--image_folder "$ROOT_DIR/cambrian/datasets--nyu-visionx--Cambrian-10M/"
--pretrain_mm_mlp_adapter "$ROOT_DIR/cambrian-models/models--nyu-visionx--cambrian-8b_projector/mm_projector.bin"
--vision_tower_aux_list '["siglip/CLIP-ViT-SO400M-14-384", "openai/clip-vit-large-patch14-336", "facebook/dinov2-giant-res378", "clip-convnext-XXL-multi-stage"]'
--vision_tower_aux_token_len_list '[576, 576, 576, 9216]'
--image_token_len 576
--num_query_group 1
--query_num_list '[576]'
--connector_depth 3
--image_position 91
--vision_hidden_size 1024
--connector_only False
--num_of_vision_sampler_layers 10
--start_of_vision_sampler_layers 0
--stride_of_vision_sampler_layers 3
--mm_projector_type sva
--unfreeze_mm_vision_tower False
--mm_vision_select_layer -2
--mm_use_im_start_end False
--mm_use_im_patch_token False
--image_aspect_ratio pad
--group_by_modality_length True
--bf16 True
--output_dir $CKPT_DIR
--num_train_epochs 1
--per_device_train_batch_size 1
--per_device_eval_batch_size 4
--gradient_accumulation_steps 1
--evaluation_strategy "no"
--save_strategy "steps"
--save_steps 500
--save_total_limit 5
--learning_rate 4e-5
--weight_decay 0.
--warmup_ratio 0.03
--lr_scheduler_type "cosine"
--logging_steps 1
--tf32 True
--model_max_length 2048
--gradient_checkpointing True
--dataloader_num_workers 4
--lazy_preprocess True
--run_name $CKPT_NAME
--report_to wandb

Behavior: The training process fails due to OOM errors, even with minimal batch size and gradient accumulation settings.

Nov 07 '24 15:11 nku-zhichengzhang

OOM when finetuning Cambrian in the 80GB A100 cluster even with a batch size of 1

Description: I am encountering an Out-of-Memory (OOM) error while attempting to fine-tune the Cambrian model using A100 GPUs. Despite setting the batch size and gradient accumulation steps to 1, the issue persists.

Environment:

Model: Cambrian GPU: A100 DeepSpeed Version: 0.14.4 CUDA Version: 12.1 PyTorch Version: 2.3.1

Python Packages: accelerate 0.32.1 aiofiles 23.2.1 aiohappyeyeballs 2.4.3 aiohttp 3.10.10 aiosignal 1.3.1 altair 5.4.1 annotated-types 0.7.0 anyio 4.6.2.post1 asttokens 2.4.1 async-timeout 4.0.3 attrs 24.2.0 bitsandbytes 0.43.1 cachetools 5.5.0 cambrian 1.0.0 certifi 2024.8.30 charset-normalizer 3.4.0 click 8.1.7 colorama 0.4.6 coloredlogs 15.0.1 contourpy 1.3.0 cos-python-sdk-v5 1.9.32 crcmod 1.7 cycler 0.12.1 decorator 5.1.1 deepspeed 0.14.4 diffusers 0.31.0 docker-pycreds 0.4.0 einops 0.8.0 einops-exts 0.0.4 exceptiongroup 1.2.2 executing 2.1.0 EzColorLog 1.0.3 fastapi 0.115.4 ffmpy 0.4.0 filelock 3.16.1 flash-attn 2.6.3 fonttools 4.54.1 frozenlist 1.5.0 fsspec 2024.10.0 ftfy 6.3.1 gcsfs 2024.10.0 gitdb 4.0.11 GitPython 3.1.43 google-api-core 2.22.0 google-auth 2.36.0 google-auth-oauthlib 1.2.1 google-cloud-core 2.4.1 google-cloud-storage 2.18.2 google-crc32c 1.6.0 google-resumable-media 2.7.2 googleapis-common-protos 1.65.0 gradio 4.16.0 gradio_client 0.8.1 h11 0.14.0 hf_transfer 0.1.8 hjson 3.1.0 httpcore 0.17.3 httpx 0.24.0 huggingface-hub 0.26.2 humanfriendly 10.0 idna 3.10 importlib_metadata 8.5.0 importlib_resources 6.4.5 ipython 8.29.0 jedi 0.19.1 Jinja2 3.1.4 joblib 1.4.2 jsonschema 4.23.0 jsonschema-specifications 2024.10.1 kiwisolver 1.4.7 latex2mathml 3.77.0 markdown-it-py 3.0.0 markdown2 2.5.1 MarkupSafe 2.1.5 matplotlib 3.9.2 matplotlib-inline 0.1.7 mdurl 0.1.2 mpmath 1.3.0 multidict 6.1.0 narwhals 1.13.2 networkx 3.4.2 ninja 1.11.1.1 numpy 1.26.4 nvidia-cublas-cu12 12.1.3.1 nvidia-cuda-cupti-cu12 12.1.105 nvidia-cuda-nvrtc-cu12 12.1.105 nvidia-cuda-runtime-cu12 12.1.105 nvidia-cudnn-cu12 8.9.2.26 nvidia-cufft-cu12 11.0.2.54 nvidia-curand-cu12 10.3.2.106 nvidia-cusolver-cu12 11.4.5.107 nvidia-cusparse-cu12 12.1.0.106 nvidia-ml-py 12.560.30 nvidia-nccl-cu12 2.20.5 nvidia-nvjitlink-cu12 12.6.77 nvidia-nvtx-cu12 12.1.105 oauthlib 3.2.2 open_clip_torch 2.29.0 orjson 3.10.11 packaging 24.1 pandas 2.2.3 parso 0.8.4 peewee 3.17.7 peft 0.11.1 pexpect 4.9.0 pillow 10.4.0 pip 24.2 platformdirs 4.3.6 prompt_toolkit 3.0.48 propcache 0.2.0 proto-plus 1.25.0 protobuf 5.28.3 psutil 6.1.0 ptyprocess 0.7.0 pure_eval 0.2.3 py-cpuinfo 9.0.0 pyasn1 0.6.1 pyasn1_modules 0.4.1 pycryptodome 3.21.0 pydantic 2.9.2 pydantic_core 2.23.4 pydub 0.25.1 Pygments 2.18.0 pynvml 11.5.3 pyparsing 3.2.0 python-dateutil 2.9.0.post0 python-multipart 0.0.17 pytz 2024.2 PyYAML 6.0.2 referencing 0.35.1 regex 2024.11.6 requests 2.32.3 requests-oauthlib 2.0.0 rich 13.9.4 rpds-py 0.21.0 rsa 4.9 ruff 0.7.2 safetensors 0.4.5 scikit-learn 1.5.1 scipy 1.14.1 semantic-version 2.10.0 sentencepiece 0.2.0 sentry-sdk 2.18.0 setproctitle 1.3.3 setuptools 75.1.0 shellingham 1.5.4 shortuuid 1.0.13 six 1.16.0 smmap 5.0.1 sniffio 1.3.1 stack-data 0.6.3 starlette 0.41.2 svgwrite 1.4.3 swanboard 0.1.4b2 swankit 0.1.1b3 swanlab 0.3.23 sympy 1.13.3 threadpoolctl 3.5.0 timm 1.0.7 tokenizers 0.19.1 tomlkit 0.12.0 torch 2.3.1 torchtext 0.18.0 torchvision 0.18.1 tqdm 4.67.0 traitlets 5.14.3 transformers 4.42.4 triton 2.3.1 typer 0.12.5 typing_extensions 4.12.2 tzdata 2024.2 ujson 5.10.0 urllib3 2.2.3 uvicorn 0.32.0 wandb 0.18.6 wavedrom 2.0.3.post3 wcwidth 0.2.13 websockets 11.0.3 wheel 0.44.0 xmltodict 0.14.2 yarl 1.17.1 zipp 3.20.2

Configuration:

DeepSpeed Config: Using ZeRO stage 2 (attach zero2.json if possible) Batch Size: 1 Gradient Accumulation Steps: 1 Mixed Precision: bf16

Steps to Reproduce:

Run the training script with the following command:

deepspeed --num_nodes $SLURM_JOB_NUM_NODES --num_gpus $SLURM_GPUS_PER_NODE --master_addr localhost --master_port 12345 --hostfile hostfile_temp --no_ssh_check cambrian/train/train_gpu.py --deepspeed ./scripts/zero2.json --model_name_or_path $ROOT_DIR/LLM/llama3-llava-next-8b --version llama_v3 --data_path "$ROOT_DIR/cambrian/datasets--nyu-visionx--Cambrian-10M/jsons/Cambrian150K_withsystemprompt.jsonl" --image_folder "$ROOT_DIR/cambrian/datasets--nyu-visionx--Cambrian-10M/" --pretrain_mm_mlp_adapter "$ROOT_DIR/cambrian-models/models--nyu-visionx--cambrian-8b_projector/mm_projector.bin" --vision_tower_aux_list '["siglip/CLIP-ViT-SO400M-14-384", "openai/clip-vit-large-patch14-336", "facebook/dinov2-giant-res378", "clip-convnext-XXL-multi-stage"]' --vision_tower_aux_token_len_list '[576, 576, 576, 9216]' --image_token_len 576 --num_query_group 1 --query_num_list '[576]' --connector_depth 3 --image_position 91 --vision_hidden_size 1024 --connector_only False --num_of_vision_sampler_layers 10 --start_of_vision_sampler_layers 0 --stride_of_vision_sampler_layers 3 --mm_projector_type sva --unfreeze_mm_vision_tower False --mm_vision_select_layer -2 --mm_use_im_start_end False --mm_use_im_patch_token False --image_aspect_ratio pad --group_by_modality_length True --bf16 True --output_dir $CKPT_DIR --num_train_epochs 1 --per_device_train_batch_size 1 --per_device_eval_batch_size 4 --gradient_accumulation_steps 1 --evaluation_strategy "no" --save_strategy "steps" --save_steps 500 --save_total_limit 5 --learning_rate 4e-5 --weight_decay 0. --warmup_ratio 0.03 --lr_scheduler_type "cosine" --logging_steps 1 --tf32 True --model_max_length 2048 --gradient_checkpointing True --dataloader_num_workers 4 --lazy_preprocess True --run_name $CKPT_NAME --report_to wandb

Behavior: The training process fails due to OOM errors, even with minimal batch size and gradient accumulation settings.

How many gpu do you use? Besides we only test with vicuna-7b.

Nov 07 '24 16:11 TideDra

OOM when finetuning Cambrian in the 80GB A100 cluster even with a batch size of 1 Description: I am encountering an Out-of-Memory (OOM) error while attempting to fine-tune the Cambrian model using A100 GPUs. Despite setting the batch size and gradient accumulation steps to 1, the issue persists. Environment: Model: Cambrian GPU: A100 DeepSpeed Version: 0.14.4 CUDA Version: 12.1 PyTorch Version: 2.3.1 Python Packages: accelerate 0.32.1 aiofiles 23.2.1 aiohappyeyeballs 2.4.3 aiohttp 3.10.10 aiosignal 1.3.1 altair 5.4.1 annotated-types 0.7.0 anyio 4.6.2.post1 asttokens 2.4.1 async-timeout 4.0.3 attrs 24.2.0 bitsandbytes 0.43.1 cachetools 5.5.0 cambrian 1.0.0 certifi 2024.8.30 charset-normalizer 3.4.0 click 8.1.7 colorama 0.4.6 coloredlogs 15.0.1 contourpy 1.3.0 cos-python-sdk-v5 1.9.32 crcmod 1.7 cycler 0.12.1 decorator 5.1.1 deepspeed 0.14.4 diffusers 0.31.0 docker-pycreds 0.4.0 einops 0.8.0 einops-exts 0.0.4 exceptiongroup 1.2.2 executing 2.1.0 EzColorLog 1.0.3 fastapi 0.115.4 ffmpy 0.4.0 filelock 3.16.1 flash-attn 2.6.3 fonttools 4.54.1 frozenlist 1.5.0 fsspec 2024.10.0 ftfy 6.3.1 gcsfs 2024.10.0 gitdb 4.0.11 GitPython 3.1.43 google-api-core 2.22.0 google-auth 2.36.0 google-auth-oauthlib 1.2.1 google-cloud-core 2.4.1 google-cloud-storage 2.18.2 google-crc32c 1.6.0 google-resumable-media 2.7.2 googleapis-common-protos 1.65.0 gradio 4.16.0 gradio_client 0.8.1 h11 0.14.0 hf_transfer 0.1.8 hjson 3.1.0 httpcore 0.17.3 httpx 0.24.0 huggingface-hub 0.26.2 humanfriendly 10.0 idna 3.10 importlib_metadata 8.5.0 importlib_resources 6.4.5 ipython 8.29.0 jedi 0.19.1 Jinja2 3.1.4 joblib 1.4.2 jsonschema 4.23.0 jsonschema-specifications 2024.10.1 kiwisolver 1.4.7 latex2mathml 3.77.0 markdown-it-py 3.0.0 markdown2 2.5.1 MarkupSafe 2.1.5 matplotlib 3.9.2 matplotlib-inline 0.1.7 mdurl 0.1.2 mpmath 1.3.0 multidict 6.1.0 narwhals 1.13.2 networkx 3.4.2 ninja 1.11.1.1 numpy 1.26.4 nvidia-cublas-cu12 12.1.3.1 nvidia-cuda-cupti-cu12 12.1.105 nvidia-cuda-nvrtc-cu12 12.1.105 nvidia-cuda-runtime-cu12 12.1.105 nvidia-cudnn-cu12 8.9.2.26 nvidia-cufft-cu12 11.0.2.54 nvidia-curand-cu12 10.3.2.106 nvidia-cusolver-cu12 11.4.5.107 nvidia-cusparse-cu12 12.1.0.106 nvidia-ml-py 12.560.30 nvidia-nccl-cu12 2.20.5 nvidia-nvjitlink-cu12 12.6.77 nvidia-nvtx-cu12 12.1.105 oauthlib 3.2.2 open_clip_torch 2.29.0 orjson 3.10.11 packaging 24.1 pandas 2.2.3 parso 0.8.4 peewee 3.17.7 peft 0.11.1 pexpect 4.9.0 pillow 10.4.0 pip 24.2 platformdirs 4.3.6 prompt_toolkit 3.0.48 propcache 0.2.0 proto-plus 1.25.0 protobuf 5.28.3 psutil 6.1.0 ptyprocess 0.7.0 pure_eval 0.2.3 py-cpuinfo 9.0.0 pyasn1 0.6.1 pyasn1_modules 0.4.1 pycryptodome 3.21.0 pydantic 2.9.2 pydantic_core 2.23.4 pydub 0.25.1 Pygments 2.18.0 pynvml 11.5.3 pyparsing 3.2.0 python-dateutil 2.9.0.post0 python-multipart 0.0.17 pytz 2024.2 PyYAML 6.0.2 referencing 0.35.1 regex 2024.11.6 requests 2.32.3 requests-oauthlib 2.0.0 rich 13.9.4 rpds-py 0.21.0 rsa 4.9 ruff 0.7.2 safetensors 0.4.5 scikit-learn 1.5.1 scipy 1.14.1 semantic-version 2.10.0 sentencepiece 0.2.0 sentry-sdk 2.18.0 setproctitle 1.3.3 setuptools 75.1.0 shellingham 1.5.4 shortuuid 1.0.13 six 1.16.0 smmap 5.0.1 sniffio 1.3.1 stack-data 0.6.3 starlette 0.41.2 svgwrite 1.4.3 swanboard 0.1.4b2 swankit 0.1.1b3 swanlab 0.3.23 sympy 1.13.3 threadpoolctl 3.5.0 timm 1.0.7 tokenizers 0.19.1 tomlkit 0.12.0 torch 2.3.1 torchtext 0.18.0 torchvision 0.18.1 tqdm 4.67.0 traitlets 5.14.3 transformers 4.42.4 triton 2.3.1 typer 0.12.5 typing_extensions 4.12.2 tzdata 2024.2 ujson 5.10.0 urllib3 2.2.3 uvicorn 0.32.0 wandb 0.18.6 wavedrom 2.0.3.post3 wcwidth 0.2.13 websockets 11.0.3 wheel 0.44.0 xmltodict 0.14.2 yarl 1.17.1 zipp 3.20.2 Configuration: DeepSpeed Config: Using ZeRO stage 2 (attach zero2.json if possible) Batch Size: 1 Gradient Accumulation Steps: 1 Mixed Precision: bf16 Steps to Reproduce: Run the training script with the following command: deepspeed --num_nodes $SLURM_JOB_NUM_NODES --num_gpus ROOT_DIR/LLM/llama3-llava-next-8b --version llama_v3 --data_path "$ROOT_DIR/cambrian/datasets--nyu-visionx--Cambrian-10M/jsons/Cambrian150K_withsystemprompt.jsonl" --image_folder "$ROOT_DIR/cambrian/datasets--nyu-visionx--Cambrian-10M/" --pretrain_mm_mlp_adapter "$ROOT_DIR/cambrian-models/models--nyu-visionx--cambrian-8b_projector/mm_projector.bin" --vision_tower_aux_list '["siglip/CLIP-ViT-SO400M-14-384", "openai/clip-vit-large-patch14-336", "facebook/dinov2-giant-res378", "clip-convnext-XXL-multi-stage"]' --vision_tower_aux_token_len_list '[576, 576, 576, 9216]' --image_token_len 576 --num_query_group 1 --query_num_list '[576]' --connector_depth 3 --image_position 91 --vision_hidden_size 1024 --connector_only False --num_of_vision_sampler_layers 10 --start_of_vision_sampler_layers 0 --stride_of_vision_sampler_layers 3 --mm_projector_type sva --unfreeze_mm_vision_tower False --mm_vision_select_layer -2 --mm_use_im_start_end False --mm_use_im_patch_token False --image_aspect_ratio pad --group_by_modality_length True --bf16 True --output_dir $CKPT_DIR --num_train_epochs 1 --per_device_train_batch_size 1 --per_device_eval_batch_size 4 --gradient_accumulation_steps 1 --evaluation_strategy "no" --save_strategy "steps" --save_steps 500 --save_total_limit 5 --learning_rate 4e-5 --weight_decay 0. --warmup_ratio 0.03 --lr_scheduler_type "cosine" --logging_steps 1 --tf32 True --model_max_length 2048 --gradient_checkpointing True --dataloader_num_workers 4 --lazy_preprocess True --run_name $CKPT_NAME --report_to wandb Behavior: The training process fails due to OOM errors, even with minimal batch size and gradient accumulation settings.

How many gpu do you use? Besides we only test with vicuna-7b.

2 A100 by now, for Llama3 8B.

But I only assign 1 sample per GPU. It really confused me. Could you give me any suggestions?

Nov 07 '24 16:11 nku-zhichengzhang

@nku-zhichengzhang it seems that 15.58G memories are reserved by pytorch but unallocated. You may follow the instruction given by the error, or try to clear cuda memories.

Nov 07 '24 16:11 TideDra

@nku-zhichengzhang it seems that 15.58G memories are reserved by pytorch but unallocated. You may follow the instruction given by the error, or try to clear cuda memories.

Loading the Llama3 model occupies 15GB of memory. BTW, how many GPUs did you use to train the Vicuna model?

Nov 07 '24 16:11 nku-zhichengzhang

@nku-zhichengzhang it seems that 15.58G memories are reserved by pytorch but unallocated. You may follow the instruction given by the error, or try to clear cuda memories.

Loading the Llama3 model occupies 15GB of memory. BTW, how many GPUs did you use to train the Vicuna model?

We use at least 8 gpus for pretraining and 32 gpus for finetuning. You may try zero3, which requires less memories?

Nov 07 '24 16:11 TideDra

@nku-zhichengzhang it seems that 15.58G memories are reserved by pytorch but unallocated. You may follow the instruction given by the error, or try to clear cuda memories.

Loading the Llama3 model occupies 15GB of memory. BTW, how many GPUs did you use to train the Vicuna model?

We use at least 8 gpus for pretraining and 32 gpus for finetuning. You may try zero3, which requires less memories?

Okay, thanks for the reply.

Nov 07 '24 16:11 nku-zhichengzhang

@TideDra Have you verified the performance of the model trained with your GPU setting?

Jan 16 '25 02:01 futurev