VLMEvalKit
How to run VITA1.0 on 4*80G GPUs?
When I test VITA1.0 on my own dataset, it raises a "CUDA out of memory" error. The error occurs when loading the VITAMixtralForCausalLM model. I have tried 4, 8, and 16 input frames.
VITA1.0 uses tensor_parallel_size in its own repo, but it seems that this cannot be applied directly to the VLMEvalKit version.
How can I fix this?
Hi,
Thank you for your inquiry!
Upon reviewing the load_pretrained_model function in the VITA repository, I noticed that the original device_map configuration is currently set up to support only two devices. Here's the default mapping:
if model_type == "mixtral-8x7b":
    # import pdb; pdb.set_trace()
    device_map = {
        "model.embed_tokens": 0,
        "model.layers.0": 0,
        "model.layers.1": 0,
        "model.layers.2": 0,
        "model.layers.3": 0,
        "model.layers.4": 0,
        "model.layers.5": 0,
        "model.layers.6": 0,
        "model.layers.7": 0,
        "model.layers.8": 0,
        "model.layers.9": 0,
        "model.layers.10": 0,
        "model.layers.11": 0,
        "model.layers.12": 0,
        "model.layers.13": 0,
        "model.layers.14": 0,
        "model.layers.15": 0,
        "model.layers.16": 1,
        "model.layers.17": 1,
        "model.layers.18": 1,
        "model.layers.19": 1,
        "model.layers.20": 1,
        "model.layers.21": 1,
        "model.layers.22": 1,
        "model.layers.23": 1,
        "model.layers.24": 1,
        "model.layers.25": 1,
        "model.layers.26": 1,
        "model.layers.27": 1,
        "model.layers.28": 1,
        "model.layers.29": 1,
        "model.layers.30": 1,
        "model.layers.31": 1,
        "model.norm": 1,
        "model.vision_tower": 1,
        "model.mm_projector": 1,
        "model.audio_encoder": 1,
        "lm_head": 1,
    }
If you'd like to run VITA across four devices, we recommend modifying the device_map to distribute the model layers evenly across devices 0, 1, 2, 3. To do this, you can manually update the device_map dictionary.
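For example, here is a minimal sketch of an even split (the layer boundaries and the placement of the vision/audio modules below are illustrative assumptions, not official VITA settings):

# Sketch: spread Mixtral's 32 decoder layers evenly over 4 GPUs (8 per device).
# Adjust the boundaries and the non-layer modules to fit your memory budget.
num_layers, num_devices = 32, 4
layers_per_device = num_layers // num_devices

device_map = {"model.embed_tokens": 0}
for i in range(num_layers):
    device_map[f"model.layers.{i}"] = i // layers_per_device
device_map.update({
    "model.norm": 3,
    "model.vision_tower": 3,
    "model.mm_projector": 3,
    "model.audio_encoder": 3,
    "lm_head": 3,
})

The resulting dictionary can then be passed to from_pretrained via kwargs in the same way as the original mapping.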
If you need further assistance with modifying the device_map or encounter any issues during the process, feel free to reach out. We’re happy to help!
Best regards
Thanks for your response.
I have modified the device_map as follows:
device_map = {
    "model.embed_tokens": 0,
    "model.layers.0": 0,
    "model.layers.1": 0,
    "model.layers.2": 0,
    "model.layers.3": 0,
    "model.layers.4": 0,
    "model.layers.5": 0,
    "model.layers.6": 0,
    "model.layers.7": 0,
    "model.layers.8": 2,
    "model.layers.9": 2,
    "model.layers.10": 2,
    "model.layers.11": 2,
    "model.layers.12": 2,
    "model.layers.13": 2,
    "model.layers.14": 2,
    "model.layers.15": 2,
    "model.layers.16": 2,
    "model.layers.17": 1,
    "model.layers.18": 1,
    "model.layers.19": 1,
    "model.layers.20": 1,
    "model.layers.21": 1,
    "model.layers.22": 1,
    "model.layers.23": 1,
    "model.layers.24": 1,
    "model.layers.25": 1,
    "model.layers.26": 3,
    "model.layers.27": 3,
    "model.layers.28": 3,
    "model.layers.29": 3,
    "model.layers.30": 3,
    "model.layers.31": 3,
    "model.norm": 3,
    "model.vision_tower": 3,
    "model.mm_projector": 3,
    "model.audio_encoder": 3,
    "lm_head": 3,
}
device_map["model.audio_encoder"] = 0
kwargs.update(device_map=device_map)
tokenizer = AutoTokenizer.from_pretrained(model_path, use_fast=True)
model = VITAMixtralForCausalLM.from_pretrained(
    model_path, low_cpu_mem_usage=True, **kwargs
)
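As a sanity check, one could also print the placement recorded by accelerate after loading to confirm the split actually took effect. A minimal sketch, assuming from_pretrained was called with the device_map above:

# hf_device_map reflects the module-to-device assignment used for dispatch.
for module_name, device in model.hf_device_map.items():
    print(f"{module_name} -> {device}")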
However, the same error still exists. Should I try to use more devices?
Could you please share your execution command?
This is my config JSON:
{
    "model": {
        "VITA": {
            "class": "VITA",
            "model_path": "/workspaces/VLMEvalKit/ckpt/VITA/VITA_ckpt"
        }
    },
    "data": {
        "EVU": {
            "class": "EVU",
            "dataset": "EVU",
            "nframe": 16
        }
    }
}
This is the command I run:
export CUDA_LAUNCH_BLOCKING=1
export CUDA_VISIBLE_DEVICES=0,1,2,3
AUTO_SPLIT=1 torchrun --nproc-per-node=4 run.py \
--config scripts/VITA.json \
--verbose
I see the problem. When you want to split one model across four devices, do not use --nproc_per_node=4. Instead, if there are four devices on your machine, please use --nproc_per_node=1.
Thanks for your advice. However, a new error is raised:
[2025-02-24 09:43:14,774] ERROR - RUN - run.py: main - 428: Model VITA x Dataset EVU combination failed: Expected all tensors to be on the same device, but found at least two devices, cpu and cuda:0! (when checking argument for argument mat2 in method wrapper_CUDA_bmm), skipping this combination.
Traceback (most recent call last):
File "/workspaces/VLMEvalKit/run.py", line 294, in main
model = infer_data_job_video(
^^^^^^^^^^^^^^^^^^^^^
File "/workspaces/VLMEvalKit/vlmeval/inference_video.py", line 157, in infer_data_job_video
model = infer_data(
^^^^^^^^^^^
File "/workspaces/VLMEvalKit/vlmeval/inference_video.py", line 122, in infer_data
response = model.generate(message=struct, dataset=dataset_name)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/workspaces/VLMEvalKit/vlmeval/vlm/base.py", line 117, in generate
return self.generate_inner(message, dataset)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/workspaces/VLMEvalKit/vlmeval/vlm/vita.py", line 641, in generate_inner
cont = self.model.generate(
^^^^^^^^^^^^^^^^^^^^
File "/opt/conda/lib/python3.11/site-packages/torch/utils/_contextlib.py", line 115, in decorate_context
return func(*args, **kwargs)
^^^^^^^^^^^^^^^^^^^^^
File "/opt/conda/lib/python3.11/site-packages/transformers/generation/utils.py", line 2223, in generate
result = self._sample(
^^^^^^^^^^^^^
File "/opt/conda/lib/python3.11/site-packages/transformers/generation/utils.py", line 3211, in _sample
outputs = self(**model_inputs, return_dict=True)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/opt/conda/lib/python3.11/site-packages/torch/nn/modules/module.py", line 1511, in _wrapped_call_impl
return self._call_impl(*args, **kwargs)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/opt/conda/lib/python3.11/site-packages/torch/nn/modules/module.py", line 1520, in _call_impl
return forward_call(*args, **kwargs)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/opt/conda/lib/python3.11/site-packages/accelerate/hooks.py", line 176, in new_forward
output = module._old_forward(*args, **kwargs)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/workspaces/VLMEvalKit/vita/model/language_model/vita_mixtral.py", line 278, in forward
return super().forward(
^^^^^^^^^^^^^^^^
File "/workspaces/VLMEvalKit/vita/model/language_model/vita_mixtral.py", line 158, in custom_forward
outputs = self.model(
^^^^^^^^^^^
File "/opt/conda/lib/python3.11/site-packages/torch/nn/modules/module.py", line 1511, in _wrapped_call_impl
return self._call_impl(*args, **kwargs)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/opt/conda/lib/python3.11/site-packages/torch/nn/modules/module.py", line 1520, in _call_impl
return forward_call(*args, **kwargs)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/opt/conda/lib/python3.11/site-packages/transformers/models/mixtral/modeling_mixtral.py", line 670, in forward
position_embeddings = self.rotary_emb(hidden_states, position_ids)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/opt/conda/lib/python3.11/site-packages/torch/nn/modules/module.py", line 1511, in _wrapped_call_impl
return self._call_impl(*args, **kwargs)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/opt/conda/lib/python3.11/site-packages/torch/nn/modules/module.py", line 1520, in _call_impl
return forward_call(*args, **kwargs)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/opt/conda/lib/python3.11/site-packages/torch/utils/_contextlib.py", line 115, in decorate_context
return func(*args, **kwargs)
^^^^^^^^^^^^^^^^^^^^^
File "/opt/conda/lib/python3.11/site-packages/transformers/models/mixtral/modeling_mixtral.py", line 444, in forward
freqs = (inv_freq_expanded.float() @ position_ids_expanded.float()).transpose(1, 2)
~~~~~~~~~~~~~~~~~~~~~~~~~~^~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
RuntimeError: Expected all tensors to be on the same device, but found at least two devices, cpu and cuda:0! (when checking argument for argument mat2 in method wrapper_CUDA_bmm)
We recommend that you try using the original device_map in combination with the --nproc_per_node=1 flag to verify whether this resolves the issue. Set CUDA_VISIBLE_DEVICES to 0,1.
Thanks for your patience. I have used the original device_map and ran the command as follows:
export CUDA_LAUNCH_BLOCKING=1
export CUDA_VISIBLE_DEVICES=0,1
AUTO_SPLIT=1 torchrun --nproc-per-node=1 run.py \
--config scripts/eval_config/VITA.json \
--verbose
But the same error still exists.
@lxysl We’ve encountered some issues related to the VITA-Qwen model. Could you please assist us in resolving them?
I seem to have encountered this issue before, but at that time, I was working with vita-1.5, as I am not a developer for vita-1.0.
The problem likely lies in this line of code: https://github.com/VITA-MLLM/VITA/blob/8310b38aa909748368774bd88c7fa6ee26d02f4b/vita/model/builder.py#L292.
The vision_tower in the device_map is on cuda:1, so transformers automatically moves the images to cuda:1. However, on line 292, vision_tower.to(device) is called again, where device is cuda:0.
Therefore, changing
vision_tower.to(device=device, dtype=torch.float16)
to
vision_tower.to(dtype=torch.float16)
should resolve the issue. Please give it a try.
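A quick way to check whether this is indeed what is happening (a sketch, assuming the LLaVA-style get_vision_tower() accessor that builder.py uses to obtain the tower, and that the device_map dict is still in scope) is to compare where the vision tower's weights actually live against what the device_map requested:

# If .to(device) overrode the device_map, these two will disagree, and the image
# tensors routed by accelerate will land on a different GPU than the weights.
vt_device = next(model.get_vision_tower().parameters()).device
print("vision_tower weights on:", vt_device)
print("device_map entry:       ", device_map.get("model.vision_tower"))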
Thanks for your help. I installed VITA from the repo you provided, but it doesn't solve the problem:
[2025-02-25 08:58:56] ERROR - run.py: main - 428: Model VITA x Dataset EVU combination failed: Expected all tensors to be on the same device, but found at least two devices, cpu and cuda:0! (when checking argument for argument mat2 in method wrapper_CUDA_bmm), skipping this combination.
Traceback (most recent call last):
File "/workspaces/VLMEvalKit/run.py", line 294, in main
model = infer_data_job_video(
^^^^^^^^^^^^^^^^^^^^^
File "/workspaces/VLMEvalKit/vlmeval/inference_video.py", line 157, in infer_data_job_video
model = infer_data(
^^^^^^^^^^^
File "/workspaces/VLMEvalKit/vlmeval/inference_video.py", line 122, in infer_data
response = model.generate(message=struct, dataset=dataset_name)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/workspaces/VLMEvalKit/vlmeval/vlm/base.py", line 117, in generate
return self.generate_inner(message, dataset)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/workspaces/VLMEvalKit/vlmeval/vlm/vita.py", line 641, in generate_inner
cont = self.model.generate(
^^^^^^^^^^^^^^^^^^^^
File "/opt/conda/lib/python3.11/site-packages/torch/utils/_contextlib.py", line 115, in decorate_context
return func(*args, **kwargs)
^^^^^^^^^^^^^^^^^^^^^
File "/opt/conda/lib/python3.11/site-packages/transformers/generation/utils.py", line 2223, in generate
result = self._sample(
^^^^^^^^^^^^^
File "/opt/conda/lib/python3.11/site-packages/transformers/generation/utils.py", line 3211, in _sample
outputs = self(**model_inputs, return_dict=True)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/opt/conda/lib/python3.11/site-packages/torch/nn/modules/module.py", line 1511, in _wrapped_call_impl
return self._call_impl(*args, **kwargs)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/opt/conda/lib/python3.11/site-packages/torch/nn/modules/module.py", line 1520, in _call_impl
return forward_call(*args, **kwargs)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/opt/conda/lib/python3.11/site-packages/accelerate/hooks.py", line 176, in new_forward
output = module._old_forward(*args, **kwargs)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/workspaces/VLMEvalKit/vita/model/language_model/vita_mixtral.py", line 286, in forward
return super().forward(
^^^^^^^^^^^^^^^^
File "/workspaces/VLMEvalKit/vita/model/language_model/vita_mixtral.py", line 158, in custom_forward
outputs = self.model(
^^^^^^^^^^^
File "/opt/conda/lib/python3.11/site-packages/torch/nn/modules/module.py", line 1511, in _wrapped_call_impl
return self._call_impl(*args, **kwargs)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/opt/conda/lib/python3.11/site-packages/torch/nn/modules/module.py", line 1520, in _call_impl
return forward_call(*args, **kwargs)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/opt/conda/lib/python3.11/site-packages/transformers/models/mixtral/modeling_mixtral.py", line 670, in forward
position_embeddings = self.rotary_emb(hidden_states, position_ids)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/opt/conda/lib/python3.11/site-packages/torch/nn/modules/module.py", line 1511, in _wrapped_call_impl
return self._call_impl(*args, **kwargs)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/opt/conda/lib/python3.11/site-packages/torch/nn/modules/module.py", line 1520, in _call_impl
return forward_call(*args, **kwargs)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/opt/conda/lib/python3.11/site-packages/torch/utils/_contextlib.py", line 115, in decorate_context
return func(*args, **kwargs)
^^^^^^^^^^^^^^^^^^^^^
File "/opt/conda/lib/python3.11/site-packages/transformers/models/mixtral/modeling_mixtral.py", line 444, in forward
freqs = (inv_freq_expanded.float() @ position_ids_expanded.float()).transpose(1, 2)
~~~~~~~~~~~~~~~~~~~~~~~~~~^~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
RuntimeError: Expected all tensors to be on the same device, but found at least two devices, cpu and cuda:0! (when checking argument for argument mat2 in method wrapper_CUDA_bmm)
I will try to evaluate from the original VITA repo. Thanks for all your help!
I am curious why the devices are cpu and cuda:0 instead of cuda:0 and cuda:1.
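One way to narrow that down (a sketch, to be run right after load_pretrained_model returns) is to list everything that is still sitting on the CPU; registered buffers such as the rotary embedding's inv_freq table, one of the tensors involved in the failing matmul, would show up here:

# Parameters come from the checkpoint; named_buffers() also catches registered
# buffers (e.g. rotary-embedding inv_freq tables) that are not in the checkpoint.
for name, tensor in list(model.named_parameters()) + list(model.named_buffers()):
    if tensor.device.type == "cpu":
        print(name, tuple(tensor.shape))

If nothing is printed, the CPU-resident tensor is more likely one of the inputs (e.g. position_ids) rather than a model weight or buffer.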
Can you just give this command a try?
CUDA_VISIBLE_DEVICES=0,1 python run.py --data MMBench_TEST_EN_V11 MMBench_TEST_CN_V11 MMStar MMMU_DEV_VAL MathVista_MINI HallusionBench AI2D_TEST OCRBench MMVet MME --model vita --verbose
By the way, the VITA_ROOT path for vita-1.0 should be the same as for vita-1.5, which means you don't need to use git checkout to switch branches.
Sure, I use the same repo as vita-1.5:
Traceback (most recent call last):
File "/workspaces/VLMEvalKit/run.py", line 312, in main
model = infer_data_job(
^^^^^^^^^^^^^^^
File "/workspaces/VLMEvalKit/vlmeval/inference.py", line 165, in infer_data_job
model = infer_data(
^^^^^^^^^^^
File "/workspaces/VLMEvalKit/vlmeval/inference.py", line 99, in infer_data
model = supported_VLM[model_name]() if isinstance(model, str) else model
^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/workspaces/VLMEvalKit/vlmeval/vlm/vita.py", line 392, in __init__
tokenizer, model, image_processor, _ = load_pretrained_model(
^^^^^^^^^^^^^^^^^^^^^^
File "/workspaces/VLMEvalKit/vita/model/builder.py", line 233, in load_pretrained_model
tokenizer = AutoTokenizer.from_pretrained(model_path, use_fast=True)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/opt/conda/lib/python3.11/site-packages/transformers/models/auto/tokenization_auto.py", line 901, in from_pretrained
config = AutoConfig.from_pretrained(
^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/opt/conda/lib/python3.11/site-packages/transformers/models/auto/configuration_auto.py", line 1112, in from_pretrained
raise ValueError(
ValueError: Unrecognized model in VITA-MLLM/VITA. Should have a `model_type` key in its config.json, or contain one of the following strings in its name: albert, align, altclip, aria, aria_text, audio-spectrogram-transformer, autoformer, bamba, bark, bart, beit, bert, bert-generation, big_bird, bigbird_pegasus, biogpt, bit, blenderbot, blenderbot-small, blip, blip-2, bloom, bridgetower, bros, camembert, canine, chameleon, chinese_clip, chinese_clip_vision_model, clap, clip, clip_text_model, clip_vision_model, clipseg, clvp, code_llama, codegen, cohere, cohere2, colpali, conditional_detr, convbert, convnext, convnextv2, cpmant, ctrl, cvt, dab-detr, dac, data2vec-audio, data2vec-text, data2vec-vision, dbrx, deberta, deberta-v2, decision_transformer, deformable_detr, deit, depth_anything, depth_pro, deta, detr, diffllama, dinat, dinov2, dinov2_with_registers, distilbert, donut-swin, dpr, dpt, efficientformer, efficientnet, electra, emu3, encodec, encoder-decoder, ernie, ernie_m, esm, falcon, falcon_mamba, fastspeech2_conformer, flaubert, flava, fnet, focalnet, fsmt, funnel, fuyu, gemma, gemma2, git, glm, glpn, got_ocr2, gpt-sw3, gpt2, gpt_bigcode, gpt_neo, gpt_neox, gpt_neox_japanese, gptj, gptsan-japanese, granite, granitemoe, granitemoeshared, granitevision, graphormer, grounding-dino, groupvit, helium, hiera, hubert, ibert, idefics, idefics2, idefics3, idefics3_vision, ijepa, imagegpt, informer, instructblip, instructblipvideo, jamba, jetmoe, jukebox, kosmos-2, layoutlm, layoutlmv2, layoutlmv3, led, levit, lilt, llama, llava, llava_next, llava_next_video, llava_onevision, longformer, longt5, luke, lxmert, m2m_100, mamba, mamba2, marian, markuplm, mask2former, maskformer, maskformer-swin, mbart, mctct, mega, megatron-bert, mgp-str, mimi, mistral, mixtral, mllama, mobilebert, mobilenet_v1, mobilenet_v2, mobilevit, mobilevitv2, modernbert, moonshine, moshi, mpnet, mpt, mra, mt5, musicgen, musicgen_melody, mvp, nat, nemotron, nezha, nllb-moe, nougat, nystromformer, olmo, olmo2, olmoe, omdet-turbo, oneformer, open-llama, openai-gpt, opt, owlv2, owlvit, paligemma, patchtsmixer, patchtst, pegasus, pegasus_x, perceiver, persimmon, phi, phi3, phimoe, pix2struct, pixtral, plbart, poolformer, pop2piano, prophetnet, pvt, pvt_v2, qdqbert, qwen2, qwen2_5_vl, qwen2_audio, qwen2_audio_encoder, qwen2_moe, qwen2_vl, rag, realm, recurrent_gemma, reformer, regnet, rembert, resnet, retribert, roberta, roberta-prelayernorm, roc_bert, roformer, rt_detr, rt_detr_resnet, rt_detr_v2, rwkv, sam, seamless_m4t, seamless_m4t_v2, segformer, seggpt, sew, sew-d, siglip, siglip_vision_model, speech-encoder-decoder, speech_to_text, speech_to_text_2, speecht5, splinter, squeezebert, stablelm, starcoder2, superglue, superpoint, swiftformer, swin, swin2sr, swinv2, switch_transformers, t5, table-transformer, tapas, textnet, time_series_transformer, timesformer, timm_backbone, timm_wrapper, trajectory_transformer, transfo-xl, trocr, tvlt, tvp, udop, umt5, unispeech, unispeech-sat, univnet, upernet, van, video_llava, videomae, vilt, vipllava, vision-encoder-decoder, vision-text-dual-encoder, visual_bert, vit, vit_hybrid, vit_mae, vit_msn, vitdet, vitmatte, vitpose, vitpose_backbone, vits, vivit, wav2vec2, wav2vec2-bert, wav2vec2-conformer, wavlm, whisper, xclip, xglm, xlm, xlm-prophetnet, xlm-roberta, xlm-roberta-xl, xlnet, xmod, yolos, yoso, zamba, zamba2, zoedepth, vita-mixtral, vita-Mistral, vita-Qwen2, vita-fo-Qwen2
I checked and confirmed that in the config.json file of the vita-1.0 repository on Hugging Face, there is a model_type parameter: https://huggingface.co/VITA-MLLM/VITA/blob/main/VITA_ckpt/config.json
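One thing that may be worth double-checking: the error mentions VITA-MLLM/VITA, i.e. the repo root, while that config.json lives under the VITA_ckpt/ subfolder. A sketch for confirming what model_type the loader would actually see (the local path below is taken from the config JSON earlier in the thread and may differ on your machine):

import json, os

# Hypothetical local checkpoint path; on the Hub, config.json sits under the
# VITA_ckpt/ subfolder rather than the repo root.
model_path = "/workspaces/VLMEvalKit/ckpt/VITA/VITA_ckpt"
with open(os.path.join(model_path, "config.json")) as f:
    print(json.load(f).get("model_type"))  # expected: one of the vita-* types listed in the error above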