How to run VITA1.0 on 4*80G GPUs?

Open Espere-1119-Song opened this issue 9 months ago • 14 comments

When I test VITA1.0 on my own dataset, it raises a "CUDA out of memory" error. The error occurs when loading the model VITAMixtralForCausalLM. I have tried inputting 4, 8, and 16 frames.

VITA1.0 uses tensor_parallel_size in its own repo, but it seems that it cannot be applied directly to the VLMEvalKit version.

How to fix it?

Image

Espere-1119-Song avatar Feb 23 '25 12:02 Espere-1119-Song

Hi,

Thank you for your inquiry!

Upon reviewing the load_pretrained_model function in the VITA repository, I noticed that the original device_map configuration is currently set up to support only two devices. Here's the default mapping:

if model_type == "mixtral-8x7b":
            # import pdb; pdb.set_trace()
            device_map = {
                "model.embed_tokens": 0,
                "model.layers.0": 0,
                "model.layers.1": 0,
                "model.layers.2": 0,
                "model.layers.3": 0,
                "model.layers.4": 0,
                "model.layers.5": 0,
                "model.layers.6": 0,
                "model.layers.7": 0,
                "model.layers.8": 0,
                "model.layers.9": 0,
                "model.layers.10": 0,
                "model.layers.11": 0,
                "model.layers.12": 0,
                "model.layers.13": 0,
                "model.layers.14": 0,
                "model.layers.15": 0,
                "model.layers.16": 1,
                "model.layers.17": 1,
                "model.layers.18": 1,
                "model.layers.19": 1,
                "model.layers.20": 1,
                "model.layers.21": 1,
                "model.layers.22": 1,
                "model.layers.23": 1,
                "model.layers.24": 1,
                "model.layers.25": 1,
                "model.layers.26": 1,
                "model.layers.27": 1,
                "model.layers.28": 1,
                "model.layers.29": 1,
                "model.layers.30": 1,
                "model.layers.31": 1,
                "model.norm": 1,
                "model.vision_tower": 1,
                "model.mm_projector": 1,
                "model.audio_encoder": 1,
                "lm_head": 1,
            }

If you'd like to run VITA across four devices, we recommend modifying the device_map to distribute the model layers evenly across devices 0, 1, 2, 3. To do this, you can manually update the device_map dictionary.
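For reference, here is a minimal sketch of how such a mapping could be generated programmatically. The helper name and the assumption of 32 decoder layers are illustrative only and not part of the VITA codebase; it simply spreads the layers evenly and keeps the remaining modules together on the last device:

def build_device_map(num_gpus: int, num_layers: int = 32) -> dict:
    # Hypothetical helper: spread the Mixtral decoder layers evenly over
    # `num_gpus` devices; embed_tokens stays on device 0 and the non-layer
    # modules stay together on the last device.
    device_map = {"model.embed_tokens": 0}
    per_gpu = (num_layers + num_gpus - 1) // num_gpus  # ceiling division
    for i in range(num_layers):
        device_map[f"model.layers.{i}"] = min(i // per_gpu, num_gpus - 1)
    for name in ("model.norm", "model.vision_tower", "model.mm_projector",
                 "model.audio_encoder", "lm_head"):
        device_map[name] = num_gpus - 1
    return device_map

# e.g. build_device_map(4) puts layers 0-7 on GPU 0, 8-15 on GPU 1, and so on.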

If you need further assistance with modifying the device_map or encounter any issues during the process, feel free to reach out. We’re happy to help!

Best regards

PhoenixZ810 avatar Feb 24 '25 02:02 PhoenixZ810

Thanks for your response.

I have modified the device_map as follows:

device_map = {
                "model.embed_tokens": 0,
                "model.layers.0": 0,
                "model.layers.1": 0,
                "model.layers.2": 0,
                "model.layers.3": 0,
                "model.layers.4": 0,
                "model.layers.5": 0,
                "model.layers.6": 0,
                "model.layers.7": 0,
                "model.layers.8": 2,
                "model.layers.9": 2,
                "model.layers.10": 2,
                "model.layers.11": 2,
                "model.layers.12": 2,
                "model.layers.13": 2,
                "model.layers.14": 2,
                "model.layers.15": 2,
                "model.layers.16": 2,
                "model.layers.17": 1,
                "model.layers.18": 1,
                "model.layers.19": 1,
                "model.layers.20": 1,
                "model.layers.21": 1,
                "model.layers.22": 1,
                "model.layers.23": 1,
                "model.layers.24": 1,
                "model.layers.25": 1,
                "model.layers.26": 3,
                "model.layers.27": 3,
                "model.layers.28": 3,
                "model.layers.29": 3,
                "model.layers.30": 3,
                "model.layers.31": 3,
                "model.norm": 3,
                "model.vision_tower": 3,
                "model.mm_projector": 3,
                "model.audio_encoder": 3,
                "lm_head": 3,
            }
device_map["model.audio_encoder"] = 0
kwargs.update(device_map=device_map)
tokenizer = AutoTokenizer.from_pretrained(model_path, use_fast=True)
model = VITAMixtralForCausalLM.from_pretrained(
    model_path, low_cpu_mem_usage=True, **kwargs
)

However, the same error still exists. Should I try to use more devices?

Espere-1119-Song avatar Feb 24 '25 04:02 Espere-1119-Song

Could you please share your execution command?

PhoenixZ810 avatar Feb 24 '25 08:02 PhoenixZ810

This is my config json:

{
    "model": {
        "VITA": {
            "class": "VITA",
            "model_path": "/workspaces/VLMEvalKit/ckpt/VITA/VITA_ckpt"
        }
    },
    "data": {
        "EVU": {
            "class": "EVU",
            "dataset": "EVU",
            "nframe": 16
        }
    }
}

This is my running command:

export CUDA_LAUNCH_BLOCKING=1
export CUDA_VISIBLE_DEVICES=0,1,2,3

AUTO_SPLIT=1 torchrun --nproc-per-node=4 run.py \
    --config scripts/VITA.json \
    --verbose 

Espere-1119-Song avatar Feb 24 '25 08:02 Espere-1119-Song

I see now. When you want to split one model across four devices, do not use --nproc_per_node=4. Instead, if there are four devices on your machine, please use --nproc_per_node=1.

PhoenixZ810 avatar Feb 24 '25 08:02 PhoenixZ810

Thanks for your advice. However, a new error is raised:

[2025-02-24 09:43:14,774] ERROR - RUN - run.py: main - 428: Model VITA x Dataset EVU combination failed: Expected all tensors to be on the same device, but found at least two devices, cpu and cuda:0! (when checking argument for argument mat2 in method wrapper_CUDA_bmm), skipping this combination.
Traceback (most recent call last):
  File "/workspaces/VLMEvalKit/run.py", line 294, in main
    model = infer_data_job_video(
            ^^^^^^^^^^^^^^^^^^^^^
  File "/workspaces/VLMEvalKit/vlmeval/inference_video.py", line 157, in infer_data_job_video
    model = infer_data(
            ^^^^^^^^^^^
  File "/workspaces/VLMEvalKit/vlmeval/inference_video.py", line 122, in infer_data
    response = model.generate(message=struct, dataset=dataset_name)
               ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/workspaces/VLMEvalKit/vlmeval/vlm/base.py", line 117, in generate
    return self.generate_inner(message, dataset)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/workspaces/VLMEvalKit/vlmeval/vlm/vita.py", line 641, in generate_inner
    cont = self.model.generate(
           ^^^^^^^^^^^^^^^^^^^^
  File "/opt/conda/lib/python3.11/site-packages/torch/utils/_contextlib.py", line 115, in decorate_context
    return func(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^
  File "/opt/conda/lib/python3.11/site-packages/transformers/generation/utils.py", line 2223, in generate
    result = self._sample(
             ^^^^^^^^^^^^^
  File "/opt/conda/lib/python3.11/site-packages/transformers/generation/utils.py", line 3211, in _sample
    outputs = self(**model_inputs, return_dict=True)
              ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/opt/conda/lib/python3.11/site-packages/torch/nn/modules/module.py", line 1511, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/opt/conda/lib/python3.11/site-packages/torch/nn/modules/module.py", line 1520, in _call_impl
    return forward_call(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/opt/conda/lib/python3.11/site-packages/accelerate/hooks.py", line 176, in new_forward
    output = module._old_forward(*args, **kwargs)
             ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/workspaces/VLMEvalKit/vita/model/language_model/vita_mixtral.py", line 278, in forward
    return super().forward(
           ^^^^^^^^^^^^^^^^
  File "/workspaces/VLMEvalKit/vita/model/language_model/vita_mixtral.py", line 158, in custom_forward
    outputs = self.model(
              ^^^^^^^^^^^
  File "/opt/conda/lib/python3.11/site-packages/torch/nn/modules/module.py", line 1511, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/opt/conda/lib/python3.11/site-packages/torch/nn/modules/module.py", line 1520, in _call_impl
    return forward_call(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/opt/conda/lib/python3.11/site-packages/transformers/models/mixtral/modeling_mixtral.py", line 670, in forward
    position_embeddings = self.rotary_emb(hidden_states, position_ids)
                          ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/opt/conda/lib/python3.11/site-packages/torch/nn/modules/module.py", line 1511, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/opt/conda/lib/python3.11/site-packages/torch/nn/modules/module.py", line 1520, in _call_impl
    return forward_call(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/opt/conda/lib/python3.11/site-packages/torch/utils/_contextlib.py", line 115, in decorate_context
    return func(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^
  File "/opt/conda/lib/python3.11/site-packages/transformers/models/mixtral/modeling_mixtral.py", line 444, in forward
    freqs = (inv_freq_expanded.float() @ position_ids_expanded.float()).transpose(1, 2)
             ~~~~~~~~~~~~~~~~~~~~~~~~~~^~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
RuntimeError: Expected all tensors to be on the same device, but found at least two devices, cpu and cuda:0! (when checking argument for argument mat2 in method wrapper_CUDA_bmm)
[2025-02-24 09:43:14] ERROR - run.py: main - 428: Model VITA x Dataset EVU combination failed: Expected all tensors to be on the same device, but found at least two devices, cpu and cuda:0! (when checking argument for argument mat2 in method wrapper_CUDA_bmm), skipping this combination.
Traceback (most recent call last):
  File "/workspaces/VLMEvalKit/run.py", line 294, in main
    model = infer_data_job_video(
            ^^^^^^^^^^^^^^^^^^^^^
  File "/workspaces/VLMEvalKit/vlmeval/inference_video.py", line 157, in infer_data_job_video
    model = infer_data(
            ^^^^^^^^^^^
  File "/workspaces/VLMEvalKit/vlmeval/inference_video.py", line 122, in infer_data
    response = model.generate(message=struct, dataset=dataset_name)
               ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/workspaces/VLMEvalKit/vlmeval/vlm/base.py", line 117, in generate
    return self.generate_inner(message, dataset)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/workspaces/VLMEvalKit/vlmeval/vlm/vita.py", line 641, in generate_inner
    cont = self.model.generate(
           ^^^^^^^^^^^^^^^^^^^^
  File "/opt/conda/lib/python3.11/site-packages/torch/utils/_contextlib.py", line 115, in decorate_context
    return func(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^
  File "/opt/conda/lib/python3.11/site-packages/transformers/generation/utils.py", line 2223, in generate
    result = self._sample(
             ^^^^^^^^^^^^^
  File "/opt/conda/lib/python3.11/site-packages/transformers/generation/utils.py", line 3211, in _sample
    outputs = self(**model_inputs, return_dict=True)
              ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/opt/conda/lib/python3.11/site-packages/torch/nn/modules/module.py", line 1511, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/opt/conda/lib/python3.11/site-packages/torch/nn/modules/module.py", line 1520, in _call_impl
    return forward_call(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/opt/conda/lib/python3.11/site-packages/accelerate/hooks.py", line 176, in new_forward
    output = module._old_forward(*args, **kwargs)
             ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/workspaces/VLMEvalKit/vita/model/language_model/vita_mixtral.py", line 278, in forward
    return super().forward(
           ^^^^^^^^^^^^^^^^
  File "/workspaces/VLMEvalKit/vita/model/language_model/vita_mixtral.py", line 158, in custom_forward
    outputs = self.model(
              ^^^^^^^^^^^
  File "/opt/conda/lib/python3.11/site-packages/torch/nn/modules/module.py", line 1511, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/opt/conda/lib/python3.11/site-packages/torch/nn/modules/module.py", line 1520, in _call_impl
    return forward_call(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/opt/conda/lib/python3.11/site-packages/transformers/models/mixtral/modeling_mixtral.py", line 670, in forward
    position_embeddings = self.rotary_emb(hidden_states, position_ids)
                          ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/opt/conda/lib/python3.11/site-packages/torch/nn/modules/module.py", line 1511, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/opt/conda/lib/python3.11/site-packages/torch/nn/modules/module.py", line 1520, in _call_impl
    return forward_call(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/opt/conda/lib/python3.11/site-packages/torch/utils/_contextlib.py", line 115, in decorate_context
    return func(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^
  File "/opt/conda/lib/python3.11/site-packages/transformers/models/mixtral/modeling_mixtral.py", line 444, in forward
    freqs = (inv_freq_expanded.float() @ position_ids_expanded.float()).transpose(1, 2)
             ~~~~~~~~~~~~~~~~~~~~~~~~~~^~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
RuntimeError: Expected all tensors to be on the same device, but found at least two devices, cpu and cuda:0! (when checking argument for argument mat2 in method wrapper_CUDA_bmm)

Espere-1119-Song avatar Feb 24 '25 09:02 Espere-1119-Song

We recommend that you try using the original device_map in combination with the --nproc_per_node=1 flag to verify whether this resolves the issue. Set CUDA_VISIBLE_DEVICES to 0,1.

PhoenixZ810 avatar Feb 24 '25 09:02 PhoenixZ810

Thanks for your patience. I have used the original device_map and ran the following command:

export CUDA_LAUNCH_BLOCKING=1
export CUDA_VISIBLE_DEVICES=0,1

AUTO_SPLIT=1 torchrun --nproc-per-node=1 run.py \
    --config scripts/eval_config/VITA.json \
    --verbose 

But the same error still exists.

Espere-1119-Song avatar Feb 24 '25 10:02 Espere-1119-Song

@lxysl We’ve encountered some issues related to the VITA-Qwen model. Could you please assist us in resolving them?

PhoenixZ810 avatar Feb 25 '25 03:02 PhoenixZ810

I seem to have encountered this issue before, but at that time, I was working with vita-1.5, as I am not a developer for vita-1.0.

The problem likely lies in this line of code: https://github.com/VITA-MLLM/VITA/blob/8310b38aa909748368774bd88c7fa6ee26d02f4b/vita/model/builder.py#L292.

The vision_tower in the device_map is on cuda:1, so transformers automatically moves the images to cuda:1. However, on line 292, vision_tower.to(device) is called again, where device is cuda:0. Therefore, changing

vision_tower.to(device=device, dtype=torch.float16)

to

vision_tower.to(dtype=torch.float16)

should resolve the issue. Please give it a try.
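If it helps with debugging, the placement that accelerate actually used can be inspected after loading: transformers populates hf_device_map whenever a device_map is passed. The snippet below is only a diagnostic sketch, not part of the VITA loader, and the image/dtype handling at the end is an assumption about how the inputs are prepared:

# Diagnostic sketch: print where each mapped module actually landed.
# `model` is assumed to be the object returned by load_pretrained_model.
for module_name, device in getattr(model, "hf_device_map", {}).items():
    print(f"{module_name:30s} -> {device}")

# The vision tower's device can then be matched explicitly if needed, e.g.:
# vt_device = model.hf_device_map.get("model.vision_tower", "cuda:0")
# images = images.to(vt_device, dtype=torch.float16)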

lxysl avatar Feb 25 '25 03:02 lxysl

Thanks for your help. I installed VITA from the repo you provided, but it doesn't solve the problem:

[2025-02-25 08:58:56] ERROR - run.py: main - 428: Model VITA x Dataset EVU combination failed: Expected all tensors to be on the same device, but found at least two devices, cpu and cuda:0! (when checking argument for argument mat2 in method wrapper_CUDA_bmm), skipping this combination.
Traceback (most recent call last):
  File "/workspaces/VLMEvalKit/run.py", line 294, in main
    model = infer_data_job_video(
            ^^^^^^^^^^^^^^^^^^^^^
  File "/workspaces/VLMEvalKit/vlmeval/inference_video.py", line 157, in infer_data_job_video
    model = infer_data(
            ^^^^^^^^^^^
  File "/workspaces/VLMEvalKit/vlmeval/inference_video.py", line 122, in infer_data
    response = model.generate(message=struct, dataset=dataset_name)
               ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/workspaces/VLMEvalKit/vlmeval/vlm/base.py", line 117, in generate
    return self.generate_inner(message, dataset)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/workspaces/VLMEvalKit/vlmeval/vlm/vita.py", line 641, in generate_inner
    cont = self.model.generate(
           ^^^^^^^^^^^^^^^^^^^^
  File "/opt/conda/lib/python3.11/site-packages/torch/utils/_contextlib.py", line 115, in decorate_context
    return func(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^
  File "/opt/conda/lib/python3.11/site-packages/transformers/generation/utils.py", line 2223, in generate
    result = self._sample(
             ^^^^^^^^^^^^^
  File "/opt/conda/lib/python3.11/site-packages/transformers/generation/utils.py", line 3211, in _sample
    outputs = self(**model_inputs, return_dict=True)
              ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/opt/conda/lib/python3.11/site-packages/torch/nn/modules/module.py", line 1511, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/opt/conda/lib/python3.11/site-packages/torch/nn/modules/module.py", line 1520, in _call_impl
    return forward_call(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/opt/conda/lib/python3.11/site-packages/accelerate/hooks.py", line 176, in new_forward
    output = module._old_forward(*args, **kwargs)
             ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/workspaces/VLMEvalKit/vita/model/language_model/vita_mixtral.py", line 286, in forward
    return super().forward(
           ^^^^^^^^^^^^^^^^
  File "/workspaces/VLMEvalKit/vita/model/language_model/vita_mixtral.py", line 158, in custom_forward
    outputs = self.model(
              ^^^^^^^^^^^
  File "/opt/conda/lib/python3.11/site-packages/torch/nn/modules/module.py", line 1511, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/opt/conda/lib/python3.11/site-packages/torch/nn/modules/module.py", line 1520, in _call_impl
    return forward_call(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/opt/conda/lib/python3.11/site-packages/transformers/models/mixtral/modeling_mixtral.py", line 670, in forward
    position_embeddings = self.rotary_emb(hidden_states, position_ids)
                          ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/opt/conda/lib/python3.11/site-packages/torch/nn/modules/module.py", line 1511, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/opt/conda/lib/python3.11/site-packages/torch/nn/modules/module.py", line 1520, in _call_impl
    return forward_call(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/opt/conda/lib/python3.11/site-packages/torch/utils/_contextlib.py", line 115, in decorate_context
    return func(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^
  File "/opt/conda/lib/python3.11/site-packages/transformers/models/mixtral/modeling_mixtral.py", line 444, in forward
    freqs = (inv_freq_expanded.float() @ position_ids_expanded.float()).transpose(1, 2)
             ~~~~~~~~~~~~~~~~~~~~~~~~~~^~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
RuntimeError: Expected all tensors to be on the same device, but found at least two devices, cpu and cuda:0! (when checking argument for argument mat2 in method wrapper_CUDA_bmm)

I will try to evaluate with the original VITA repo instead. Thanks for all your help!

Espere-1119-Song avatar Feb 25 '25 09:02 Espere-1119-Song

I am curious why the devices are cpu and cuda:0 instead of cuda:0 and cuda:1. Can you just give this command a try?

CUDA_VISIBLE_DEVICES=0,1 python run.py --data MMBench_TEST_EN_V11 MMBench_TEST_CN_V11 MMStar MMMU_DEV_VAL MathVista_MINI HallusionBench AI2D_TEST OCRBench MMVet MME --model vita --verbose

By the way, the path of the vita-1.0 repository (VITA_ROOT) should be the same as for vita-1.5, which means you don't need to use git checkout to switch branches.
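For what it's worth, one quick way to rule out CPU-resident inputs is to move every tensor handed to generate onto the model's first device right before the call. This is only a debugging sketch around the call site in vlmeval/vlm/vita.py; the variable names (input_ids, attention_mask, images) are assumptions, not taken from the actual code:

import torch

# Debugging sketch: keep all inputs on the device that holds the embedding
# layer (cuda:0 with the default device_map) before self.model.generate(...).
first_device = next(model.parameters()).device
input_ids = input_ids.to(first_device)
attention_mask = attention_mask.to(first_device)
images = [img.to(first_device, dtype=torch.float16) for img in images]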

lxysl avatar Feb 25 '25 09:02 lxysl

Sure, I use the same repo as vita-1.5:

Traceback (most recent call last):
  File "/workspaces/VLMEvalKit/run.py", line 312, in main
    model = infer_data_job(
            ^^^^^^^^^^^^^^^
  File "/workspaces/VLMEvalKit/vlmeval/inference.py", line 165, in infer_data_job
    model = infer_data(
            ^^^^^^^^^^^
  File "/workspaces/VLMEvalKit/vlmeval/inference.py", line 99, in infer_data
    model = supported_VLM[model_name]() if isinstance(model, str) else model
            ^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/workspaces/VLMEvalKit/vlmeval/vlm/vita.py", line 392, in __init__
    tokenizer, model, image_processor, _ = load_pretrained_model(
                                           ^^^^^^^^^^^^^^^^^^^^^^
  File "/workspaces/VLMEvalKit/vita/model/builder.py", line 233, in load_pretrained_model
    tokenizer = AutoTokenizer.from_pretrained(model_path, use_fast=True)
                ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/opt/conda/lib/python3.11/site-packages/transformers/models/auto/tokenization_auto.py", line 901, in from_pretrained
    config = AutoConfig.from_pretrained(
             ^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/opt/conda/lib/python3.11/site-packages/transformers/models/auto/configuration_auto.py", line 1112, in from_pretrained
    raise ValueError(
ValueError: Unrecognized model in VITA-MLLM/VITA. Should have a `model_type` key in its config.json, or contain one of the following strings in its name: albert, align, altclip, aria, aria_text, audio-spectrogram-transformer, autoformer, bamba, bark, bart, beit, bert, bert-generation, big_bird, bigbird_pegasus, biogpt, bit, blenderbot, blenderbot-small, blip, blip-2, bloom, bridgetower, bros, camembert, canine, chameleon, chinese_clip, chinese_clip_vision_model, clap, clip, clip_text_model, clip_vision_model, clipseg, clvp, code_llama, codegen, cohere, cohere2, colpali, conditional_detr, convbert, convnext, convnextv2, cpmant, ctrl, cvt, dab-detr, dac, data2vec-audio, data2vec-text, data2vec-vision, dbrx, deberta, deberta-v2, decision_transformer, deformable_detr, deit, depth_anything, depth_pro, deta, detr, diffllama, dinat, dinov2, dinov2_with_registers, distilbert, donut-swin, dpr, dpt, efficientformer, efficientnet, electra, emu3, encodec, encoder-decoder, ernie, ernie_m, esm, falcon, falcon_mamba, fastspeech2_conformer, flaubert, flava, fnet, focalnet, fsmt, funnel, fuyu, gemma, gemma2, git, glm, glpn, got_ocr2, gpt-sw3, gpt2, gpt_bigcode, gpt_neo, gpt_neox, gpt_neox_japanese, gptj, gptsan-japanese, granite, granitemoe, granitemoeshared, granitevision, graphormer, grounding-dino, groupvit, helium, hiera, hubert, ibert, idefics, idefics2, idefics3, idefics3_vision, ijepa, imagegpt, informer, instructblip, instructblipvideo, jamba, jetmoe, jukebox, kosmos-2, layoutlm, layoutlmv2, layoutlmv3, led, levit, lilt, llama, llava, llava_next, llava_next_video, llava_onevision, longformer, longt5, luke, lxmert, m2m_100, mamba, mamba2, marian, markuplm, mask2former, maskformer, maskformer-swin, mbart, mctct, mega, megatron-bert, mgp-str, mimi, mistral, mixtral, mllama, mobilebert, mobilenet_v1, mobilenet_v2, mobilevit, mobilevitv2, modernbert, moonshine, moshi, mpnet, mpt, mra, mt5, musicgen, musicgen_melody, mvp, nat, nemotron, nezha, nllb-moe, nougat, nystromformer, olmo, olmo2, olmoe, omdet-turbo, oneformer, open-llama, openai-gpt, opt, owlv2, owlvit, paligemma, patchtsmixer, patchtst, pegasus, pegasus_x, perceiver, persimmon, phi, phi3, phimoe, pix2struct, pixtral, plbart, poolformer, pop2piano, prophetnet, pvt, pvt_v2, qdqbert, qwen2, qwen2_5_vl, qwen2_audio, qwen2_audio_encoder, qwen2_moe, qwen2_vl, rag, realm, recurrent_gemma, reformer, regnet, rembert, resnet, retribert, roberta, roberta-prelayernorm, roc_bert, roformer, rt_detr, rt_detr_resnet, rt_detr_v2, rwkv, sam, seamless_m4t, seamless_m4t_v2, segformer, seggpt, sew, sew-d, siglip, siglip_vision_model, speech-encoder-decoder, speech_to_text, speech_to_text_2, speecht5, splinter, squeezebert, stablelm, starcoder2, superglue, superpoint, swiftformer, swin, swin2sr, swinv2, switch_transformers, t5, table-transformer, tapas, textnet, time_series_transformer, timesformer, timm_backbone, timm_wrapper, trajectory_transformer, transfo-xl, trocr, tvlt, tvp, udop, umt5, unispeech, unispeech-sat, univnet, upernet, van, video_llava, videomae, vilt, vipllava, vision-encoder-decoder, vision-text-dual-encoder, visual_bert, vit, vit_hybrid, vit_mae, vit_msn, vitdet, vitmatte, vitpose, vitpose_backbone, vits, vivit, wav2vec2, wav2vec2-bert, wav2vec2-conformer, wavlm, whisper, xclip, xglm, xlm, xlm-prophetnet, xlm-roberta, xlm-roberta-xl, xlnet, xmod, yolos, yoso, zamba, zamba2, zoedepth, vita-mixtral, vita-Mistral, vita-Qwen2, vita-fo-Qwen2

Espere-1119-Song avatar Feb 25 '25 11:02 Espere-1119-Song

I checked and confirmed that in the config.json file of the vita-1.0 repository on Hugging Face, there is a model_type param: https://huggingface.co/VITA-MLLM/VITA/blob/main/VITA_ckpt/config.json
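If you want to double-check locally, a small sketch like the one below can confirm what that config.json actually contains; it assumes the checkpoint lives in the VITA_ckpt subfolder of the Hub repo, matching the link above:

import json
from huggingface_hub import hf_hub_download

# Sketch: download only the checkpoint's config.json and print its model_type.
cfg_path = hf_hub_download("VITA-MLLM/VITA", "VITA_ckpt/config.json")
with open(cfg_path) as f:
    print(json.load(f).get("model_type"))  # should print the model_type mentioned above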

lxysl avatar Feb 25 '25 12:02 lxysl