MPP-LLaVA
Video data loading problem during fine-tuning
During stage-2 fine-tuning, the training log prints messages indicating that some videos are problematic:
```
Error loading data at index 23971: Video not found: /dev/shm/vlm/MiniGPT4Qwen/cache/dataset/videochatgpt/activitynet_videos/v_IrTqW6Qn8mI.mp4
Error loading data at index 71068: Video not found: /dev/shm/vlm/MiniGPT4Qwen/cache/dataset/videochatgpt/activitynet_videos/v_aV5DMcsNMmk.mp4
Error loading data at index 51648: Video not found: /dev/shm/vlm/MiniGPT4Qwen/cache/dataset/videochatgpt/activitynet_videos/v_MlbM7Mew0Ys.mp4
Error loading data at index 80768: 'video'
Error loading data at index 29235: Video not found: /dev/shm/vlm/MiniGPT4Qwen/cache/dataset/videochatgpt/activitynet_videos/v_AA1wvSZ4Mno.mp4
Error loading data at index 81059: 'video'
Error loading data at index 99616: 'video'
Error loading data at index 80812: 'video'
Error loading data at index 91812: 'video'
Error loading data at index 20455: Video not found: /dev/shm/vlm/MiniGPT4Qwen/cache/dataset/videochatgpt/activitynet_videos/v_J_SD_hhGET8.mp4
......
```
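The log shows two different failure modes: entries reported as `Video not found: <path>`, and entries reported only as `'video'`, which looks like a `KeyError` for an annotation entry without a `video` field. As a first pass, it can help to scan the annotation file against the filesystem to see whether any of these paths are genuinely absent on disk. Below is a rough sketch (not code from the repo): the annotation filename and the assumption that it is a JSON list of dicts with a `video` field are guesses based on the messages above and may need adjusting.

```python
# Rough sketch: separate "annotation entry without a 'video' field" from
# "video file missing on disk". The annotation path, video root and field
# name are assumptions based on the error messages, not taken from the repo.
import json
import os

ANN_PATH = "/dev/shm/vlm/MiniGPT4Qwen/cache/dataset/videochatgpt/videochatgpt_annotation.json"  # placeholder
VIDEO_ROOT = "/dev/shm/vlm/MiniGPT4Qwen/cache/dataset/videochatgpt/activitynet_videos"

with open(ANN_PATH, "r") as f:
    annotations = json.load(f)

no_video_key, missing_file = [], []
for idx, ann in enumerate(annotations):
    if "video" not in ann:          # would surface as: Error loading data at index ...: 'video'
        no_video_key.append(idx)
        continue
    path = os.path.join(VIDEO_ROOT, ann["video"])
    if not os.path.exists(path):    # file genuinely absent on disk
        missing_file.append(path)

print(f"entries without a 'video' field: {len(no_video_key)}")
print(f"entries whose video file is missing on disk: {len(missing_file)}")
```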
Investigation shows that, for some videos, no frames are extracted: `ret` at the line below returns False (only a subset of videos return False, the rest return True normally, and the videos whose paths return False do exist in the dataset):

https://github.com/Coobiw/MPP-LLaVA/blob/cfd419c3a156f747fe25871e6a1eeb4beeb9fe0c/lavis/datasets/datasets/video_instructions.py#L43

which then triggers the message printed here:

https://github.com/Coobiw/MPP-LLaVA/blob/cfd419c3a156f747fe25871e6a1eeb4beeb9fe0c/lavis/datasets/datasets/video_instructions.py#L55

So no frames are being captured from those videos, yet when I download one of the reported videos and check it, the file itself plays fine. I can't tell where the problem is.
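It may also be worth reproducing the frame-extraction step outside the dataloader, on the training machine, against the exact copy of the file under /dev/shm. The sketch below assumes the `ret` flag at the linked line comes from `cv2.VideoCapture.read()` (the variable name suggests OpenCV; adapt it if the repo reads videos with decord); the example path is one of the reported files.

```python
# Rough sketch: try to open and read one of the failing videos directly,
# assuming an OpenCV-style read loop like the dataset code appears to use.
import cv2

path = "/dev/shm/vlm/MiniGPT4Qwen/cache/dataset/videochatgpt/activitynet_videos/v_IrTqW6Qn8mI.mp4"

cap = cv2.VideoCapture(path)
print("opened:", cap.isOpened())                          # False usually means a container/codec problem
print("frame count:", cap.get(cv2.CAP_PROP_FRAME_COUNT))
print("fps:", cap.get(cv2.CAP_PROP_FPS))

ret, frame = cap.read()
print("first read ok:", ret)                              # mirrors the ret check in the dataset code
cap.release()
```

If `isOpened()` returns False or the frame count is 0 here while the same file plays fine after downloading, the copy under /dev/shm may be truncated (it is a RAM-backed tmpfs, so an interrupted copy or memory pressure can leave incomplete files), or the OpenCV build in the training environment may lack the codec needed for those particular videos.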
The config file sft.yaml is attached below:
```yaml
model:
  arch: minigpt4qwen
  model_type: qwen7b_chat

  load_finetuned: True
  load_pretrained: True

  # pretrained: "https://storage.googleapis.com/sfr-vision-language-research/LAVIS/models/InstructBLIP/blip2_pretrained_flant5xxl.pth"
  pretrained: "ckpt/blip2/blip2_pretrained_flant5xxl.pth"
  finetuned: "/dev/shm/vlm/MiniGPT4Qwen/lavis/output/pp_7b_video/pretrain/global_step295/model.pth"

  # vit encoder
  vit_model: "eva_clip_g"
  image_size: 224
  drop_path_rate: 0
  use_grad_checkpoint: True
  vit_precision: "fp16" # change to fp32 if you unfreeze the ViT for training; otherwise AMP mixed-precision training fails (error in the scaler, since no fp16 AdamW is implemented)
  freeze_vit: True
  unfreeze_pos_embed: False

  # Q-Former
  num_query_token: 32
  qformer_text_input: False
  freeze_qformer: True
  freeze_queries: True

  # projection
  freeze_proj: False

  # path to Vicuna checkpoint
  llm_model: "/dev/shm/vlm/MiniGPT4Qwen/cache/ckpt/Qwen-7B-Chat"

  # unfreeze LLM for better chat
  freeze_llm: False

  # lora config
  get_lora: False
  lora_alpha: 32
  lora_r: 8
  lora_dropout: 0.05

  # text length when training
  max_txt_len: 1536 # 512

  # enable autocast of vit
  enable_autocast: False
datasets:
  llava_instruct_156k: # name of the dataset builder
    vis_processor:
      train:
        name: "blip2_image_train"
        image_size: 224
    text_processor:
      train:
        name: "base_instruction"
        max_words: 200
  videochatgpt_100k: # name of the dataset builder
    vis_processor:
      train:
        name: "blip2_image_train"
        image_size: 224
    text_processor:
      train:
        name: "base_instruction"
        max_words: 200
run:
  output_dir: "lavis/output/pp_7b_video/sft_video/"

  task: deepspeed_image_text_pretrain
  num_workers: 4
  seed: 42

  world_size: 1
  dist_url: "env://"
  distributed: True

  max_epoch: 1
  log_freq: 10

  lr_sched: "linear_warmup_cosine_lr_step-wise"
  warmup_lr: 0
  init_lr: 2e-5
  min_lr: 0
  warmup_ratio: 0.1

  deepspeed_config:
    # global batch = 128 = n_ranks * grad_acc_steps * micro_batch_size = (4//2) * 64 * 1
    # 8 x 3090
    # pp=8 dp=1 nproc=pp*dp=8
    gradient_accumulation_steps: 128 # 128 // dp(=1) // bs_per_gpu(=1) = 128
    train_micro_batch_size_per_gpu: 1
    gradient_clipping: 1.
    steps_per_print: 10
    wall_clock_breakdown: false
    dump_state: False

    fp16:
      enabled: false
      loss_scale: 0
      loss_scale_window: 1000
      initial_scale_power: 16
      hysteresis: 2
      min_loss_scale: 1

    bf16:
      enabled: true

    optimizer:
      type: "AdamW"
      params:
        lr: 2e-5
        betas: [0.9, 0.99]
        eps: 1e-7
        weight_decay: 0.

    zero_optimization:
      stage: 0
      # offload_optimizer:
      #   device: "cpu"
      #   pin_memory: true
      allgather_partitions: true
      allgather_bucket_size: 2e8
      overlap_comm: true
      reduce_scatter: true
      reduce_bucket_size: 2e8
      contiguous_gradients: true
```