
Not able to reproduce the stage two models

Open formanuscriptsharing opened this issue 3 months ago • 0 comments

During model loading, the checkpoint weights and vocab size seem to be wrong, a problem others have also mentioned. The output I get and the code I used to generate it are below. I also tried CLIP and other models, and the results seem pretty bad when transferring to other datasets.

text: A man in a gray sweater plays fetch with his dog in the snowy yard, throwing a toy and watching it run. ~ prob: 0.6796
text: A man in a gray hat and coat walks through the snowy yard, carefully navigating around the trees. ~ prob: 0.0944
text: A person dressed in a blue jacket shovels the snow-covered pavement outside their house. ~ prob: 0.0754
text: A person stands on the snowy floor, pushing a sled loaded with blankets, preparing for a fun-filled ride. ~ prob: 0.0375
text: A playful dog slides down a snowy hill, wagging its tail with delight. ~ prob: 0.0288

Looking for your guidance.

```python
import numpy as np
import os
import io
import cv2

# Make CUDA kernel launches synchronous so errors point at the failing call.
os.environ['CUDA_LAUNCH_BLOCKING'] = '1'
import torch

from demo_config import (Config, eval_dict_leaf)
from demo.utils import (retrieve_text, _frame_from_video, setup_internvideo2)

# Fix the seeds so runs are repeatable.
seed = 4491734
print("Seed:", seed)

np.random.seed(seed)
torch.manual_seed(seed)
if torch.cuda.is_available():
    torch.cuda.manual_seed_all(seed)

# Read all frames from the demo video.
video = cv2.VideoCapture('demo/example1.mp4')
frames = [x for x in _frame_from_video(video)]

text_candidates = [
    "A playful dog and its owner wrestle in the snowy yard, chasing each other with joyous abandon.",
    "A man in a gray coat walks through the snowy landscape, pulling a sleigh loaded with toys.",
    "A person dressed in a blue jacket shovels the snow-covered pavement outside their house.",
    "A pet dog excitedly runs through the snowy yard, chasing a toy thrown by its owner.",
    "A person stands on the snowy floor, pushing a sled loaded with blankets, preparing for a fun-filled ride.",
    "A man in a gray hat and coat walks through the snowy yard, carefully navigating around the trees.",
    "A playful dog slides down a snowy hill, wagging its tail with delight.",
    "A person in a blue jacket walks their pet on a leash, enjoying a peaceful winter walk among the trees.",
    "A man in a gray sweater plays fetch with his dog in the snowy yard, throwing a toy and watching it run.",
    "A person bundled up in a blanket walks through the snowy landscape, enjoying the serene winter scenery.",
]

# %%
config = Config.from_file('demo/internvideo2_stage2_config.py')
config = eval_dict_leaf(config)

# %%
# Must be a plain string: a trailing comma after the literal would make it a 1-tuple.
config['pretrained_path'] = '/InternVideo/InternVideo2/multi_modality/weights/InternVideo2-stage2_1b-224p-f4.pt'

intern_model, tokenizer = setup_internvideo2(config)

# %%
# Rank the candidate captions against the video and print the top 5.
texts, probs = retrieve_text(frames, text_candidates, model=intern_model.eval(), topk=5, config=config)

for t, p in zip(texts, probs):
    print(f'text: {t} ~ prob: {p:.4f}')
```
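One way to narrow down the weight and vocab size mismatch could be to compare parameter shapes in the checkpoint against the shapes the model expects. The following is only a minimal sketch, not InternVideo2's own loading code: `diff_shapes` is a hypothetical helper, and the parameter names and sizes in the toy dictionaries are made up for illustration. With PyTorch, the model side could be built as `{k: tuple(v.shape) for k, v in intern_model.state_dict().items()}`, and the checkpoint side in the same way from the dictionary returned by `torch.load(path, map_location='cpu')`, assuming the weights are stored at the top level or under a known key.

```python
from typing import Dict, List, Tuple

Shape = Tuple[int, ...]


def diff_shapes(
    ckpt: Dict[str, Shape], model: Dict[str, Shape]
) -> Tuple[List[str], List[str], List[Tuple[str, Shape, Shape]]]:
    """Compare parameter shapes from a checkpoint with those a model expects.

    Returns (missing, unexpected, mismatched):
      missing    - names the model has but the checkpoint lacks
      unexpected - names the checkpoint has but the model lacks
      mismatched - (name, checkpoint_shape, model_shape) where shapes differ
    """
    missing = sorted(k for k in model if k not in ckpt)
    unexpected = sorted(k for k in ckpt if k not in model)
    mismatched = sorted(
        (k, ckpt[k], model[k])
        for k in ckpt
        if k in model and ckpt[k] != model[k]
    )
    return missing, unexpected, mismatched


# Toy shapes standing in for real state dicts (hypothetical names and sizes).
ckpt_shapes = {
    "text_encoder.embeddings.weight": (30522, 768),
    "vision_encoder.proj.weight": (768, 1408),
    "extra.head.bias": (10,),
}
model_shapes = {
    "text_encoder.embeddings.weight": (30523, 768),
    "vision_encoder.proj.weight": (768, 1408),
    "temp": (1,),
}

missing, unexpected, mismatched = diff_shapes(ckpt_shapes, model_shapes)
print("missing:", missing)
print("unexpected:", unexpected)
for name, ckpt_shape, model_shape in mismatched:
    print(f"shape mismatch: {name} checkpoint={ckpt_shape} model={model_shape}")
```

A differing first dimension on an embedding matrix, as in the toy example, would point to a tokenizer or vocab size that does not match the one the checkpoint was trained with.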

formanuscriptsharing, Sep 14 '25 04:09