
[Help requested] Inference with the InternVideo2_clip model.

gracikk-ds opened this issue 1 year ago • 40 comments

Hello InternVideo team,

You guys have done a great job with this project!

In your paper, you use the Stage 2 model for temporal grounding on QVHighlights [Lei et al., 2021] and Charades-STA [Gao et al., 2017]. I have a question: why not use the CLIP version for this purpose?

As you mentioned in one of the issues I saw, the CLIP one is fine-tuned from Stage 2 to support more applications (with the powerful InternVL text encoder).

Am I correct in understanding that you kept the video encoder model unchanged, and the BERT-L was replaced with another text encoder? If so, where can I obtain the weights for this encoder?

In the evaluation script you use "your_model_path/internvl/internvl_c_13b_224px.pth", but there is no such model in the InternVL repository.

@Andy1621

gracikk-ds avatar May 23 '24 13:05 gracikk-ds

Hi! The internvl_c_13b_224px can be found here. As for the previous question, I will let the co-author who is responsible for the grounding tasks answer.

Andy1621 avatar May 23 '24 14:05 Andy1621

@gracikk-ds Hello. We did use the InternVL text encoder with 7B parameters for the grounding tasks.

cg1177 avatar May 23 '24 15:05 cg1177

@cg1177,

Thank you for your response,

Could you share the metrics you obtained with this encoder? In the preprint, you provide the metrics for Stage 2:

Feature            | R1@0.5 | R1@0.7 | mAP   | mAP   | HiT@1
InternVideo2_s2-1B | 70.00  | 54.45  | 47.02 | 42.36 | 69.74

gracikk-ds avatar May 24 '24 07:05 gracikk-ds

@gracikk-ds Hello. We did use the InternVL text encoder with 7B parameters for the grounding tasks.

Hi, thank you for the wonderful work! So the results in the two subtables of Table 13 of the paper are actually obtained after fine-tuning from InternVideo2_clip? But why are the features listed as InternVideo2_s2-6B and InternVideo2_s2-1B in the table? Thank you for your guidance!

tiesanguaixia avatar May 27 '24 11:05 tiesanguaixia

By the way, could you please provide more details about how to use CG-DETR as the grounding head for the moment retrieval task?

tiesanguaixia avatar May 27 '24 13:05 tiesanguaixia

@tiesanguaixia Hello, we have released the extracted features here. You can download them and replace the original features used by CG-DETR with them. You may need to modify some code for loading features for training and inference. We will release the code soon.

cg1177 avatar May 27 '24 14:05 cg1177

@cg1177, could you please provide a direct link to chinese_alpaca_lora_7b? :)

Am I correct in understanding that to reproduce your results, I need to follow these steps:

  1. Download the checkpoints, namely:

    • InternVideo2-stage2_1b-224p-f4.pt
    • 1B_clip.pth
    • chinese_alpaca_lora_7b ???
    • internvl_c_13b_224px.pth
  2. Initialize the InternVideo2_CLIP class, passing it a config containing the paths to the checkpoints mentioned above, and additionally load 1B_clip.pth (rough sketch below).

Alternatively, can I use the same video model as in the demo, load the 1B_clip.pth weights into it, and just change the tokenizer and text model to LLaMA?
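To make step 2 concrete, here is a rough sketch of what I have in mind; the config keys, the import path, and the constructor signature below are my guesses, not the repo's actual API:

import torch

# Assumed import path -- adjust to wherever InternVideo2_CLIP lives in the repo
from models.internvideo2_clip import InternVideo2_CLIP

# Hypothetical config: the keys are guesses, the values are the checkpoints listed above
config = {
    "vision_ckpt_path": "ckpts/InternVideo2-stage2_1b-224p-f4.pt",
    "text_ckpt_path": "ckpts/internvl_c_13b_224px.pth",
    "tokenizer_path": "ckpts/chinese_alpaca_lora_7b",
}

model = InternVideo2_CLIP(config)                           # pass the config with checkpoint paths
state = torch.load("ckpts/1B_clip.pth", map_location="cpu")
model.load_state_dict(state, strict=False)                  # then load the 1B_clip.pth weights on top
model.eval()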

@tiesanguaixia @Andy1621

gracikk-ds avatar May 28 '24 14:05 gracikk-ds

@tiesanguaixia Hello, we have released the extracted features here. You can download them and replace the original features used by CG-DETR with them. You may need to modify some code for loading features for training and inference. We will release the code soon.

I've tried to train a CG-DETR-based model on the stage2_clip features that you have released and on stage2 features extracted by myself. [plot attached] The difference is huge. Have you experimented with stage2 features? Or did you make some changes in CG-DETR to make it perform better on the stage2_clip features?

@Andy1621 @cg1177

gracikk-ds avatar May 29 '24 07:05 gracikk-ds

@gracikk-ds Could you explain the plot?

cg1177 avatar May 29 '24 07:05 cg1177

These are the validation curves of the HIT@1 metric for CG-DETR-like models, computed on the validation dataset. You've reported the same metric in your paper, but for the test set. [plot attached] There are 3 curves:

  1. curve 42_new_feat_neg_pos - model trained on stage_2 features, random state=42
  2. curve 41_test_llama_features - model trained on stage_2_clip features provided by you, random state=41
  3. curve 42_test_llama_features - model trained on stage_2_clip features provided by you, random state=42.

The difference remains on other metrics as well; for example, these are the results for MR mAP. [plot attached] Training is not yet complete, but it is already evident that the results on stage_2 features are much better than the results on stage_2_clip.

My model is a bit more powerful than CG-DETR, but I want you to focus on the gap between stage2 and stage2_clip. @cg1177

gracikk-ds avatar May 29 '24 09:05 gracikk-ds

@tiesanguaixia Hello, we have released the extracted features here. You can download them and replace the original features used by CG-DETR with them. You may need to modify some code for loading features for training and inference. We will release the code soon.

Thanks a lot! Could you please share the code for how you extract the multi-modal features? I'd like to use the models to extract features from my own data ❤️

tiesanguaixia avatar May 29 '24 09:05 tiesanguaixia

@tiesanguaixia Hello, we have released the extracted features here. You can download them and replace the original features used by CG-DETR with them. You may need to modify some code for loading features for training and inference. We will release the code soon.

I've tried to train a CG-DETR-based model on the stage2_clip features that you have released and on stage2 features extracted by myself. [plot attached] The difference is huge. Have you experimented with stage2 features? Or did you make some changes in CG-DETR to make it perform better on the stage2_clip features?

@Andy1621 @cg1177

I have not experimented with this yet.

tiesanguaixia avatar May 29 '24 09:05 tiesanguaixia

@gracikk-ds I believe it is reasonable. When I began training for the grounding tasks, the stage_2 model was still being trained, so stage_2_clip's initialization weights did not come from the best video encoder. Moreover, the 7B text encoder was frozen while training stage_2_clip. Both factors make the stage_2_clip model not optimal, but still relatively strong. In contrast, the stage_2 model used more video-text data to train the BERT text encoder and the video encoder. I see you have tried using features extracted by the stage_2 model for the grounding tasks. Could you share your features? We can report the grounding performance of our CG-DETR with your features.

cg1177 avatar May 29 '24 12:05 cg1177

Stage2 features @cg1177, try checking these features. I'll wait for the results :)

gracikk-ds avatar May 29 '24 13:05 gracikk-ds

Hi! :)

Is it possible for you to release a small demo of how to run the BEATs model? I want to extract audio features too. Or could you give me a link to the audio checkpoint that you used during training of the stage2 model? Or maybe some useful tips besides this one: "The used audio encoder is a 12-layer transformer initialized with BEATs [Chen et al., 2023d] (90M). It takes in audio features, which are 64-dimensional log Mel filterbank spectrograms using a 25ms Hamming window, transformed from 10-second-long clips, padding with zeros"?

It would help me a lot :) Thank you!

@cg1177, @Andy1621

gracikk-ds avatar May 30 '24 16:05 gracikk-ds

Stage2 features @cg1177, try checking these features. I'll wait for the results :)

Hello, we've checked these features and here are the results: [results screenshot] We used the command bash cg_detr/scripts/train.sh. We simply downloaded the features you provided and replaced the original features used by CG-DETR with them. Do you need more details? @gracikk-ds

LarryLeeee avatar Jun 04 '24 02:06 LarryLeeee

@LarryLeeee, No thanks, I got what I wanted :)

The last question I'm wondering about is whether you used an audio modality to train stage2 or not. And which checkpoint should I take to extract audio features?

gracikk-ds avatar Jun 04 '24 04:06 gracikk-ds

@LarryLeeee, No thanks, I got what I wanted :)

The last question I'm wondering about is whether you used an audio modality to train stage2 or not. And which checkpoint should I take to extract audio features?

@gracikk-ds We did not use an audio modality, and you can refer to https://github.com/wjun0830/CGDETR for more details.

LarryLeeee avatar Jun 04 '24 10:06 LarryLeeee

@LarryLeeee, No thanks, I got what I wanted :) The last question I'm wondering about is whether you used an audio modality to train stage2 or not. And which checkpoint should I take to extract audio features?

@gracikk-ds We did not use an audio modality, and you can refer to https://github.com/wjun0830/CGDETR for more details.

@LarryLeeee, I meant InternVideo2 stage2.

We exploit the correspondence between video and audio, speech, and text to align InternVideo2 to semantics explicitly. In structure, though InternVideo2 has a huge video encoder, its employed audio and text encoders are relatively lightweight. The used audio encoder is a 12-layer transformer initialized with BEATs [Chen et al., 2023d] (90M). It takes in audio features, which are 64-dimensional log Mel filterbank spectrograms using a 25ms Hamming window, transformed from 10-second-long clips, padding with zeros. For the text and speech encoders, we initialize the text encoder and multimodal decoder using Bert-Large [Devlin et al., 2018]. Specifically, we utilize the initial 19 layers of Bert-Large as the text encoder, with the subsequent 5 layers equipped with cross-attention layers serving as the multimodal decoder.

Could you provide a link to the audio model checkpoint?

gracikk-ds avatar Jun 04 '24 13:06 gracikk-ds

@Andy1621, @cg1177, @LarryLeeee, hi! Any comments about the audio?

gracikk-ds avatar Jun 06 '24 16:06 gracikk-ds

@Andy1621, @cg1177, @LarryLeeee, hi! Any comments about the audio?

Hi, I would like to invite the co-author responsible for the audio to answer your questions; it will take some time to communicate.

cg1177 avatar Jun 06 '24 17:06 cg1177

@cg1177, we are limited in time; the conference submission deadline is approaching. Do you have a rough idea of how long it will take to communicate with the co-author? We need to pick the audio model this week.

gracikk-ds avatar Jun 12 '24 06:06 gracikk-ds

@cg1177, we are limited in time; the conference submission deadline is approaching. Do you have a rough idea of how long it will take to communicate with the co-author? We need to pick the audio model this week.

OK, I'll urge him at once.

cg1177 avatar Jun 12 '24 07:06 cg1177

@cg1177, we are limited in time; the conference submission deadline is approaching. Do you have a rough idea of how long it will take to communicate with the co-author? We need to pick the audio model this week.

Hello @gracikk-ds, sorry for the late reply! We only used the audio encoder to train the InternVideo2-6B model; the InternVideo2-1B model only contains video and text encoders. Since the 6B checkpoint is still not ready to be open-sourced, we can only provide the weights of the audio encoder of InternVideo2-6B; we wonder if that is acceptable. If it helps, we will provide the audio encoder checkpoint by tomorrow.

JustinYuu avatar Jun 12 '24 07:06 JustinYuu

Hi @JustinYuu, thanks for your reply!

It is better than nothing. We look forward to the checkpoint :) And if you could provide a simple demo of how to run the model, that would be perfect! In the case of the QVHighlights dataset, we have videos that are 2 seconds long; should we pad them to 10 seconds?

And is there any chance that the 6B model will be ready for open source by the end of this month?

Thank you!

gracikk-ds avatar Jun 13 '24 06:06 gracikk-ds

Hi @JustinYuu, thanks for your reply!

It is better than nothing. We look forward to the checkpoint :) And if you could provide a simple demo of how to run the model, that would be perfect! In the case of the QVHighlights dataset, we have videos that are 2 seconds long; should we pad them to 10 seconds?

And is there any chance that the 6B model will be ready for open source by the end of this month?

Thank you!

Hi @gracikk-ds, we have provided the audio encoder of InternVideo2-6B at the following link. You can use this model to extract audio features for your project. For the audio length, we pad audio sequences shorter than 10 seconds to 10 seconds during training, yet the audio sequences used for training are usually longer than 2 seconds, so I am not sure whether the padding strategy suits your training data. I suggest that you try both padding to 10 seconds and directly feeding the 2-second vanilla sequence into the model, to find out which option works better for your downstream scenarios. For the demo code, you could simply refer to the BEATs model since our audio encoder is highly similar to it. A simple example is as follows:

import torch
from BEATs import BEATs, BEATsConfig

# Build the BEATs architecture from the original config, then load the InternVideo2-6B audio encoder weights
checkpoint = torch.load('yourpath/audio_6b.pth')
raw_checkpoint = torch.load('yourpath/BEATs_iter3+.pt')
cfg = BEATsConfig(raw_checkpoint['cfg'])
audio_model = BEATs(cfg)
audio_model.load_state_dict(checkpoint)
audio_model.eval()
audio_model = audio_model.cuda()

# fbank: audio input prepared with BEATs-style preprocessing
representation = audio_model(fbank)
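
If you go with the padding option, a minimal sketch of zero-padding a 2-second clip to 10 seconds (assuming 16 kHz mono waveforms; the tensor shapes here are just placeholders) is:

import torch
import torch.nn.functional as F

sample_rate = 16000
target_len = 10 * sample_rate                   # 10 seconds at 16 kHz
waveform = torch.randn(1, 2 * sample_rate)      # stand-in for a 2-second mono clip

pad_amount = max(0, target_len - waveform.shape[-1])
padded = F.pad(waveform, (0, pad_amount))       # right-pad with zeros up to 10 seconds
assert padded.shape == (1, target_len)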

For the 6B models, we have not decided on the open-source date yet. We will inform you once our model is publicly available. We hope our model can help your research! :)

JustinYuu avatar Jun 14 '24 06:06 JustinYuu

Thanks a lot guys! :)

gracikk-ds avatar Jun 16 '24 07:06 gracikk-ds

@JustinYuu, one more question :)

Here is my way to prepare features for your audio model. Is it correct?

import torch
import torchaudio
from torch import Tensor

EPS = 1e-6  # small constant to avoid log(0); the exact value is my choice


def prepare_audio_features(audio_tensor: Tensor, sample_rate: int = 16000):
    """
    Prepare audio features by normalizing the input audio tensor and applying a Log Mel spectrogram.

    Args:
        audio_tensor (Tensor): The input tensor containing the raw audio waveform.
        sample_rate (int): The sampling rate of the audio tensor. Defaults to 16000 Hz.

    Returns:
        Tensor: A tensor representing the log Mel spectrogram of the input audio.
    """
    # Define the MelSpectrogram transform
    # it's not evident which values to use for 'win_length', 'n_fft', 'hop_length', 'n_mels' and 'window_fn'
    # In your paper: 
    # win_length=400 - Equivalent to 25ms window size at 16kHz
    # n_fft=??? It could be the next power of two from window length
    # n_fft = 2 ** math.ceil(math.log2(window_length_samples)) = 512
    # n_mels = 64, 
    # window_fn = hamming_window
    # hop_length=???. I'm using 200 as 0.5 overlap
    # But the BEATs article uses different parameter values.
    
    mel_spectrogram = torchaudio.transforms.MelSpectrogram(
        sample_rate=sample_rate,
        win_length=400, 
        n_fft=512,
        hop_length=200,
        n_mels=64,  # Number of Mel bands
        window_fn=torch.hamming_window,
    )

    # Apply the transform to get the Mel spectrogram
    mel_spectrogram = mel_spectrogram(audio_tensor)

    # Convert to log scale
    log_mel_spectrogram = torch.log(mel_spectrogram + EPS)  # Add a small value to avoid log(0)

    # Based on the BEATs paper the acoustic feature is normalized to the mean value of 0 and standard deviation of 0.5
    # But which values should I use for mean and std???
    log_mel_spectrogram = (log_mel_spectrogram - log_mel_spectrogram.mean()) / ( log_mel_spectrogram.std() * 2)

    return log_mel_spectrogram


waveform, sample_rate = torchaudio.load("your_audio_file.wav")

# Apply effects to get the desired sample rate and number of channels
waveform, sample_rate = torchaudio.sox_effects.apply_effects_tensor(
    waveform,
    sample_rate,
    effects=[["rate", "16000"], ["channels", "1"]],
)

fbank = prepare_audio_features(waveform, sample_rate)

And here is the original BEATs preprocessing step:

    def preprocess(
        self,
        source: torch.Tensor,
        fbank_mean: float = 15.41663,
        fbank_std: float = 6.55582,
    ) -> torch.Tensor:
        fbanks = []
        for waveform in source:
            waveform = waveform.unsqueeze(0) * 2**15
            fbank = ta_kaldi.fbank(waveform, num_mel_bins=128, sample_frequency=16000, frame_length=25, frame_shift=10)
            fbanks.append(fbank)
        fbank = torch.stack(fbanks, dim=0)
        fbank = (fbank - fbank_mean) / (2 * fbank_std)
        return fbank
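
For comparison, here is a BEATs-style variant adjusted to the paper's description (64 mel bins, 25 ms Hamming window); the 10 ms frame shift and the reuse of BEATs' normalization constants are my assumptions, not confirmed values:

import torch
import torchaudio.compliance.kaldi as ta_kaldi  # the same fbank implementation BEATs uses


def prepare_paper_style_fbank(
    waveform: torch.Tensor,  # shape (1, num_samples), 16 kHz mono
    fbank_mean: float = 15.41663,
    fbank_std: float = 6.55582,
) -> torch.Tensor:
    """64-bin log Mel filterbank with a 25 ms Hamming window; other parameters are guesses."""
    waveform = waveform * 2**15  # BEATs scales the waveform to the int16 range before fbank
    fbank = ta_kaldi.fbank(
        waveform,
        num_mel_bins=64,        # paper: 64-dimensional log Mel filterbank
        sample_frequency=16000,
        frame_length=25,        # paper: 25 ms window
        frame_shift=10,         # assumption: BEATs' default shift
        window_type="hamming",  # paper: Hamming window
    )
    return (fbank - fbank_mean) / (2 * fbank_std)  # assumption: BEATs' normalization constants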

Also, I've got a question about the model forward pass. Here is the forward pass of the BEATs model and my output shapes.

    def forward(
        self,
        source: torch.Tensor,
        padding_mask: Optional[torch.Tensor] = None,
        fbank_mean: float = 15.41663,
        fbank_std: float = 6.55582,
    ):
        """Forward pass for the BEATs model.

        Args:
            source (torch.Tensor): Input tensor.
            padding_mask (Optional[torch.Tensor]): Padding mask tensor. Defaults to None.
            fbank_mean (float): Mean value for feature normalization. Defaults to 15.41663.
            fbank_std (float): Standard deviation for feature normalization. Defaults to 6.55582.

        Returns:
            torch.Tensor: Model output tensor.
        """
        # source.shape = [32, 32000], i.e. 2-second clips at 16 kHz
        # preparing audio features with the original BEATs preprocess gives output shape [32, 198, 128]
        fbank = prepare_audio_features_old(source, fbank_mean=fbank_mean, fbank_std=fbank_std)
        # while my function above gives output shape [32, 64, 161]
        my_fbank = prepare_audio_features(source)

        if padding_mask is not None:
            padding_mask = self.forward_padding_mask(fbank, padding_mask)

        fbank = fbank.unsqueeze(1) # [32, 192, 128] -> [32, 1, 192, 128]
        features = self.patch_embedding(fbank)  # [32, 1, 192, 128] -> [32, 512, 12, 8]
        features = features.reshape(features.shape[0], features.shape[1], -1)  # [32, 512, 12, 8] -> [32, 512, 96]
        features = features.transpose(1, 2)  # [32, 512, 96]-> [32, 96, 512]
        features = self.layer_norm(features)

        if padding_mask is not None:
            padding_mask = self.forward_padding_mask(features, padding_mask)

        features = self.post_extract_proj(features)  # [32, 96, 512] -> [32, 96, 768]
        x = self.dropout_input(features)
        x, _ = self.encoder(x, padding_mask=padding_mask)  # [32, 96, 768] -> [32, 96, 768]
        return x, padding_mask

And here is the question: what should I do next with the embedding of shape [32, 96, 768] to get [32, 768]?
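
My current guess is to mean-pool over the temporal dimension (ignoring padded frames), but that is not confirmed by the authors. Something like:

import torch

x = torch.randn(32, 96, 768)                           # frame-level embeddings from the encoder
padding_mask = torch.zeros(32, 96, dtype=torch.bool)   # True where a frame is padding

# Masked mean pooling over the temporal dimension -> one 768-d vector per clip
valid = (~padding_mask).unsqueeze(-1).float()          # [32, 96, 1]
pooled = (x * valid).sum(dim=1) / valid.sum(dim=1).clamp(min=1.0)  # [32, 768]
print(pooled.shape)  # torch.Size([32, 768])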

gracikk-ds avatar Jun 20 '24 10:06 gracikk-ds

@cg1177, could you summon @JustinYuu one more time? :DD

gracikk-ds avatar Jun 21 '24 06:06 gracikk-ds

@cg1177, could you summon @JustinYuu one more time? :DD

OK

takfate avatar Jun 21 '24 09:06 takfate