
Inference Pipeline for Vid2Seq

mmaaz60 opened this issue 1 year ago • 13 comments

Hi @antoyang ,

Thanks for sharing your great work on Vid2Seq. Would it be possible to share/release a simple inference script to run inference on a given video using the pretrained model? That would be really helpful.

Thanks

mmaaz60 avatar May 13 '23 19:05 mmaaz60

Hi, I do not have the bandwidth to work on this right now (I may do it over the summer), but let me know if you face any issues when implementing such a script.

antoyang avatar May 14 '23 04:05 antoyang

Thank you @antoyang,

I am trying to run Vid2Seq inference on some private videos. Specifically, I want to get the video features after the video temporal encoder. The following is what I wrote.

# Imports (the vit import path is an assumption; adjust to your scenic checkout)
from flax.training import checkpoints
from jax import jit
from scenic.projects.baselines import vit

# Create the temporal encoder with the Vid2Seq visual-encoder hyperparameters
model = vit.Encoder(mlp_dim=2048, num_layers=12, num_heads=12,
                    positional_embedding='learned_1d', dropout_rate=0.0,
                    attention_dropout_rate=0.0, stochastic_depth=0.0)

# Restore the pretrained checkpoint and pull out the visual-encoder parameters
restored = checkpoints.restore_checkpoint(checkpoints_path, target=None)
params = restored['optimizer']['target']
visual_encoder_params = params["encoder"]["visual_encoder"]

# JIT-compile the forward pass of the visual encoder
@jit
def fast_apply(params, input_data):
    return model.apply({"params": params}, input_data, mutable=False)

output_batch = fast_apply(visual_encoder_params, input_batch)

Here, input_batch contains the CLIP ViT-L/14 features, of shape (100, 768).

I wanted to check whether my implementation is correct or I am missing any details. I am new to JAX, so your comments would be really helpful. Thanks
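
As a quick sanity check (assuming the encoder -> visual_encoder key path above matches your checkpoint), printing the shapes in the restored parameter sub-tree helps confirm the right parameters were pulled out before applying the model:

# Print the shapes of the restored visual-encoder parameters (values are not touched).
# The key path follows the snippet above and may differ for other Vid2Seq checkpoints.
import jax

print(jax.tree_util.tree_map(lambda x: x.shape, visual_encoder_params))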

mmaaz60 avatar May 14 '23 11:05 mmaaz60

@mmaaz60 Did you ever find out if your inference pipeline worked? If so, could you provide it? Thanks!

Andrew-Zhang avatar May 23 '23 17:05 Andrew-Zhang

Hi @Andrew-Zhang,

This is what I have (the snippet in my earlier comment above); I am not 100% sure it is the right way to do it, but it seems to be working.

mmaaz60 avatar May 24 '23 17:05 mmaaz60

I just want to create a caption for a one-minute video for teaching and learning purposes, but I am new to coding with such a massive architecture. Can anyone please provide end-to-end inference code (read video, load pre-trained model, generate caption)? That would be a great help to me. I tried @mmaaz60's snippet but was not able to get anything working. Thanks in advance.

Ajeet-kumar1 avatar Jun 21 '23 09:06 Ajeet-kumar1

@mmaaz60 Are you sure you were able to make it work? I am getting AssertionError: assert inputs.ndim == 3 # Shape is [batch, len, emb].

I use SentenceTransformer from HuggingFace to extract features for every frame of my video (1 frame per second), and get a tensor of shape (batch_size, 768).

When I execute fast_apply I get the above assertion error. It seems like our input_batch needs to be a 3-D array?
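
For reference, a minimal sketch of the reshape that satisfies that assertion, assuming frame_feats holds the per-frame features of a single video; note also that the thread and paper describe the inputs as CLIP ViT-L/14 visual features, so text embeddings with a matching dimension may still not be meaningful inputs:

# Reshape per-frame features into the expected [batch, len, emb] layout.
# frame_feats is assumed to be a (num_frames, 768) array for one video.
import jax.numpy as jnp

frame_feats = jnp.zeros((100, 768))            # placeholder for real per-frame features
input_batch = jnp.expand_dims(frame_feats, 0)  # -> (1, 100, 768)
assert input_batch.ndim == 3                   # matches the encoder's expectation
output_batch = fast_apply(visual_encoder_params, input_batch)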

debasishkanhar avatar Jun 21 '23 16:06 debasishkanhar

So, on further digging, it looks like the Vid2Seq model's visual encoder layer requires an input_batch of shape (N, 100, 768), where N is the number of frames and 768 is the output dim of the CLIP ViT-L/14 model. Can you help me understand what the 100 in that shape is? @mmaaz60, if I understand correctly, you are inputting a single frame of shape (1, 100, 768). Can you help me understand where you got the 100 x 768 features used as input to the Vid2Seq model? Thanks. @antoyang, it would be helpful if you could pitch in. I went through the paper PDF but could not figure out where the 100th dimension value comes from.

debasishkanhar avatar Jun 23 '23 06:06 debasishkanhar

The 100 corresponds to the number of frames used by the model, as explained in the implementation details section of the paper.
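
For example, one simple (unofficial) way to get exactly 100 feature rows per video is to uniformly sample frame indices; the exact frame-sampling strategy used for training may differ:

# Uniformly resample per-frame features to a fixed number of frames (100 here).
# frame_feats: (num_frames, 768) array of per-frame CLIP ViT-L/14 features.
import numpy as np

def resample_to_fixed_frames(frame_feats, num_target=100):
    num_frames = frame_feats.shape[0]
    idx = np.round(np.linspace(0, num_frames - 1, num_target)).astype(int)
    return frame_feats[idx]  # (num_target, 768)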

antoyang avatar Jun 24 '23 17:06 antoyang

Hey @antoyang, thanks for sharing the code to this project!

Bumping this up: is there any chance you have the bandwidth to share a simple inference script for Vid2Seq, or to provide some guidance on how to write one?

kerenganon avatar Jul 15 '23 15:07 kerenganon

@mmaaz60 would it be possible to share the whole script?

I also want to use Vid2Seq to create captions for some videos by loading the anet-captions checkpoint, as stated in the README under the vid2seq directory.

In your code example, you are encoding the videos using the Vision Transformer Encoder. But do you actually use the encoded batch as an input for creating captions?

It would be extremely helpful if someone could provide a working example of how to create captions from a raw mp4 file.
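
Until an official script exists, here is a rough, unofficial sketch of just the frame-feature extraction step from an mp4, using OpenCV and the HuggingFace CLIP ViT-L/14 checkpoint; the model name, preprocessing, and frame sampling here are assumptions and may not exactly match what the scenic code expects. The resulting (100, 768) array would then be fed through the visual encoder as in the snippet earlier in the thread; caption decoding is not covered here.

# Extract per-frame CLIP ViT-L/14 features from an mp4 (unofficial sketch).
import cv2
import numpy as np
import torch
from transformers import CLIPImageProcessor, CLIPVisionModelWithProjection

processor = CLIPImageProcessor.from_pretrained("openai/clip-vit-large-patch14")
clip_model = CLIPVisionModelWithProjection.from_pretrained("openai/clip-vit-large-patch14").eval()

def extract_clip_features(video_path, num_frames=100):
    # Uniformly sample num_frames frames from the video.
    cap = cv2.VideoCapture(video_path)
    total = int(cap.get(cv2.CAP_PROP_FRAME_COUNT))
    idx = np.round(np.linspace(0, max(total - 1, 0), num_frames)).astype(int)
    frames = []
    for i in idx:
        cap.set(cv2.CAP_PROP_POS_FRAMES, int(i))
        ok, frame = cap.read()
        if ok:
            frames.append(cv2.cvtColor(frame, cv2.COLOR_BGR2RGB))
    cap.release()
    # Encode frames with CLIP; image_embeds is (num_frames, 768) for ViT-L/14.
    inputs = processor(images=frames, return_tensors="pt")
    with torch.no_grad():
        feats = clip_model(**inputs).image_embeds
    return feats.numpy()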

thurnbauermatthi avatar Jul 18 '23 12:07 thurnbauermatthi

Hi @mmaaz60, would you please be able to share your working inference script? It would be really helpful, thanks!

vishaal27 avatar Aug 11 '23 14:08 vishaal27

You may have a look at the Vid2Seq PyTorch implementation (with a few differences explained in the readme) included here: https://github.com/antoyang/VidChapters. It also includes an example inference script.

antoyang avatar Sep 26 '23 20:09 antoyang

@antoyang Thanks for sharing your great work on Vid2Seq. Can you indicate the specific version numbers of flax, jax, and jaxlib? We have encountered some dependency conflicts. Thank you very much.

yinyantao avatar Jun 17 '24 07:06 yinyantao