
Inference Pipeline for Vid2Seq

mmaaz60 opened this issue 1 year ago • 13 comments

Hi @antoyang ,

Thanks for sharing your great work on Vid2Seq. Would it be possible to share/release a simple inference script to run inference on a given video using the pretrained model? That would be really helpful.

Thanks

mmaaz60 avatar May 13 '23 19:05 mmaaz60

Hi, I do not have the bandwidth to work on this right now (I may do it over the summer), but let me know if you face any issues when implementing such a script.

antoyang avatar May 14 '23 04:05 antoyang

Thank you @antoyang,

I am trying to run Vid2Seq inference on some private videos. Specifically, I want to get the video features after the video temporal encoder. The following is what I wrote.

# Imports (the vit import path is an assumption; adjust to your scenic checkout)
from flax.training import checkpoints
from jax import jit
from scenic.projects.baselines import vit

# Create the temporal encoder with the Vid2Seq visual-encoder hyperparameters
model = vit.Encoder(mlp_dim=2048, num_layers=12, num_heads=12,
                    positional_embedding='learned_1d', dropout_rate=0.0,
                    attention_dropout_rate=0.0, stochastic_depth=0.0)

# Restore the pretrained checkpoint and pull out the visual-encoder parameters
restored = checkpoints.restore_checkpoint(checkpoints_path, target=None)
params = restored['optimizer']['target']
visual_encoder_params = params["encoder"]["visual_encoder"]

# JIT-compile the forward pass of the visual encoder
@jit
def fast_apply(params, input_data):
    return model.apply({"params": params}, input_data, mutable=False)

output_batch = fast_apply(visual_encoder_params, input_batch)

Here, input_batch contains the CLIP ViT-L/14 features, of shape (100, 768).

I wanted to check whether my implementation is correct or I am missing any details. I am new to JAX, so your comments would be really helpful. Thanks
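
As a quick sanity check (assuming the encoder -> visual_encoder key path above matches your checkpoint), printing the shapes in the restored parameter sub-tree helps confirm the right parameters were pulled out before applying the model:

# Print the shapes of the restored visual-encoder parameters (values are not touched).
# The key path follows the snippet above and may differ for other Vid2Seq checkpoints.
import jax

print(jax.tree_util.tree_map(lambda x: x.shape, visual_encoder_params))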

mmaaz60 avatar May 14 '23 11:05 mmaaz60

@mmaaz60 Did you ever find out if your inference pipeline worked? If so, could you provide it? Thanks!

Andrew-Zhang avatar May 23 '23 17:05 Andrew-Zhang

Hi @Andrew-Zhang,

This is what I have (the snippet in my earlier comment above); I am not 100% sure it is the right way to do it, but it seems to be working.

mmaaz60 avatar May 24 '23 17:05 mmaaz60

I just want to create a caption for a one-minute video for teaching and learning purposes, but I am new to coding with such a massive architecture. Can anyone please provide end-to-end inference code (read video, load pre-trained model, generate caption)? That would be a great help to me. I tried @mmaaz60's snippet but was not able to get anything working. Thanks in advance.

Ajeet-kumar1 avatar Jun 21 '23 09:06 Ajeet-kumar1

@mmaaz60 Are you sure you were able to make it work? I am getting AssertionError: assert inputs.ndim == 3 # Shape is [batch, len, emb].

I use SentenceTransformer from HuggingFace to extract features for every frame of my video (1 frame per second), and get a tensor of shape (batch_size, 768).

When I execute fast_apply I get the above assertion error. It seems like our input_batch needs to be a 3-D array?
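
For reference, a minimal sketch of the reshape that satisfies that assertion, assuming frame_feats holds the per-frame features of a single video; note also that the thread and paper describe the inputs as CLIP ViT-L/14 visual features, so text embeddings with a matching dimension may still not be meaningful inputs:

# Reshape per-frame features into the expected [batch, len, emb] layout.
# frame_feats is assumed to be a (num_frames, 768) array for one video.
import jax.numpy as jnp

frame_feats = jnp.zeros((100, 768))            # placeholder for real per-frame features
input_batch = jnp.expand_dims(frame_feats, 0)  # -> (1, 100, 768)
assert input_batch.ndim == 3                   # matches the encoder's expectation
output_batch = fast_apply(visual_encoder_params, input_batch)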

debasishkanhar avatar Jun 21 '23 16:06 debasishkanhar

So, on further digging, it looks like the Vid2Seq model's visual encoder layer requires an input_batch of shape (N, 100, 768), where N is the number of frames and 768 is the output dim of the CLIP ViT-L/14 model. Can you help me understand what the 100 in that shape is? @mmaaz60, if I understand correctly, you are inputting a single frame of shape (1, 100, 768). Can you help me understand where you got the 100 x 768 features used as input to the Vid2Seq model? Thanks. @antoyang, it would be helpful if you could pitch in. I went through the paper PDF but could not figure out where the 100th dimension value comes from.

debasishkanhar avatar Jun 23 '23 06:06 debasishkanhar

The 100 corresponds to the number of frames used by the model, as explained in the implementation details section of the paper.
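
For example, one simple (unofficial) way to get exactly 100 feature rows per video is to uniformly sample frame indices; the exact frame-sampling strategy used for training may differ:

# Uniformly resample per-frame features to a fixed number of frames (100 here).
# frame_feats: (num_frames, 768) array of per-frame CLIP ViT-L/14 features.
import numpy as np

def resample_to_fixed_frames(frame_feats, num_target=100):
    num_frames = frame_feats.shape[0]
    idx = np.round(np.linspace(0, num_frames - 1, num_target)).astype(int)
    return frame_feats[idx]  # (num_target, 768)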

antoyang avatar Jun 24 '23 17:06 antoyang

Hey @antoyang, thanks for sharing the code to this project!

Bumping this up: is there any chance you have the bandwidth to share a simple inference script for Vid2Seq, or to provide some guidance on how to write one?

kerenganon avatar Jul 15 '23 15:07 kerenganon

@mmaaz60 would it be possible to share the whole script?

I also want to use Vid2Seq to create captions for some videos by loading the anet-captions checkpoint, as stated in the README under the vid2seq directory.

In your code example, you are encoding the videos using the Vision Transformer Encoder. But do you actually use the encoded batch as an input for creating captions?

It would be extremely helpful if someone could provide a working example of how to create captions from a raw mp4 file.
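
Until an official script exists, here is a rough, unofficial sketch of just the frame-feature extraction step from an mp4, using OpenCV and the HuggingFace CLIP ViT-L/14 checkpoint; the model name, preprocessing, and frame sampling here are assumptions and may not exactly match what the scenic code expects. The resulting (100, 768) array would then be fed through the visual encoder as in the snippet earlier in the thread; caption decoding is not covered here.

# Extract per-frame CLIP ViT-L/14 features from an mp4 (unofficial sketch).
import cv2
import numpy as np
import torch
from transformers import CLIPImageProcessor, CLIPVisionModelWithProjection

processor = CLIPImageProcessor.from_pretrained("openai/clip-vit-large-patch14")
clip_model = CLIPVisionModelWithProjection.from_pretrained("openai/clip-vit-large-patch14").eval()

def extract_clip_features(video_path, num_frames=100):
    # Uniformly sample num_frames frames from the video.
    cap = cv2.VideoCapture(video_path)
    total = int(cap.get(cv2.CAP_PROP_FRAME_COUNT))
    idx = np.round(np.linspace(0, max(total - 1, 0), num_frames)).astype(int)
    frames = []
    for i in idx:
        cap.set(cv2.CAP_PROP_POS_FRAMES, int(i))
        ok, frame = cap.read()
        if ok:
            frames.append(cv2.cvtColor(frame, cv2.COLOR_BGR2RGB))
    cap.release()
    # Encode frames with CLIP; image_embeds is (num_frames, 768) for ViT-L/14.
    inputs = processor(images=frames, return_tensors="pt")
    with torch.no_grad():
        feats = clip_model(**inputs).image_embeds
    return feats.numpy()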

thurnbauermatthi avatar Jul 18 '23 12:07 thurnbauermatthi

Hi @mmaaz60, would you please be able to share your working inference script? It would be really helpful, thanks!

vishaal27 avatar Aug 11 '23 14:08 vishaal27

You may have a look at the Vid2Seq PyTorch implementation (with a few differences explained in the readme) included here: https://github.com/antoyang/VidChapters. It also includes an example inference script.

antoyang avatar Sep 26 '23 20:09 antoyang

@antoyang Thanks for sharing your great work on Vid2Seq. Can you indicate the specific version numbers of flax, jax, and jaxlib? We have encountered some dependency conflicts. Thank you very much.

yinyantao avatar Jun 17 '24 07:06 yinyantao