Inference Pipeline for Vid2Seq
Hi @antoyang,
Thanks for sharing your great work on Vid2Seq. Would it be possible to share/release a simple inference script to run inference on a given video using the pretrained model? That would be really helpful.
Thanks
Hi, I do not have the bandwidth to work on this now (I may do it over the summer), but let me know if you face any issue when implementing such a script.
Thank you @antoyang,
I am trying to run inference with Vid2Seq on some private videos. Specifically, I want to get the video features after the video temporal encoder. The following is what I wrote:
# Imports: `vit` is assumed to be scenic's ViT module that provides the Encoder used by Vid2Seq.
from jax import jit
from flax.training import checkpoints

# Create the visual (temporal) encoder with the Vid2Seq configuration.
model = vit.Encoder(mlp_dim=2048, num_layers=12, num_heads=12,
                    positional_embedding='learned_1d', dropout_rate=0.0,
                    attention_dropout_rate=0.0, stochastic_depth=0.0)

# Restore the pretrained checkpoint and pull out the visual encoder parameters.
restored = checkpoints.restore_checkpoint(checkpoints_path, target=None)
params = restored['optimizer']['target']
visual_encoder_params = params["encoder"]["visual_encoder"]

@jit
def fast_apply(params, input_data):
    return model.apply({"params": params}, input_data, mutable=False)

output_batch = fast_apply(visual_encoder_params, input_batch)
Here input_batch contains the CLIP ViT-L/14 features, of shape (100, 768).
I wanted to check whether my implementation is correct or if I am missing any details. I am new to JAX, and your comments would be really helpful in this regard. Thanks.
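For reference, here is a minimal sketch (my own, not the official Vid2Seq pipeline) of how per-frame 768-d CLIP ViT-L/14 features could be extracted with the Hugging Face transformers library; the checkpoint name and the helper function are assumptions, not part of the scenic codebase.

import torch
from transformers import CLIPImageProcessor, CLIPVisionModelWithProjection

clip_name = "openai/clip-vit-large-patch14"  # assumed checkpoint name
image_processor = CLIPImageProcessor.from_pretrained(clip_name)
clip_model = CLIPVisionModelWithProjection.from_pretrained(clip_name).eval()

def extract_frame_features(frames):
    # frames: list of PIL.Image objects sampled from the video (e.g. at 1 fps).
    inputs = image_processor(images=frames, return_tensors="pt")
    with torch.no_grad():
        outputs = clip_model(**inputs)
    # image_embeds are the projected CLIP embeddings, 768-d for ViT-L/14.
    return outputs.image_embeds.cpu().numpy()  # shape: (num_frames, 768)

The resulting (num_frames, 768) array would then still need to be brought to the input shape the visual encoder expects; see the later discussion in this thread.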
@mmaaz60 Did you ever find out if your inference pipeline worked? If so, could you provide it? Thanks!
Hi @Andrew-Zhang,
The code I shared above is what I have. I'm not 100% sure it's the right way to do it, but it seems to be working.
I just want to create a caption for a one-minute video for teaching and learning purposes, but I am new to coding with such a massive architecture. Can anyone please provide end-to-end inference code (read video, load the pretrained model, generate a caption)? That would be a great help to me. I tried @mmaaz60's code but was not able to get anything working. Thanks in advance.
@mmaaz60 Are you sure you're able to make it work? I am getting AssertionError: assert inputs.ndim == 3 # Shape is [batch, len, emb].
I use SentenceTransformer from HuggingFace to extract features for every frame of my video (1 frame per second), and get a tensor of size (batch_size, 768).
When I execute fast_apply, I get the above assertion error. It seems like our input_batch needs to be a 3D array?
So, on further digging, it looks like the Vid2Seq model's visual encoder layer requires an (N, 100, 768)-dim input_batch, where N is the number of frames and 768 is the output dim of the CLIP ViT-L/14 model. Can you help me understand what the 100 in the dimension is? @mmaaz60, if I understand correctly, you're inputting a single frame of shape (1, 100, 768). Can you tell me where you got the 100 x 768-dim features as input for the Vid2Seq model? Thanks. @antoyang, it would be helpful if you could pitch in. I tried going through the paper PDF but couldn't find where the 100th dim value comes from.
The 100 corresponds to the number of frames used by the model, as explained in the implementation details section of the paper.
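For anyone hitting the 3-D shape assertion above, here is a small sketch (an assumption about the preprocessing, not the official code) that uniformly resamples a (num_frames, 768) array of per-frame CLIP features to the 100 frames mentioned above and adds the leading batch dimension, giving the [batch, len, emb] shape the encoder asserts on. frame_features is assumed to come from a per-frame CLIP feature extractor, and fast_apply / visual_encoder_params refer to the snippet earlier in the thread.

import numpy as np

def to_model_input(frame_features, num_frames=100):
    # frame_features: (T, 768) array of per-frame CLIP ViT-L/14 features.
    t = frame_features.shape[0]
    # Uniformly resample (or repeat) frame indices so exactly num_frames remain;
    # the actual frame sampling scheme used by Vid2Seq may differ.
    idx = np.linspace(0, t - 1, num_frames).round().astype(int)
    resampled = frame_features[idx]   # (100, 768)
    return resampled[None, ...]       # (1, 100, 768) -> [batch, len, emb]

input_batch = to_model_input(frame_features)
output_batch = fast_apply(visual_encoder_params, input_batch)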
Hey @antoyang, thanks for sharing the code for this project!
Bumping this up: is there any chance you have the bandwidth to share a simple inference script for Vid2Seq, or to provide some guidance on how to write one?
@mmaaz60 would it be possible to share the whole script?
I also want to use Vid2Seq by loading the anet-captions checkpoint, as stated in the README under the vid2seq directory, to create captions for some videos.
In your code example, you encode the videos using the Vision Transformer encoder, but do you actually use the encoded batch as an input for generating captions?
It would be extremely helpful if someone could provide a working example of how to create captions from a raw mp4 file.
Hi @mmaaz60, would you please be able to share your working inference script? It would be really helpful, thanks!
You may have a look at the Vid2Seq PyTorch implementation (with a few differences explained in the README) included here: https://github.com/antoyang/VidChapters. It also includes an example inference script.
@antoyang Thanks for sharing your great work on Vid2Seq. Can you indicate the specific version numbers of flax, jax, and jaxlib? We have encountered some dependency conflicts. Thank you very much.