av_hubert
How to extract audio-visual features?
Hi, thank you for the work and the colab. In the colab, the following code snippet shows how to extract visual features.
def extract_visual_feature(video_path, ckpt_path, user_dir, is_finetune_ckpt=False):
    utils.import_user_module(Namespace(user_dir=user_dir))
    models, saved_cfg, task = checkpoint_utils.load_model_ensemble_and_task([ckpt_path])
    transform = avhubert_utils.Compose([
        avhubert_utils.Normalize(0.0, 255.0),
        avhubert_utils.CenterCrop((task.cfg.image_crop_size, task.cfg.image_crop_size)),
        avhubert_utils.Normalize(task.cfg.image_mean, task.cfg.image_std)])
    frames = avhubert_utils.load_video(video_path)
    print(f"Load video {video_path}: shape {frames.shape}")
    frames = transform(frames)
    print(f"Center crop video to: {frames.shape}")
    frames = torch.FloatTensor(frames).unsqueeze(dim=0).unsqueeze(dim=0).cuda()
    model = models[0]
    if hasattr(models[0], 'decoder'):
        print("Checkpoint: fine-tuned")
        model = models[0].encoder.w2v_model
    else:
        print("Checkpoint: pre-trained w/o fine-tuning")
    model.cuda()
    model.eval()
    with torch.no_grad():
        # Specify output_layer if you want to extract feature of an intermediate layer
        feature, _ = model.extract_finetune(source={'video': frames, 'audio': None}, padding_mask=None, output_layer=None)
        feature = feature.squeeze(dim=0)
    print(f"Video feature shape: {feature.shape}")
    return feature
I wonder how I can extract audio-visual features.
Could you please give an example, or specify what to feed into source['audio']? Is it a waveform normalized to [-1, 1], or some other spectral feature?
Thank you.
Hi,
source['audio'] is the log filterbank, see here. It should also be normalized like here when task.cfg.normalize=true, which I believe is the case for all the models we release. Besides, source['audio'] should have the same sequence length as source['video'] before being fed into the model, as we assume the audio and video are synchronized.
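The preparation described above can be sketched roughly as follows. This is a minimal sketch under stated assumptions, not the repository's own code: the helper names `stack_frames`, `align_to_video`, and `layer_normalize` are hypothetical, the stacking factor of 4 assumes 100 audio frames per second against 25 video frames per second, the normalization is assumed to be a per-frame layer norm, and the random array stands in for a real log filterbank output.

```python
import numpy as np

def stack_frames(feats, stack_order):
    """Concatenate every `stack_order` consecutive frames: [T, F] -> [ceil(T/k), F*k]."""
    num_frames, feat_dim = feats.shape
    remainder = num_frames % stack_order
    if remainder:
        # Zero-pad so the frame count is a multiple of stack_order
        pad = np.zeros((stack_order - remainder, feat_dim), dtype=feats.dtype)
        feats = np.concatenate([feats, pad], axis=0)
    return feats.reshape(-1, stack_order * feat_dim)

def align_to_video(audio_feats, num_video_frames):
    """Zero-pad or truncate audio so it has exactly one row per video frame."""
    diff = len(audio_feats) - num_video_frames
    if diff < 0:
        pad = np.zeros((-diff, audio_feats.shape[1]), dtype=audio_feats.dtype)
        return np.concatenate([audio_feats, pad], axis=0)
    return audio_feats[:num_video_frames]

def layer_normalize(feats, eps=1e-5):
    """Normalize each frame over its feature dimension (zero mean, unit variance)."""
    mean = feats.mean(axis=-1, keepdims=True)
    var = feats.var(axis=-1, keepdims=True)
    return (feats - mean) / np.sqrt(var + eps)

# Stand-in for a log filterbank: 381 audio frames of 26 coefficients (assumed sizes)
rng = np.random.default_rng(0)
logfbank_feats = rng.normal(size=(381, 26)).astype(np.float32)
num_video_frames = 96

audio = stack_frames(logfbank_feats, stack_order=4)  # 381 frames padded to 384, then grouped by 4
audio = align_to_video(audio, num_video_frames)      # one row per video frame
audio = layer_normalize(audio)
print(audio.shape)  # (96, 104)
```

Padding rather than dropping the trailing frames keeps the last partial group of audio; whether that matches the behavior you need depends on how the checkpoint was trained.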
Hello,
I am using your script to extract audio-visual features.
I extracted the log filterbank using python_speech_features; it has shape (96, 26), and the frames have shape (96, 88, 88).
It throws the following error:
hubert.py", line 327, in forward
x = self.proj(x.transpose(1, 2))
numpy.AxisError: axis 2 is out of bounds for array of dimension 2
when I run the following command:
feature, _ = model.extract_finetune(source={'video': frames, 'audio': audio_feat}, padding_mask=None, output_layer=None)
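The `numpy.AxisError` indicates that `x.transpose(1, 2)` was called on a two-dimensional NumPy array, so one thing worth checking is the type and layout of `audio_feat`. Below is a minimal sketch of that check, assuming the model wants a batched input laid out as (batch, feature, time). The helper name `to_model_layout` is hypothetical, the zero array is a placeholder for real features, and the feature size of 104 assumes four stacked 26-dimensional frames, which may not match a given checkpoint.

```python
import numpy as np

def to_model_layout(audio_feats):
    """Turn per-frame features [T, F] into a batch of one laid out as [1, F, T]."""
    if audio_feats.ndim != 2:
        raise ValueError(f"expected [T, F] features, got shape {audio_feats.shape}")
    # Transpose to [F, T], then add a leading batch axis
    return np.ascontiguousarray(audio_feats.T[np.newaxis])

audio_feat = np.zeros((96, 104), dtype=np.float32)  # placeholder features
batch = to_model_layout(audio_feat)
print(batch.shape)  # (1, 104, 96)

# With PyTorch available, the array would then be converted to a tensor on the
# model's device before the call, e.g. torch.from_numpy(batch).cuda(), so that
# x.transpose(1, 2) operates on a 3-D tensor rather than a 2-D NumPy array.
```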