David Gimeno-Gómez comments

Results: 11 comments of David Gimeno-Gómez

ASR DECODING: maxlenratio & minlenratio

You are right, I forgot to include the error trace! Please find here an example of the error raised when setting ```maxlenratio``` to 1.2: ``` Traceback (most recent call last):...

ASR DECODING: maxlenratio & minlenratio

Thanks for sharing that tutorial, it helped me understand how to set ```maxlenratio``` and ```minlenratio``` more appropriately depending on whether the hypothesis I am getting suffered...
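For readers unfamiliar with these two options, here is a minimal sketch of how length ratios are commonly turned into hypothesis length bounds during beam search. It assumes the convention that a ratio is multiplied by the number of encoder frames and that a `maxlenratio` of 0 means "use the encoder length"; the function name `length_bounds` is hypothetical and this is not taken from any toolkit's actual code.

```python
def length_bounds(enc_len, maxlenratio, minlenratio):
    """Return (maxlen, minlen) in output tokens for one utterance.

    Assumed convention (not confirmed by the thread): ratios scale the
    encoder output length, and maxlenratio == 0 falls back to enc_len.
    """
    if maxlenratio == 0:
        maxlen = enc_len
    else:
        # Never allow a maximum shorter than one token.
        maxlen = max(1, int(maxlenratio * enc_len))
    minlen = int(minlenratio * enc_len)
    return maxlen, minlen


if __name__ == "__main__":
    # 50 encoder frames, hypotheses limited to between 25 and 100 tokens.
    print(length_bounds(50, 2.0, 0.5))
```

Under this convention, a larger `maxlenratio` helps when hypotheses are cut short, and a larger `minlenratio` helps when decoding stops too early.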

Explanation of landmark's heatmap output

@Developer1881 My intuition is that, at some point during the model's forward pass, the cropped face is embedded in a 64x64 latent representation, and then a heatmap is predicted for each...
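To make the heatmap idea concrete, here is a small sketch of one common way to read coordinates out of per-landmark heatmaps: take the location of the maximum in each map. The array shape `(num_landmarks, 64, 64)` and the function name `heatmaps_to_landmarks` are assumptions for illustration; the actual model may decode its heatmaps differently (for example with sub-pixel refinement).

```python
import numpy as np


def heatmaps_to_landmarks(heatmaps):
    """Return an (N, 2) array of (x, y) peak positions, one per heatmap.

    heatmaps: array of shape (N, H, W), one map per landmark (assumed layout).
    """
    n, h, w = heatmaps.shape
    # Index of the maximum in each flattened map.
    flat_idx = heatmaps.reshape(n, -1).argmax(axis=1)
    # Convert flat indices back to (row, column) = (y, x).
    ys, xs = np.unravel_index(flat_idx, (h, w))
    return np.stack([xs, ys], axis=1)


if __name__ == "__main__":
    maps = np.zeros((2, 64, 64), dtype=np.float32)
    maps[0, 10, 20] = 1.0  # landmark 0 peaks at x=20, y=10
    maps[1, 63, 0] = 1.0   # landmark 1 peaks at x=0, y=63
    print(heatmaps_to_landmarks(maps))
```

The resulting coordinates are in heatmap space, so they would still need to be scaled back to the resolution of the cropped face.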

Extraction of features with AV HuBERT

Although this may come a bit late, that 104 refers to the number of features composing the audio input tensor. In order to make the model work with audio stream...
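As an illustration of where a per-frame dimension such as 104 can come from, the sketch below stacks consecutive audio frames into one wider vector so that the audio frame rate drops to match a slower video frame rate. The values used here (26 features per audio frame, 4 audio frames per video frame) and the function name `stack_frames` are assumptions for the example, not something stated in the thread.

```python
import numpy as np


def stack_frames(feats, stack_order):
    """Concatenate every `stack_order` consecutive frames into one frame.

    feats: array of shape (T, D). Returns shape (ceil(T / stack_order),
    D * stack_order); the tail is zero-padded if T is not a multiple.
    """
    feat_dim = feats.shape[1]
    remainder = len(feats) % stack_order
    if remainder:
        pad = np.zeros((stack_order - remainder, feat_dim), dtype=feats.dtype)
        feats = np.concatenate([feats, pad], axis=0)
    return feats.reshape(-1, stack_order * feat_dim)


if __name__ == "__main__":
    audio = np.random.rand(10, 26).astype(np.float32)  # 10 frames, 26 dims (assumed)
    print(stack_frames(audio, 4).shape)  # 3 stacked frames of 26 * 4 = 104 dims
```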

Extraction of features with AV HuBERT

Digging through my scripts, I found that I did this padding trick: ``` diff = len(audio_feats) - len(video_frames) if diff < 0: audio_feats = np.concatenate([ audio_feats, np.zeros([-diff, audio_feats.shape[-1]], dtype=audio_feats.dtype),...
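A self-contained sketch of this kind of padding trick follows. The zero-padding branch mirrors the snippet above; the trimming branch for the case where the audio is longer than the video is an assumption added for completeness, and the function name `align_audio_to_video` is hypothetical.

```python
import numpy as np


def align_audio_to_video(audio_feats, num_video_frames):
    """Make audio_feats have exactly num_video_frames rows.

    audio_feats: array of shape (T_audio, D).
    """
    diff = len(audio_feats) - num_video_frames
    if diff < 0:
        # Audio is shorter: append zero frames, as in the snippet above.
        pad = np.zeros([-diff, audio_feats.shape[-1]], dtype=audio_feats.dtype)
        audio_feats = np.concatenate([audio_feats, pad])
    elif diff > 0:
        # Audio is longer: drop the extra trailing frames (assumption).
        audio_feats = audio_feats[:num_video_frames]
    return audio_feats


if __name__ == "__main__":
    audio = np.ones((8, 104), dtype=np.float32)
    print(align_audio_to_video(audio, 10).shape)  # padded to 10 frames
    print(align_audio_to_video(audio, 6).shape)   # trimmed to 6 frames
```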

Extraction of features with AV HuBERT

I am glad it worked :) However, I've never tried to extract features from a specific intermediate layer of AV-HuBERT, so in this case I cannot help you. I guess...

Extraction of features with AV HuBERT

In the field of ASR, the raw audio waveform is processed to extract the well-established Mel Frequency Cepstral Coefficients (MFCCs). I recommend reading more about how these audio...

Finetuning Models for Visual Speech Recognition

Thanks so much for your reply, it solved the problem! Nonetheless, I have a new question :) · First of all, I will explain my purpose. I am working...

Finetuning Models for Visual Speech Recognition

Before exploring the AV-HuBERT system with my own database, I wanted to see whether I could reach similar performance with the LRS3 database. The point is that I had already prepared...

Finetuning Models for Visual Speech Recognition

Thank you for clarifying these aspects! Now, I would like to ask how I can fine-tune this pre-trained model. I ran this command: `fairseq-hydra-train --config-dir ${PWD}/conf/finetune/ --config-name base_vox_433h.yaml task.data=${PWD}/data/LRS3-TED/speaker-independent/...