Visual_Speech_Recognition_for_Multiple_Languages
pre-trained VSR / ASR model
As mentioned in S3, the pre-trained models are always trained on the same data as the full model (though the pre-training details are not given), and in particular the pre-trained VSR model has exactly the same architecture as the full one. So I wonder why the supervised signals (e.g., intermediate representations) from the pre-trained VSR model are still useful. Could you give an in-depth explanation?
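For context, here is a minimal PyTorch sketch of what "supervised signals from intermediate representations" typically means in this kind of setup: a frozen pre-trained teacher provides per-layer hidden states, and the student is trained with an auxiliary loss that matches its own intermediate features to the teacher's. The `TinyEncoder` class, the L1 feature loss, and all shapes below are hypothetical stand-ins for illustration, not the repo's actual training code.

```python
import torch
import torch.nn as nn

# Toy stand-in for the real encoder; name and structure are hypothetical.
class TinyEncoder(nn.Module):
    def __init__(self, dim=256, depth=4):
        super().__init__()
        self.layers = nn.ModuleList(nn.Linear(dim, dim) for _ in range(depth))

    def forward(self, x):
        feats = []
        for layer in self.layers:
            x = torch.relu(layer(x))
            feats.append(x)  # keep each layer's hidden states
        return x, feats

teacher = TinyEncoder().eval()  # pre-trained VSR model, frozen
for p in teacher.parameters():
    p.requires_grad_(False)

student = TinyEncoder()         # full model being trained
optimizer = torch.optim.Adam(student.parameters(), lr=1e-4)

x = torch.randn(8, 50, 256)     # dummy (batch, frames, feature dim) input

with torch.no_grad():
    _, t_feats = teacher(x)
_, s_feats = student(x)

# Auxiliary loss: pull the student's intermediate representations toward
# the teacher's. Even on identical training data, the frozen teacher's
# features encode what an already-converged model has learned, so they
# act as an extra, denser supervision signal than the final labels alone.
aux_loss = sum(nn.functional.l1_loss(s, t) for s, t in zip(s_feats, t_feats))
aux_loss.backward()
optimizer.step()
```

In practice this auxiliary term would be added to the main task loss (e.g., CTC or cross-entropy) with a weighting coefficient; the sketch isolates the feature-matching part the question is about.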