End-to-End-VAD
Difference between published and reproduced results
Hello Team,
Thanks for providing the repo.
I have replicated this repo step by step, following the details given in the paper and in this repo. I first trained the two streams separately, then used their pretrained weights to train the multimodal architecture.
In my experiments, the accuracy I obtain (~82%) does not match the published accuracy (~91%). Could the team offer some guidance on how to reach the published results?
Thanks in advance.