joonson
This part of the pipeline evaluates the quality of the audio embeddings on the **downstream task** of speaker recognition. The audio-visual evaluation can be done using the validation...
The negative samples are the features at different timesteps within the same batch. In the `output` computed at this line: https://github.com/joonson/syncnet_trainer/blob/15e5cfcbe150da8ed5c04cfe74a011319ae60d06/SyncNetDist.py#L50 all off-diagonal elements are negatives.
> Is the context window for video and audio frames decided by the kernel size of the first audio and video conv layer? For ex: If we want a context...
The label should be the matching frame, i.e. along the diagonal. See https://ieeexplore.ieee.org/document/9067055
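To make the positive/negative structure concrete, here is a minimal sketch (not the repository's exact code) of a multi-way matching loss: the similarity matrix between video and audio features is scored with cross-entropy against diagonal targets, so the matching frame is the positive and all off-diagonal elements act as negatives. The function name `multiway_matching_loss` and the feature shapes are assumptions for illustration.

```python
import torch
import torch.nn.functional as F

def multiway_matching_loss(video_feat, audio_feat):
    """Illustrative multi-way matching loss.

    video_feat, audio_feat: (N, D) tensors, one feature per timestep/sample.
    Entry (i, j) of the similarity matrix compares video feature i with
    audio feature j.
    """
    # Pairwise similarities; the diagonal holds the matching (positive) pairs,
    # every off-diagonal element is a negative.
    output = torch.matmul(video_feat, audio_feat.t())  # (N, N)

    # The target for row i is column i, i.e. the diagonal.
    labels = torch.arange(output.size(0), device=output.device)

    return F.cross_entropy(output, labels)
```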
You only need VoxCeleb2.
No, the output is not sync-corrected. It just gives you an offset and active speaker detection labels.
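If you want to correct the sync yourself, a rough sketch of applying the reported offset externally is below; `apply_av_offset`, the 25 fps default, and the sign convention are assumptions rather than part of the pipeline.

```python
import subprocess

def apply_av_offset(video_in, video_out, offset_frames, fps=25.0):
    """Shift the audio track by the offset reported by the pipeline.

    The offset is given in video frames, so convert it to seconds first.
    Whether the offset should delay or advance the audio depends on how you
    read the reported value, so verify the sign on a known clip.
    """
    offset_sec = offset_frames / fps
    subprocess.run([
        "ffmpeg", "-y",
        "-i", video_in,                      # input 0: keep its video track
        "-itsoffset", f"{offset_sec:.3f}",   # shift timestamps of input 1
        "-i", video_in,                      # input 1: take its (shifted) audio track
        "-map", "0:v", "-map", "1:a",
        "-c", "copy",
        video_out,
    ], check=True)
```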
I think it is the same discussion as https://github.com/joonson/syncnet_python/issues/53
`run_pipeline.py` runs the face tracking script and re-encodes the video using `ffmpeg`. I suspect that the re-encoding process is introducing an offset.
`ffmpeg` should resample the video to minimize the synchronization error.
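For reference, the re-encoding/resampling step could look roughly like the following; the exact frame rate and audio sample rate used by `run_pipeline.py` may differ, so 25 fps and 16 kHz are assumptions here.

```python
import subprocess

def reencode_for_sync(video_in, video_out, fps=25, audio_rate=16000):
    """Re-encode to a constant video frame rate and audio sample rate,
    similar in spirit to the re-encoding step in run_pipeline.py."""
    subprocess.run([
        "ffmpeg", "-y",
        "-i", video_in,
        "-r", str(fps),          # force a constant video frame rate
        "-ar", str(audio_rate),  # resample the audio
        video_out,
    ], check=True)
```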
- **Offset**: audio-to-video offset in number of frames (using SyncNet)
- **FV Conf**: face verification confidence (using VGGFace2)
- **ASD Conf**: active speaker detection confidence (using SyncNet)