This part of the pipeline evaluates the quality of the audio embeddings on the **downstream task** of speaker recognition. The audio-visual evaluation can be done using the validation...

The negative samples are the features at different timesteps within the same batch. In the `output` tensor computed at this line: https://github.com/joonson/syncnet_trainer/blob/15e5cfcbe150da8ed5c04cfe74a011319ae60d06/SyncNetDist.py#L50 all off-diagonal elements are negatives.
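
For illustration, here is a minimal sketch of this batch-wise contrastive setup (not the repo's exact code; the tensor names and dimensions are assumptions): each audio embedding is scored against every video embedding in the batch, the diagonal pairs are the positives, and everything off the diagonal serves as a negative.

```python
import torch
import torch.nn.functional as F

# Hypothetical batch of N clips with D-dim audio and video embeddings.
audio = F.normalize(torch.randn(8, 512), dim=1)  # N x D
video = F.normalize(torch.randn(8, 512), dim=1)  # N x D

# N x N similarity matrix: entry (i, j) scores audio i against video j.
output = audio @ video.t()

# Diagonal entries are the matching (positive) pairs; all off-diagonal
# entries act as negatives, so the target class for row i is simply i.
labels = torch.arange(output.size(0))
loss = F.cross_entropy(output, labels)
```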

> Is the context window for video and audio frames decided by the kernel size of the first audio and video conv layer? For ex: If we want a context...

The label should be the matching frame, i.e. along the diagonal. See https://ieeexplore.ieee.org/document/9067055

No, the output is not sync-corrected. It just gives you an offset and active speaker detection labels.
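
If you want a sync-corrected file, one option is to apply the reported offset yourself by shifting the audio stream with `ffmpeg`. A minimal sketch, assuming 25 fps video; the sign convention (whether a positive offset means the audio should be delayed or advanced) is an assumption you should verify on your own data:

```python
import subprocess

def apply_offset(src, dst, offset_frames, fps=25):
    """Hypothetical helper: shift the audio of `src` by a frame offset."""
    shift_seconds = offset_frames / fps  # convert frame offset to seconds
    subprocess.run([
        "ffmpeg", "-y",
        "-i", src,                         # input 0: take the video stream
        "-itsoffset", str(shift_seconds),  # shift the next input's timestamps
        "-i", src,                         # input 1: take the shifted audio
        "-map", "0:v:0", "-map", "1:a:0",
        "-c:v", "copy",                    # leave the video stream untouched
        dst,
    ], check=True)
```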

I think it is the same discussion as https://github.com/joonson/syncnet_python/issues/53

`run_pipeline.py` runs the face tracking script and re-encodes the video using `ffmpeg`. I suspect that the re-encoding process is introducing an offset.

`ffmpeg` should resample the video to minimize the synchronization error.
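
Roughly what that re-encoding step does, as a sketch rather than the actual `run_pipeline.py` code (25 fps and 16 kHz mono are the rates SyncNet expects, but treat the exact flags as assumptions):

```python
import subprocess

def reencode(src, dst, fps=25, sample_rate=16000):
    # Re-encode to a constant frame rate and a fixed audio sample rate so
    # the audio and video streams share a consistent timebase.
    subprocess.run([
        "ffmpeg", "-y", "-i", src,
        "-r", str(fps),            # constant video frame rate
        "-ar", str(sample_rate),   # audio sample rate
        "-ac", "1",                # mono audio
        dst,
    ], check=True)
```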

- **Offset**: audio-to-video offset in # frames (using SyncNet)
- **FV Conf**: face verification confidence (using VGGFace2)
- **ASD Conf**: active speaker detection confidence (using SyncNet)