joonson
This part of the pipeline evaluates the quality of the audio embeddings on the **downstream task** of speaker recognition. The audio-visual evaluation can be done using the validation...
The negative samples are the features at different timesteps within the same batch. In the `output` computed at this line: https://github.com/joonson/syncnet_trainer/blob/15e5cfcbe150da8ed5c04cfe74a011319ae60d06/SyncNetDist.py#L50 all off-diagonal elements are negatives.
> Is the context window for video and audio frames decided by the kernel size of the first audio and video conv layer? For ex: If we want a context...
The label should be the matching frame, i.e. along the diagonal. See https://ieeexplore.ieee.org/document/9067055
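To make the positive/negative structure concrete, here is a minimal sketch (not the repository's exact code) of a multi-way matching loss: the similarity matrix between video and audio features is scored with cross-entropy against diagonal targets, so the matching frame is the positive and all off-diagonal elements act as negatives. The function name `multiway_matching_loss` and the feature shapes are assumptions for illustration.

```python
import torch
import torch.nn.functional as F

def multiway_matching_loss(video_feat, audio_feat):
    """Illustrative multi-way matching loss.

    video_feat, audio_feat: (N, D) tensors, one feature per timestep/sample.
    Entry (i, j) of the similarity matrix compares video feature i with
    audio feature j.
    """
    # Pairwise similarities; the diagonal holds the matching (positive) pairs,
    # every off-diagonal element is a negative.
    output = torch.matmul(video_feat, audio_feat.t())  # (N, N)

    # The target for row i is column i, i.e. the diagonal.
    labels = torch.arange(output.size(0), device=output.device)

    return F.cross_entropy(output, labels)
```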
You only need VoxCeleb2.
No, the output is not sync-corrected. It just gives you an offset and active speaker detection labels.
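If you want to correct the sync yourself, a rough sketch of applying the reported offset externally is below; `apply_av_offset`, the 25 fps default, and the sign convention are assumptions rather than part of the pipeline.

```python
import subprocess

def apply_av_offset(video_in, video_out, offset_frames, fps=25.0):
    """Shift the audio track by the offset reported by the pipeline.

    The offset is given in video frames, so convert it to seconds first.
    Whether the offset should delay or advance the audio depends on how you
    read the reported value, so verify the sign on a known clip.
    """
    offset_sec = offset_frames / fps
    subprocess.run([
        "ffmpeg", "-y",
        "-i", video_in,                      # input 0: keep its video track
        "-itsoffset", f"{offset_sec:.3f}",   # shift timestamps of input 1
        "-i", video_in,                      # input 1: take its (shifted) audio track
        "-map", "0:v", "-map", "1:a",
        "-c", "copy",
        video_out,
    ], check=True)
```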
I think it is the same discussion as https://github.com/joonson/syncnet_python/issues/53
`run_pipeline.py` runs the face tracking script and re-encodes the video using `ffmpeg`. I suspect that the re-encoding process is introducing an offset.
`ffmpeg` should resample the video to minimize the synchronization error.
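For reference, the re-encoding/resampling step could look roughly like the following; the exact frame rate and audio sample rate used by `run_pipeline.py` may differ, so 25 fps and 16 kHz are assumptions here.

```python
import subprocess

def reencode_for_sync(video_in, video_out, fps=25, audio_rate=16000):
    """Re-encode to a constant video frame rate and audio sample rate,
    similar in spirit to the re-encoding step in run_pipeline.py."""
    subprocess.run([
        "ffmpeg", "-y",
        "-i", video_in,
        "-r", str(fps),          # force a constant video frame rate
        "-ar", str(audio_rate),  # resample the audio
        video_out,
    ], check=True)
```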
- **Offset**: audio-to-video offset in number of frames (using SyncNet)
- **FV Conf**: face verification confidence (using VGGFace2)
- **ASD Conf**: active speaker detection confidence (using SyncNet)