
Canonical clips details

Open sammy-su opened this issue 2 years ago • 4 comments

I wonder when the details for the canonical clips will be available. In particular, how were they extracted from the canonical videos, and was there any transcoding in the process? We encountered some inconsistencies when processing the two sets of data and would like to figure out the cause.

sammy-su avatar Aug 12 '22 18:08 sammy-su

Hey @sammy-su, I'll update the wiki shortly.

In the meantime: the canonical clips were generated by transcoding the canonical videos with the VP9 codec at CRF 18. Specifically, we decode each frame into an ndarray, then encode it into a video stream in the output container, all using PyAV.

The transcode does mean the canonical clips are not byte-wise identical to the corresponding sections of the canonical videos, but it's necessary: the clip start/end points don't line up with keyframes, so we must encode new ones.

What inconsistencies are you seeing? We can certainly dive into them and see what's going on.

devanshk avatar Aug 12 '22 22:08 devanshk

We tried to extract the audio data, and it turns out that the difference between full_scale and clip is not negligible.

For example, when I compare the following audio segments from clip_uid cae37cbc-7ff0-40ea-b3a4-6e6a551f01ab:

  1. clip 00:18~00:21
  2. clip 00:21~00:24
  3. full_scale 00:18~00:21

the 2-norm between 1 and 3 is larger than the 2-norm between 1 and 2. While the 2-norm might not be a good way to compare audio, I would still expect 1 and 3 to be closer, given that they should differ only by a small offset. So I wonder: is the audio also transcoded?

sammy-su avatar Aug 16 '22 00:08 sammy-su

We do transcode audio in the same way. For audio frames, we use the AAC format which has a good explanation here.

The difference does seem a bit strange. Two questions:

  1. This clip segment starts at 592s in its parent. Are you comparing 00:18 - 00:21 in the clip to 9:52 - 9:55 in the parent?
  2. Do you have a notebook or something to help us replicate the 2-norm differences?
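
To make the 592 s offset arithmetic in question 1 explicit, a tiny hypothetical helper (not from the thread) converts seconds to the m:ss notation used above:

```python
def to_mmss(seconds: int) -> str:
    """Format a second count as m:ss."""
    return f"{seconds // 60}:{seconds % 60:02d}"


clip_start = 592  # clip's start within its parent video, per this thread
print(to_mmss(clip_start))       # 9:52
print(to_mmss(clip_start + 18))  # 10:10, i.e. clip 00:18 in parent time
```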

devanshk avatar Aug 19 '22 19:08 devanshk

  1. Yes, the audio from the full_scale video is based on the parent time. I actually checked the audio manually and couldn't tell the difference.
  2. I first extract the audio for the entire video with ffmpeg, using the copy option. The audio for the clip is then extracted using:
import numpy as np
import tensorflow as tf
from scipy.io import wavfile

# Read the extracted WAV file as a float32 array.
with tf.io.gfile.GFile(path, 'rb') as f:
    sample_rate, wav = wavfile.read(f)
wav = np.asfarray(wav, dtype=np.float32)

# Convert 30 fps video frame indices to audio sample indices.
audio_start = int(sample_rate * start_frame / 30.0)
audio_end = int(sample_rate * end_frame / 30.0 + 1)
audio = wav[audio_start:audio_end]

After extracting the audio for each clip, the values are compared using:

import numpy as np
import tensorflow as tf

# delta between clip and full_scale (first 160k samples, left channel)
delta = tf.signal.rfft(clip_audio[:160000, 0]) - tf.signal.rfft(full_scale_audio[:160000, 0])
print(np.linalg.norm(delta))

# delta between two adjacent full_scale segments, as a baseline
delta = tf.signal.rfft(full_scale_audio[:160000, 0]) - tf.signal.rfft(full_scale_audio[160000:320000, 0])
print(np.linalg.norm(delta))
  3. Further evidence of the difference: when we tried to use the audio data for the object state change classification task, we observed:

Train      | Validation | Accuracy
-----------|------------|---------
clip       | clip       | ~77%
clip       | full_scale | ~66%
full_scale | full_scale | ~70%
full_scale | clip       | ~70%

We believe the clip data introduces an information leak, because the clip videos contain only positive samples.
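
One possible diagnostic for the gap reported above: since the clip and full_scale streams may be misaligned by a small number of samples, estimating the offset via cross-correlation before differencing separates alignment error from actual transcoding loss. A NumPy sketch (the function and the cross-correlation approach are my suggestion, not from this thread; it takes 1-D signals, so pass a single channel such as audio[:, 0]):

```python
import numpy as np


def aligned_l2(a, b):
    """Estimate the sample offset between two 1-D audio signals via
    cross-correlation, then return (offset, 2-norm of the aligned overlap)."""
    n = min(len(a), len(b))
    a = np.asarray(a[:n], dtype=np.float64)
    b = np.asarray(b[:n], dtype=np.float64)
    corr = np.correlate(a, b, mode="full")
    lag = int(np.argmax(corr)) - (n - 1)  # b's content appears in a at +lag
    if lag >= 0:
        diff = a[lag:] - b[: n - lag]
    else:
        diff = a[: n + lag] - b[-lag:]
    return lag, float(np.linalg.norm(diff))
```

If the norm stays large even after alignment, the remaining difference would point at the lossy AAC re-encode rather than a timing offset.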

sammy-su avatar Aug 23 '22 22:08 sammy-su

Hi all, we have documentation for the canonical clips located here: https://ego4d-data.org/docs/data/videos/#canonical-clips

Apologies for the delay.

miguelmartin75 avatar Feb 09 '23 18:02 miguelmartin75