video2dataset
YouTube metadata is not saved
Issue
When using video2dataset (1.3.0) to download YouTube videos, I set the following entry in the config to retrieve metadata:
reading:
    yt_args:
        download_size: 360
        download_audio_rate: 44100
        yt_metadata_args:
            writesubtitles: 'all'
            subtitleslangs: ['en', 'de', 'es', 'fr', 'it', 'nl', 'pl', 'ru']
            writeautomaticsub: True
            get_info: True
    timeout: 60
    sampler: null
But in the resulting JSON files, the entry "yt_meta_dict": {} is empty even though get_info: True is set in the config.
How to reproduce
For example, take this link: https://www.youtube.com/embed/JFUsP1coIKM When I download it with yt-dlp:
yt-dlp -N 2 \
--write-subs --convert-subs srt \
--write-info-json --embed-subs --embed-chapters --embed-metadata \
--no-progress -q \
--format 'b[height<=360][ext=mp4]' \
--output './demo.mp4' \
https://www.youtube.com/embed/JFUsP1coIKM
I get YouTube metadata such as "categories": ["Entertainment"], "tags": ["Deutsche", "Welle", "Made", "in", "Germany", "Bio", "Lettland", "Getreide"]
But with video2dataset it looks like this:
"caption": "\"Volles Korn voran\" 28. November 2008 Beitrag \u00fcber den \u00f6kologischen Teil des Ackerbaus von german",
"url": "https://www.youtube.com/embed/JFUsP1coIKM",
"key": "0000000",
"status": "success",
"error_message": null,
"yt_meta_dict": {},
"video_metadata": {...
Are you getting empty yt_meta_dict for just some videos or all of them?
What I am seeing is that for every 300 videos, roughly 100 have yt_meta_dict populated and 200 have yt_meta_dict = {}, which is quite strange.
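For anyone who wants to check the same ratio on their own output, here is a rough sketch that counts populated versus empty yt_meta_dict entries. It assumes the per-sample JSON sidecars are plain files on disk under one output directory and that each carries a "key" field like the excerpt above; count_yt_meta is a made-up helper name, not part of video2dataset.

```python
import json
from collections import Counter
from pathlib import Path


def count_yt_meta(output_dir):
    """Count JSON sidecars with a populated vs. an empty yt_meta_dict.

    Hypothetical helper: assumes sidecars are plain .json files under
    output_dir and that each per-sample sidecar has a "key" field.
    """
    counts = Counter()
    for path in sorted(Path(output_dir).rglob("*.json")):
        with open(path, encoding="utf-8") as f:
            meta = json.load(f)
        # Skip JSON files that are not per-sample sidecars (no "key" field).
        if not isinstance(meta, dict) or "key" not in meta:
            continue
        # An empty dict (or a missing entry) counts as "empty".
        counts["populated" if meta.get("yt_meta_dict") else "empty"] += 1
    return counts


if __name__ == "__main__":
    import sys

    print(dict(count_yt_meta(sys.argv[1])))
```

Running it with the output folder as the only argument prints the two counts, which makes it easier to see whether the split follows a pattern.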
What exactly does ignoring errors in yt_dlp mean? Does it give up on the first failed attempt even if retries are configured?
https://github.com/iejMac/video2dataset/blob/28e7d1c851a2298f3a75375f6e324950405987e7/video2dataset/data_reader.py#L74
Other yt_dlp codepaths don't seem to set this.
Ahh! Now I understand what happens: with multiple clips, only the first one (_00000.json) will have yt_meta_dict populated, not the following clips.
It seems this change was introduced by the clipping subsampler refactoring (#275). Did it behave differently in v1.2.0?
https://github.com/iejMac/video2dataset/blob/28e7d1c851a2298f3a75375f6e324950405987e7/video2dataset/subsamplers/clipping_subsampler.py#L181-L183
I am not sure if this is a good idea. Depending on your processing pipeline, you might want to have the same metadata available on all the clips.
I agree, duplicating the metadata makes more sense, especially given the size of the data.
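As a stopgap until the behaviour is decided, the metadata could be copied onto the other clips in a post-processing step. The following is only a sketch of that idea, not the clipping subsampler's actual code: propagate_yt_meta is a hypothetical helper, and it assumes the metadata dicts of all clips from one source video have already been loaded into a list.

```python
import copy


def propagate_yt_meta(clip_metas):
    """Copy the first populated yt_meta_dict onto every clip that lacks one.

    Hypothetical helper: clip_metas is a list of metadata dicts, one per
    clip, all cut from the same source video. The list is modified in
    place and also returned.
    """
    source = next(
        (m["yt_meta_dict"] for m in clip_metas if m.get("yt_meta_dict")),
        None,
    )
    if source is None:
        # No clip has metadata, so there is nothing to duplicate.
        return clip_metas
    for meta in clip_metas:
        if not meta.get("yt_meta_dict"):
            # Deep copy so that clips do not share one mutable dict.
            meta["yt_meta_dict"] = copy.deepcopy(source)
    return clip_metas
```

The deep copy keeps each clip's metadata independent, so a later edit to one clip does not silently change the others.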