
Cache subtitles on S3 as well

Open benoit74 opened this issue 1 year ago • 3 comments

Currently, only video thumbnails and the videos themselves are cached on S3.

This has the drawback that when an IP has been blacklisted from yt-dlp usage, the recipe fails to produce the ZIM even if all API calls have succeeded, because we use yt-dlp to download the subtitles.

Caching the subtitles on S3 would allow the ZIM to be created anyway.

benoit74 avatar Jul 23 '24 14:07 benoit74

Seems like a good idea, but are subtitles served properly with etags?

kelson42 avatar Jul 23 '24 15:07 kelson42

Currently we're using yt-dlp to download subtitles, and etags are not provided for subtitles. The relevant part of the response has the following format:

"requested_subtitles": {
  "en": {
    "ext": "vtt",
    "url": "https://www.youtube.com/api/timedtext?v=DYvYGQHYScc&ei=rzKqZouKCqfWz7sPiu_E2Qw&caps=asr&opi=112496729&xoaf=5&hl=en&ip=0.0.0.0&ipbits=0&expire=1722455327&sparams=ip%2Cipbits%2Cexpire%2Cv%2Cei%2Ccaps%2Copi%2Cxoaf&signature=D55586A99B8028F2565AFE1F76F3F55D8BE2ECA6.E032AF517474302C806EE8A02C6CDC914CD903B9&key=yt8&lang=en&fmt=vtt",
    "name": "English"
  }
},

However, the YouTube Data API (https://developers.google.com/youtube/v3/docs/captions#resource-representation) does provide etags for captions.
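
A minimal sketch of how those etags could be collected, assuming a response shaped like the caption resources in the linked documentation (each item carrying an `etag` and a `snippet.language`). The helper name `caption_etags` and all sample values are hypothetical, not taken from the scraper:

```python
def caption_etags(captions_response: dict) -> dict[str, str]:
    """Map each caption language to the etag of its caption resource."""
    etags: dict[str, str] = {}
    for item in captions_response.get("items", []):
        language = item["snippet"]["language"]
        # Several tracks may share a language (e.g. a manual and an ASR track);
        # this sketch simply keeps the first one it sees.
        etags.setdefault(language, item["etag"])
    return etags


# Hypothetical response body, trimmed to the fields used above.
sample_response = {
    "kind": "youtube#captionListResponse",
    "items": [
        {"kind": "youtube#caption", "etag": "etag-en-1", "id": "caption-1",
         "snippet": {"videoId": "VIDEO_ID", "language": "en", "trackKind": "standard"}},
        {"kind": "youtube#caption", "etag": "etag-fr-1", "id": "caption-2",
         "snippet": {"videoId": "VIDEO_ID", "language": "fr", "trackKind": "standard"}},
    ],
}

print(caption_etags(sample_response))
```

Such a mapping could then be compared against whatever etag is stored alongside a cached subtitle file, though whether that fits the existing S3 caching logic is not established in this thread.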

dan-niles avatar Jul 31 '24 13:07 dan-niles

@benoit74 and I discussed the possibility of hashing the URL of each subtitle provided by yt-dlp and using it as an etag. However, it seems that this URL changes every time it is fetched by yt-dlp.

I tried manually editing the subtitles of this video on the openZIM_testing YouTube channel to observe how the URL is affected. However, it appears that YouTube fetches the latest subtitles internally, and the query parameters in the URL don't seem to have an impact.
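
The two observations above can be illustrated with a small sketch. The URLs below are shortened, hypothetical stand-ins for what yt-dlp returns, and the assumption that `expire` and `signature` are the parts that vary between fetches is only for illustration; `url_hash` and `stable_key` are made-up helper names:

```python
import hashlib
from urllib.parse import parse_qs, urlsplit


def url_hash(url: str) -> str:
    """Hash the full subtitle URL, as in the idea discussed above."""
    return hashlib.sha256(url.encode("utf-8")).hexdigest()


def stable_key(url: str, keep=("v", "lang", "fmt")) -> str:
    """Hash only the query parameters assumed to stay constant across fetches."""
    query = parse_qs(urlsplit(url).query)
    parts = [f"{name}={query[name][0]}" for name in keep if name in query]
    return hashlib.sha256("&".join(parts).encode("utf-8")).hexdigest()


# Hypothetical URLs for the same video and language, fetched at two different times.
first_fetch = (
    "https://www.youtube.com/api/timedtext"
    "?v=VIDEO_ID&expire=1722455327&signature=AAAA&lang=en&fmt=vtt"
)
second_fetch = (
    "https://www.youtube.com/api/timedtext"
    "?v=VIDEO_ID&expire=1722458927&signature=BBBB&lang=en&fmt=vtt"
)

print(url_hash(first_fetch) == url_hash(second_fetch))      # False
print(stable_key(first_fetch) == stable_key(second_fetch))  # True
```

Hashing the full URL never matches across fetches, so it cannot act as a cache key. Hashing only the constant parameters gives a stable key, but since editing the subtitles did not visibly affect the query parameters, that key would not reveal when the subtitle content has changed.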

dan-niles avatar Jul 31 '24 16:07 dan-niles