Cache subtitles in S3 storage

Open dan-niles opened this issue 1 year ago • 4 comments

This PR modifies the download_subtitles method to cache subtitles in S3. The modified method now works as follows:

  • Fetch requested subtitle keys using yt-dlp (e.g., en, fr, de) and store them in requested_subtitle_keys.
  • Iterate through the requested_subtitle_keys and attempt to download each subtitle file from the S3 cache. If a file is successfully downloaded from the S3 cache, remove the corresponding key from requested_subtitle_keys.
  • For the remaining keys in requested_subtitle_keys, download the subtitles using yt-dlp.
  • Upload the newly downloaded subtitles to the S3 cache.
  • Save the information about the fetched subtitles in a local cache as a JSON file for future use.
  • Add the downloaded subtitles to the ZIM file.
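
For reviewers, the cache-first flow described above can be sketched roughly as follows. This is not the code in this PR: `download_subtitles_cached`, the `s3_cache` dict (a stand-in for the S3 bucket), the `fetch_with_ytdlp` callable (a stand-in for the yt-dlp call), and the per-language key layout are all hypothetical names chosen for illustration.

```python
import json
from pathlib import Path
from typing import Callable


def download_subtitles_cached(
    video_id: str,
    requested_subtitle_keys: list[str],
    s3_cache: dict[str, bytes],  # hypothetical stand-in for the S3 bucket
    fetch_with_ytdlp: Callable[[str, str], bytes],  # hypothetical yt-dlp wrapper
    out_dir: Path,
) -> dict[str, list[str]]:
    """Try the cache for each language first, then fall back to the downloader."""
    out_dir.mkdir(parents=True, exist_ok=True)
    from_cache: list[str] = []
    from_ytdlp: list[str] = []

    for lang in requested_subtitle_keys:
        s3_key = f"subtitles/{video_id}/{lang}.vtt"  # assumed key layout
        data = s3_cache.get(s3_key)
        if data is None:
            # Cache miss: download, then upload so the next run hits the cache.
            data = fetch_with_ytdlp(video_id, lang)
            s3_cache[s3_key] = data
            from_ytdlp.append(lang)
        else:
            from_cache.append(lang)
        (out_dir / f"video.{lang}.vtt").write_bytes(data)

    # Local JSON record of the fetched subtitles, kept for later runs.
    record = {"subtitles": list(requested_subtitle_keys)}
    (out_dir / "subtitles.json").write_text(json.dumps(record))
    return {"from_cache": from_cache, "from_ytdlp": from_ytdlp}
```

Note that this sketch still needs the list of requested keys up front, which in this PR comes from yt-dlp.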

Close #277

dan-niles avatar Aug 04 '24 07:08 dan-niles

Codecov Report

Attention: Patch coverage is 0% with 34 lines in your changes missing coverage. Please review.

Project coverage is 1.50%. Comparing base (60b85b5) to head (d34a734).

| Files | Patch % | Lines |
|---|---|---|
| scraper/src/youtube2zim/scraper.py | 0.00% | 34 Missing :warning: |
Additional details and impacted files
@@           Coverage Diff            @@
##            main    #287      +/-   ##
========================================
- Coverage   1.54%   1.50%   -0.05%     
========================================
  Files         11      11              
  Lines       1102    1132      +30     
  Branches     162     170       +8     
========================================
  Hits          17      17              
- Misses      1085    1115      +30     

codecov[bot] avatar Aug 04 '24 07:08 codecov[bot]

The original goal I expressed in the issue was to avoid being blocked by a yt-dlp ban when subtitles have not changed.

The reason I used yt-dlp to get the list of subtitles is that the scraper doesn't know which subtitle languages need to be downloaded. Without the --all-subtitles flag, only the default available subtitles for the video are downloaded. With the --all-subtitles flag, auto-generated subtitles are included as well.

To avoid calling yt-dlp entirely in this scenario, we could save two zipped files in the S3 cache:

  1. subtitles/{video_id}/default.zip - containing only the default subtitles.
  2. subtitles/{video_id}/all.zip - containing all subtitles when --all-subtitles is passed.
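
A minimal sketch of how the two archives could be keyed and packed, using only the standard library. The helper names (`subtitles_archive_key`, `pack_subtitles`, `unpack_subtitles`) are hypothetical, and the upload/download to S3 itself is left out:

```python
import io
import zipfile


def subtitles_archive_key(video_id: str, all_subtitles: bool) -> str:
    """Build the S3 key for the archive matching the --all-subtitles setting."""
    variant = "all" if all_subtitles else "default"
    return f"subtitles/{video_id}/{variant}.zip"


def pack_subtitles(files: dict[str, bytes]) -> bytes:
    """Zip subtitle files (name -> content) into a single in-memory blob."""
    buffer = io.BytesIO()
    with zipfile.ZipFile(buffer, "w", zipfile.ZIP_DEFLATED) as archive:
        for name, data in files.items():
            archive.writestr(name, data)
    return buffer.getvalue()


def unpack_subtitles(blob: bytes) -> dict[str, bytes]:
    """Read a zipped blob back into a name -> content mapping."""
    with zipfile.ZipFile(io.BytesIO(blob)) as archive:
        return {name: archive.read(name) for name in archive.namelist()}
```

With this layout, a cache hit on the archive would give both the list of languages and the subtitle contents in one request, so yt-dlp would only be called on a miss.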

WDYT?

dan-niles avatar Aug 05 '24 11:08 dan-niles

This seems like a potential improvement, but does it really help the scraper run faster? It also has the drawback that we would not know when to invalidate these archives in order to update them.

benoit74 avatar Aug 05 '24 11:08 benoit74

Let's pause this issue/PR so I can reflect on this a bit.

benoit74 avatar Aug 05 '24 12:08 benoit74