Cache subtitles in S3 storage
This PR modifies the download_subtitles method to cache subtitles in S3. The modified method now works as follows:
- Fetch requested subtitle keys using `yt-dlp` (e.g., `en`, `fr`, `de`) and store them in `requested_subtitle_keys`.
- Iterate through the `requested_subtitle_keys` and attempt to download each subtitle file from the S3 cache. If a file is successfully downloaded from the S3 cache, remove the corresponding key from `requested_subtitle_keys`.
- For the remaining keys in `requested_subtitle_keys`, download the subtitles using `yt-dlp`.
- Upload the newly downloaded subtitles to the S3 cache.
- Save the information about the fetched subtitles in a local cache as a JSON file for future use.
- Add the downloaded subtitles to the ZIM file.
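The cache-first part of the flow above can be sketched roughly as follows. This is only an illustration, not the code in this PR: `InMemoryCache`, `subtitle_cache_key`, `fetch_subtitles`, the `download_with_ytdlp` callback, and the `.vtt` key layout are all hypothetical names chosen for the example, and the in-memory dict stands in for the real S3 client.

```python
from typing import Callable, Dict, Iterable, List, Optional


class InMemoryCache:
    """Hypothetical stand-in for the S3 cache (a real run would wrap an S3 client)."""

    def __init__(self) -> None:
        self._objects: Dict[str, bytes] = {}

    def get(self, key: str) -> Optional[bytes]:
        # Returns None on a cache miss.
        return self._objects.get(key)

    def put(self, key: str, data: bytes) -> None:
        self._objects[key] = data


def subtitle_cache_key(video_id: str, lang: str) -> str:
    # Assumed key layout and file extension; the PR may use a different one.
    return f"subtitles/{video_id}/{lang}.vtt"


def fetch_subtitles(
    video_id: str,
    requested_subtitle_keys: Iterable[str],
    cache: InMemoryCache,
    download_with_ytdlp: Callable[[str, List[str]], Dict[str, bytes]],
) -> Dict[str, bytes]:
    """Serve subtitles from the cache first, then download only what is missing."""
    subtitles: Dict[str, bytes] = {}
    remaining: List[str] = []

    # Try the cache for every requested language.
    for lang in requested_subtitle_keys:
        cached = cache.get(subtitle_cache_key(video_id, lang))
        if cached is not None:
            subtitles[lang] = cached
        else:
            remaining.append(lang)

    # Download the languages that were not cached, then upload them.
    if remaining:
        for lang, data in download_with_ytdlp(video_id, remaining).items():
            cache.put(subtitle_cache_key(video_id, lang), data)
            subtitles[lang] = data

    return subtitles
```

Passing the downloader in as a callback keeps the sketch independent of how `yt-dlp` is actually invoked, and it makes the "skip the download when everything is cached" behaviour easy to check in a test.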
Close #277
Codecov Report
Attention: Patch coverage is 0% with 34 lines in your changes missing coverage. Please review.
Project coverage is 1.50%. Comparing base (60b85b5) to head (d34a734).
| Files | Patch % | Lines |
|---|---|---|
| scraper/src/youtube2zim/scraper.py | 0.00% | 34 Missing :warning: |
Additional details and impacted files
@@ Coverage Diff @@
## main #287 +/- ##
========================================
- Coverage 1.54% 1.50% -0.05%
========================================
Files 11 11
Lines 1102 1132 +30
Branches 162 170 +8
========================================
Hits 17 17
- Misses 1085 1115 +30
The original goal I expressed in the issue was to avoid being blocked by a yt-dlp ban when subtitles have not changed.
The reason I used yt-dlp to get the list of subtitles is that the scraper doesn't know which language subtitles need to be downloaded. Without the --all-subtitles flag, only the default available subtitles for the video are downloaded. With the --all-subtitles flag, auto-generated subtitles are included as well.
To avoid calling yt-dlp entirely in this scenario, we could save two zipped files in the S3 cache:
- `subtitles/{video_id}/default.zip` - containing only the default subtitles.
- `subtitles/{video_id}/all.zip` - containing all subtitles when `--all-subtitles` is passed.
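To make the proposal concrete, here is a rough sketch of how the two archive keys and the zip payload could be built. The helper names (`subtitles_archive_key`, `pack_subtitles`, `unpack_subtitles`) and the `{lang}.vtt` member naming are assumptions made for illustration; only the two key paths come from the proposal itself.

```python
import io
import zipfile
from typing import Dict


def subtitles_archive_key(video_id: str, all_subtitles: bool) -> str:
    # Picks one of the two proposed cache keys for a video.
    variant = "all" if all_subtitles else "default"
    return f"subtitles/{video_id}/{variant}.zip"


def pack_subtitles(subtitles: Dict[str, bytes]) -> bytes:
    """Zip subtitles in memory, one member per language (assumed '<lang>.vtt' names)."""
    buffer = io.BytesIO()
    with zipfile.ZipFile(buffer, "w", zipfile.ZIP_DEFLATED) as archive:
        for lang, data in sorted(subtitles.items()):
            archive.writestr(f"{lang}.vtt", data)
    return buffer.getvalue()


def unpack_subtitles(payload: bytes) -> Dict[str, bytes]:
    """Inverse of pack_subtitles: map each language key back to its content."""
    with zipfile.ZipFile(io.BytesIO(payload)) as archive:
        return {
            name.rsplit(".", 1)[0]: archive.read(name)
            for name in archive.namelist()
        }
```

With this layout a cache hit on the archive key would give both the list of languages and their content in a single request, so `yt-dlp` would not need to be called at all for that video.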
WDYT?
This seems like a potential improvement, but does it really help the scraper run faster? It also has the drawback that we do not know when we should invalidate these archives to update them.
Let's pause this issue/PR so that I can reflect on this a bit.