Rework or remove the `playlist_mode`

Open benoit74 opened this issue 1 year ago • 2 comments

This scraper has a playlist_mode which creates one ZIM per playlist found in a given YouTube user / channel.

This mode is convenient for creating many ZIMs at once, but it poses a metadata quality problem, since titles, descriptions, etc. are sourced automatically from YouTube.

With the move to scraperlib 3.x, creating a ZIM with an invalid title, description, etc. will fail. Unfortunately, this check runs only at the end of the scrape, because we still use the "zimwriterfs" mode and call make_zim_file at the end of the scraper, after all videos have been downloaded and re-encoded.

We should either:

  • invest time in reworking the functionality: compute only valid metadata (even if the result is expected to be limited) and check its validity before starting any content processing
  • remove the functionality, since we won't use it on the Zimfarm, we do not have sufficient resources for the rework, and the rework would either produce results of limited quality or require vast effort (e.g. computing a 30-character Title automatically is not an easy feat)
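For illustration, the "check validity before starting any content processing" part of the first option could look roughly like the sketch below. It is not scraperlib code: `PlaylistMetadata`, `find_metadata_problems` and `check_all_playlists` are hypothetical names, the 30-character Title limit is the one mentioned above, and the 80-character Description limit is an assumption.

```python
from dataclasses import dataclass

MAX_TITLE_LENGTH = 30  # limit mentioned in this issue
MAX_DESCRIPTION_LENGTH = 80  # assumption, not confirmed in this issue


@dataclass
class PlaylistMetadata:
    """Hypothetical container for metadata sourced from one playlist."""

    playlist_id: str
    title: str
    description: str


def find_metadata_problems(meta: PlaylistMetadata) -> list[str]:
    """Return human-readable problems; an empty list means the metadata is valid."""
    problems = []
    for name, value, limit in (
        ("title", meta.title, MAX_TITLE_LENGTH),
        ("description", meta.description, MAX_DESCRIPTION_LENGTH),
    ):
        if not value.strip():
            problems.append(f"{name} is empty")
        elif len(value) > limit:
            problems.append(f"{name} has {len(value)} chars (max {limit})")
    return problems


def check_all_playlists(
    playlists: list[PlaylistMetadata],
) -> dict[str, list[str]]:
    """Map playlist_id to its problems, keeping only invalid playlists."""
    report = {}
    for meta in playlists:
        problems = find_metadata_problems(meta)
        if problems:
            report[meta.playlist_id] = problems
    return report
```

Running `check_all_playlists` right after the playlists are listed, and aborting when the report is not empty, would surface invalid metadata before any video is downloaded or re-encoded.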

This is in fact a blocker for #175 (unless we accept a functionality which will not work in 90% of cases).

benoit74 avatar Dec 19 '23 09:12 benoit74

As discussed live and proposed in https://github.com/openzim/python-scraperlib/issues/119, we could just disable the metadata check in scraperlib.

This could be an opt-in flag in general, and the default when using playlist mode. We could also display a warning when metadata is not valid.

This would let us keep supporting this mode for those who want to create their own ZIMs, while still ensuring metadata quality for openZIM files. It would also allow us to upgrade to 3.x in an elegant way.

@kelson42 WDYT?

benoit74 avatar Dec 19 '23 13:12 benoit74

This approach has been implemented in the TED scraper: https://github.com/openzim/ted/pull/170

benoit74 avatar Mar 25 '24 09:03 benoit74