python-scraperlib: Video-related primitives should be provided in scraperlib

We are publishing more and more ZIM files with videos using many different scrapers.

To do that we mainly:

Re-encode video/audio streams
Handle subtitles
Display all of this using video.js

For the moment, many of these video-related functions are scattered across different places (in each scraper relying on them).

At least to me, this causes several problems:

It is difficult to understand where each scraper stands with regard to the development of these pieces of code
Fixes have to be requested/tracked in each scraper
Releases are de facto made at very different times

I'm not prescriptive about the exact solution, but I believe we should try to consolidate this in one place.

Sep 19 '24 18:09 kelson42

For me this is a problem of dependency management: which software is using which version of which library.

Would a dashboard of dependency versions per scraper help? I'm not sure it is sufficient, because one still needs to know that problem x has been fixed in version x.y of dependency nnn.

What makes it even harder is that we need a solution which can handle both Python (because video re-encoding is done in scraperlib so we want to track the scraperlib version per scraper) and JS (because display is done with video.js which is ... JS).
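As a rough illustration of covering both ecosystems, here is a minimal sketch that builds one dashboard row per scraper from a requirements-style file and a `package.json`. It assumes exact `==` pins on the Python side and a `dependencies` section on the JS side; the function names and the idea of reading these two files are assumptions for illustration, not something the scrapers necessarily do today.

```python
import json
import re

# Hypothetical helper: only exact `name==version` pins are recognised.
PIN = re.compile(r"([A-Za-z0-9_.-]+)\s*==\s*([A-Za-z0-9_.+!-]+)")


def python_pins(requirements_text):
    """Extract exact pins from requirements-style text, ignoring comments."""
    pins = {}
    for line in requirements_text.splitlines():
        line = line.split("#", 1)[0].strip()
        match = PIN.fullmatch(line)
        if match:
            pins[match.group(1).lower()] = match.group(2)
    return pins


def js_pins(package_json_text):
    """Read the `dependencies` section of a package.json document."""
    data = json.loads(package_json_text)
    return dict(data.get("dependencies", {}))


def dependency_row(scraper, requirements_text, package_json_text):
    """Merge Python and JS pins into one flat row, prefixed by ecosystem."""
    row = {"scraper": scraper}
    for name, version in python_pins(requirements_text).items():
        row[f"python:{name}"] = version
    for name, version in js_pins(package_json_text).items():
        row[f"js:{name}"] = version
    return row
```

Unpinned or range requirements (for example `requests>=2.0`) are skipped by this sketch, which is one reason such a dashboard alone may not be sufficient.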

If it were up to me, I would propose a very radical solution, because the problems you describe are typically the strongest argument for a mono-repo of all scrapers: all scrapers are at the same level of development, fixes have to be requested and tracked only once, and releases are synchronized. Unfortunately it comes with its own share of drawbacks.

Sep 20 '24 06:09 benoit74

@benoit74 a monorepo is a no-go to me. For the rest, I'm very open. If no technical solution can be found, we should at least have a procedural approach.

Oct 21 '24 13:10 kelson42

Since a monorepo is a no-go, only a procedural approach / tooling can solve the problem you're describing. Because you would like an overview of the situation, we will always need some kind of dashboard that lets us:

track issues known to impact many ZIMs / scrapers, whether each has to be fixed by a dependency update (with the minimum required version that contains the fix) or by a code change at scraper level, and, once fixed at scraper level, which scraper version contains the fix
track which ZIM is built with which scraper version (because even if the scraper is released, the problem is not fixed for our users until the ZIM is updated)
track which versions of dependencies are used in which scraper versions (because fixing a problem in a shared dependency is great, but it does not fix the end problem until the fix is used and released in the scrapers)

This is what it takes to be able to quickly say something like "issue xxx is fixed by updating dependency xyz to version x.y.z, this version has been deployed in scraper aaa version x.y.z and scraper bbb y.z.x, not yet in other scrapers, and we have xxx ZIMs using version aaa or newer, but zzz ZIMs still using older versions".
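To make the three kinds of tracking concrete, here is a minimal sketch of the query such a dashboard would need to answer. Everything in it is hypothetical: the `Fix` record, the shape of the two mappings, and the assumption that versions are plain dotted numbers (a real tool would more likely rely on a proper version parser).

```python
from dataclasses import dataclass


def parse_version(text):
    """Turn '3.10.1' into (3, 10, 1) so versions compare numerically."""
    return tuple(int(part) for part in text.split("."))


@dataclass(frozen=True)
class Fix:
    """An issue fixed by a dependency update (hypothetical record)."""
    issue: str
    dependency: str
    min_version: str


def scraper_versions_with_fix(fix, scraper_deps):
    """scraper_deps: scraper -> scraper version -> {dependency: version}."""
    result = {}
    for scraper, releases in scraper_deps.items():
        fixed = [
            release
            for release, deps in releases.items()
            if fix.dependency in deps
            and parse_version(deps[fix.dependency]) >= parse_version(fix.min_version)
        ]
        result[scraper] = sorted(fixed, key=parse_version)
    return result


def zim_status(fix, scraper_deps, zim_builds):
    """zim_builds: ZIM name -> (scraper, scraper version) used to build it.

    Returns two sorted lists: ZIMs already carrying the fix, and ZIMs
    still built with a scraper version that predates it.
    """
    fixed_releases = scraper_versions_with_fix(fix, scraper_deps)
    fixed, pending = [], []
    for zim, (scraper, release) in zim_builds.items():
        if release in fixed_releases.get(scraper, []):
            fixed.append(zim)
        else:
            pending.append(zim)
    return sorted(fixed), sorted(pending)
```

The query itself is small. The real cost is in collecting and maintaining the two mappings for every scraper release and every ZIM build.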

To me this is not a small thing to build / deploy because we need tooling for that. We need to find funding to develop/configure this tooling / procedures.

Without that funding / tooling, we are back to square one, doing all this manually whenever the need arises.

Oct 22 '24 06:10 benoit74