
Allow to specify the ZIM scraper metadata at the CLI

Open benoit74 opened this issue 1 year ago • 5 comments

Currently, warc2zim sets the ZIM scraper metadata as warc2zim x.y.z (where x.y.z is warc2zim version).

It would be beneficial if we could instead specify the scraper at the CLI for cases where warc2zim is only used as a dependency of another scraper, since in those cases it is not entirely correct to say that warc2zim is the scraper.

E.g. when it is zimit that is "using" warc2zim, we would pass zimit u.v.w as --scraper and the final ZIM scraper metadata would become zimit u.v.w

The warc2zim version would no longer be visible in the ZIM, but the same is true of any other scraper dependency. It would still be possible to find the warc2zim version from the zimit version, since it is either stated explicitly in the Zimit changelog or can at least be found in the Zimit git history.

benoit74 avatar Jan 25 '24 07:01 benoit74

hum, warc2zim is not a dependency, it's the scraper. It's zimit that is a wrapper that calls two tools. zimit version is mostly useless but what you want is to know the crawler version I guess.

I think it would be easy and more useful to combine both. It may change in the future (and then we could simplify) but at the moment and for a long time we've used :dev zimit images which would defeat the purpose of this ticket.

This can be passed as a --scraper param as you suggest but the value would include both versions (even including browsertrix's one would be better IMO)

rgaudin avatar Jan 25 '24 10:01 rgaudin

If there is any change in how the scraper is named, I'll need to know, because it's one of the ways I detect zimit2. Basically, the method I'm currently using is:

  1. If the file declares a MIME type of warc-headers, it must be Zimit classic (I know this could change in the future if we decide to reintroduce headers, but for now, that's a pretty safe assumption I think)
  2. If the scraper is declared as warc2zim (I could also use zimit) and there are no warc-headers, then it (currently) must be zimit2
  3. Otherwise, it's an openZIM format
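The three rules above can be sketched as a small function. This is only an illustration of the decision order, not Kiwix JS code: the function name, the return labels, and the way the MIME list and Scraper string are passed in are all assumptions.

```python
def classify_zim(mime_types, scraper):
    """Classify an archive using the three rules listed above.

    mime_types: iterable of MIME type strings declared by the ZIM (assumed input).
    scraper: value of the Scraper metadata, or None if absent (assumed input).
    The return labels are hypothetical names chosen for this sketch.
    """
    # Rule 1: a declared warc-headers MIME type means Zimit classic.
    if any("warc-headers" in mime for mime in mime_types):
        return "zimit-classic"
    # Rule 2: scraper declared as warc2zim and no warc-headers means zimit2.
    if (scraper or "").lower().startswith("warc2zim"):
        return "zimit2"
    # Rule 3: anything else is treated as a regular openZIM archive.
    return "openzim"
```

Because rule 2 keys on the Scraper string, any change to how that string is built would change the result of the second branch.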

I need to detect zimit2 because I don't benefit from the libkiwix patch that helps to determine if an external link really is an external link, given Wombat's aggressive rewriting. I don't want to do a lookup to see if wombat is in the ZIM (another detection method) because it would be too slow on some platforms to do this at the time that I am parsing the ZIM (on launch). There are some other KJS overrides that need me to be able to detect zimit2 as well (PDFs and the sandbox, for example; insertion of DarkReader scripts in the correct iframe is another one).

Hence, I need to check the Scraper metadata (or some other metadata) that allows me to distinguish quickly.

I think the current choice of warc2zim x.x.x is fine (it works for my purposes), but so long as the name makes it clear in some way that it is a zimit-family ZIM, and remains stable and reliable for all such ZIMs, I can use it.

Jaifroid avatar Jan 25 '24 10:01 Jaifroid

You could look for warc2zim and zimit strings in Scraper metadata and if _sw:yes is not in tags, that's a zimit2.
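A minimal sketch of that check, assuming the Tags metadata is a semicolon-separated string and reading "warc2zim and zimit strings" as "either string is present". The function name and argument shapes are hypothetical.

```python
def is_zimit2(scraper, tags):
    """Return True when the metadata looks like a zimit2 archive.

    scraper: value of the Scraper metadata, or None (assumed input).
    tags: value of the Tags metadata as one string, assumed to be
          semicolon-separated, or None.
    """
    scraper_lc = (scraper or "").lower()
    tag_list = [tag.strip() for tag in (tags or "").split(";")]
    # Zimit-family archives mention warc2zim and/or zimit in Scraper.
    zimit_family = "warc2zim" in scraper_lc or "zimit" in scraper_lc
    # The suggestion is that zimit2 archives do not carry the _sw:yes tag.
    return zimit_family and "_sw:yes" not in tag_list
```

The check only works once zimit2 archives stop carrying the _sw:yes tag.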

@mgautierfr this line should be fixed ASAP

rgaudin avatar Jan 25 '24 10:01 rgaudin

You could look for warc2zim and zimit strings in Scraper metadata and if _sw:yes is not in tags, that's a zimit2.

Thanks, that's a helpful alternative suggestion! I hadn't noticed those tag metadata. I'll explore that for future releases.

EDIT: I confirm that current zimit2 dev archives have _sw:yes in their tags metadata, but that editing the identified line above should fix it.

Jaifroid avatar Jan 25 '24 11:01 Jaifroid

I think it would be easy and more useful to combine both.

I'm fine with the idea of combining both; in fact I had hesitated about that myself. I would just name the CLI arg --scraper-suffix so that it is clearer that it will be combined with the warc2zim value.
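To make the proposal concrete, here is a rough sketch of how a --scraper-suffix value could be combined with the warc2zim value. It is not the actual warc2zim code: the function name, the version placeholder, and the single-space separator are assumptions.

```python
import argparse

# Placeholder only; a real tool would read its own version.
WARC2ZIM_VERSION = "x.y.z"


def build_scraper_metadata(argv):
    """Build the Scraper metadata string from CLI arguments (hypothetical helper)."""
    parser = argparse.ArgumentParser(prog="warc2zim-sketch")
    parser.add_argument(
        "--scraper-suffix",
        default="",
        help="text appended to the warc2zim scraper value",
    )
    args = parser.parse_args(argv)

    scraper = f"warc2zim {WARC2ZIM_VERSION}"
    if args.scraper_suffix:
        # The separator is an assumption; the thread does not specify one.
        scraper = f"{scraper} {args.scraper_suffix}"
    return scraper
```

For example, `build_scraper_metadata(["--scraper-suffix", "zimit u.v.w"])` keeps the warc2zim prefix, so consumers that look for warc2zim at the start of the Scraper metadata would keep working.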

It may change in the future (and then we could simplify) but at the moment and for a long time we've used :dev zimit images which would defeat the purpose of this ticket.

This is not true (anymore? my fault?) on farm.openzim.org, but it is on youzim.it

Anyway, even with a dev Docker image, I think the scraper will still report a dev version like 1.3.2-dev0; it might be a bit unclear what dev0 means, but:

  • we could be stricter and recommend incrementing this dev0 at every merge to main in the Zimit scraper (at least for as long as we use the Docker dev tag in production)
  • even if we are not careful and forget, at least we would know that it is something between 1.3.1 and 1.3.2 that created the ZIM
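The "between 1.3.1 and 1.3.2" ordering can be illustrated with a small sort key. This is only a sketch that assumes versions look like 1.3.2 or 1.3.2-dev0; the helper name is hypothetical and it is not a general version parser.

```python
import re


def version_key(version):
    """Sort key placing x.y.z-devN before x.y.z and after the previous release."""
    match = re.fullmatch(r"(\d+)\.(\d+)\.(\d+)(?:-dev(\d+))?", version)
    if match is None:
        raise ValueError(f"unsupported version format: {version!r}")
    major, minor, patch, dev = match.groups()
    # Dev builds get stage (0, N); final releases get (1, 0), so they sort last.
    stage = (0, int(dev)) if dev is not None else (1, 0)
    return (int(major), int(minor), int(patch), stage)
```

With this key, incrementing devN at each merge gives a finer ordering, and a forgotten bump still leaves the build correctly placed between the two releases.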

This can be passed as a --scraper param as you suggest but the value would include both versions

I would prefer to name the CLI arg --scraper-suffix so that it is clearer that it will be combined with the warc2zim value.

(even including browsertrix's one would be better IMO)

OK

benoit74 avatar Jan 25 '24 13:01 benoit74

Already done in 2.0.0 with https://github.com/openzim/warc2zim/pull/168

benoit74 avatar Jun 18 '24 07:06 benoit74