warc2zim
Allow to specify the ZIM scraper metadata at the CLI
Currently, warc2zim sets the ZIM scraper metadata as `warc2zim x.y.z` (where `x.y.z` is the warc2zim version).
It would be beneficial if we could instead specify the scraper at the CLI for cases where warc2zim is only used as a dependency of another scraper, since it is then not entirely correct to say that warc2zim is the scraper.
E.g. when it is zimit that is "using" warc2zim, we would pass `zimit u.v.w` as `--scraper` and the final ZIM scraper metadata would become `zimit u.v.w`.
The warc2zim version won't be obvious anymore in the ZIM, just like the version of any other scraper dependency. It is still possible, however, to find the warc2zim version from the zimit version, since it is either explicit in the Zimit changelog or at least recorded in the Zimit git history.
Hum, warc2zim is not a dependency, it's the scraper. zimit is a wrapper that calls two tools. The zimit version is mostly useless; what you want to know is the crawler version, I guess.
I think it would be easy and more useful to combine both. It may change in the future (and then we could simplify) but at the moment, and for a long time, we've used `:dev` zimit images, which would defeat the purpose of this ticket.
This can be passed as a `--scraper` param as you suggest, but the value would include both versions (even including browsertrix's one would be better IMO).
If there is any change in how the scraper is named, I'll need to know, because it's one of the ways I detect zimit2. Basically, the method I'm currently using is:
- If the file declares a MIME type of warc-headers, it must be Zimit classic (I know this could change in the future if we decide to reintroduce headers, but for now, that's a pretty safe assumption I think)
- If the scraper is declared as `warc2zim` (I could also use `zimit`) and there are no warc-headers, then it (currently) must be zimit2
- Otherwise, it's an openZIM format
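The three rules above can be sketched as a small function. This is only an illustration of the decision order: the function name, the return labels, and the idea that the MIME type list and Scraper string have already been read from the archive are all assumptions, not how any reader actually implements it.

```python
from typing import Iterable


def classify_zim(mime_types: Iterable[str], scraper: str) -> str:
    """Classify a ZIM archive using the three rules listed above.

    `mime_types` and `scraper` are assumed to have been read from the
    archive beforehand; how they are read is outside this sketch.
    """
    # Rule 1: a declared warc-headers MIME type implies Zimit classic.
    if any("warc-headers" in mime for mime in mime_types):
        return "zimit-classic"
    # Rule 2: a warc2zim scraper without warc-headers implies zimit2.
    if "warc2zim" in scraper.lower():
        return "zimit2"
    # Rule 3: anything else is treated as a regular openZIM archive.
    return "openzim"
```

Because the rules are checked in order, an archive that declares warc-headers is always reported as Zimit classic, whatever its Scraper value says.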
I need to detect zimit2 because I don't benefit from the libkiwix patch that helps to determine if an external link really is an external link, given Wombat's aggressive rewriting. I don't want to do a lookup to see if wombat is in the ZIM (another detection method) because it would be too slow on some platforms to do this at the time that I am parsing the ZIM (on launch). There are some other KJS overrides that need me to be able to detect zimit2 as well (PDFs and the sandbox, for example; insertion of DarkReader scripts in the correct iframe is another one).
Hence, I need to check the Scraper metadata (or some other metadata) that allows me to distinguish quickly.
I think the current choice of `warc2zim x.x.x` is fine (it works for my purposes), but so long as the name makes it clear in some way that it is a zimit-family ZIM, and remains stable and reliable for all such ZIMs, I can use it.
You could look for `warc2zim` and `zimit` strings in `Scraper` metadata and if `_sw:yes` is not in tags, that's a zimit2.
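A minimal sketch of that check, assuming zimit2 archives do not carry the `_sw:yes` tag and that the Tags metadata is a semicolon-separated string. The function name is hypothetical, and accepting either `warc2zim` or `zimit` in the Scraper value is one reading of the suggestion.

```python
def looks_like_zimit2(scraper: str, tags: str) -> bool:
    """Return True when the metadata suggests a zimit2 archive."""
    scraper_lc = scraper.lower()
    # Assumption: either string in Scraper marks a zimit-family archive.
    from_zimit_family = "warc2zim" in scraper_lc or "zimit" in scraper_lc
    # Assumption: Tags is a semicolon-separated list of tag strings.
    tag_set = {tag.strip() for tag in tags.split(";") if tag.strip()}
    # zimit-family archive without the service-worker tag => zimit2.
    return from_zimit_family and "_sw:yes" not in tag_set
```

Both inputs are plain metadata strings, so this avoids any lookup of entries inside the archive at parse time.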
@mgautierfr this line should be fixed ASAP
You could look for `warc2zim` and `zimit` strings in `Scraper` metadata and if `_sw:yes` is not in tags, that's a zimit2.
Thanks, that's a helpful alternative suggestion! I hadn't noticed those tag metadata. I'll explore that for future releases.
EDIT: I confirm that current zimit2 dev archives have `_sw:yes` in their tags metadata, but that editing the identified line above should fix it.
I think it would be easy and more useful to combine both.
I'm fine with the idea of combining both; indeed, I even had some hesitation myself. I would just rename the CLI arg to `--scraper-suffix` so that it is clearer that it will be combined with the warc2zim value.
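To make the proposal concrete, here is a rough sketch of how a suffix could be combined with the warc2zim value. Only the `--scraper-suffix` flag name comes from this thread; the helper names, the single-space separator, and the version numbers are placeholders, not warc2zim's actual code.

```python
import argparse
from typing import List, Optional


def build_scraper(warc2zim_version: str, suffix: Optional[str] = None) -> str:
    """Combine the warc2zim value with an optional caller-supplied suffix."""
    base = f"warc2zim {warc2zim_version}"
    if not suffix:
        return base
    # Assumption: base value and suffix are joined with a single space.
    return f"{base} {suffix.strip()}"


def parse_args(argv: List[str]) -> argparse.Namespace:
    """Parse only the option discussed here (hypothetical parser)."""
    parser = argparse.ArgumentParser(prog="warc2zim-sketch")
    parser.add_argument(
        "--scraper-suffix",
        default=None,
        help="text appended to the warc2zim value in the Scraper metadata",
    )
    return parser.parse_args(argv)


args = parse_args(["--scraper-suffix", "zimit 1.3.2-dev0"])
print(build_scraper("2.0.0", args.scraper_suffix))
# prints "warc2zim 2.0.0 zimit 1.3.2-dev0"
```

Keeping `warc2zim` at the start of the value means readers that match on that prefix keep working, while the wrapper's version is still recorded.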
It may change in the future (and then we could simplify) but at the moment and for a long time we've used :dev zimit images which would defeat the purpose of this ticket.
This is not true (anymore? my fault?) in farm.openzim.org; but it is on youzim.it
Anyway, even with the `dev` Docker image, I think it will still have a dev version like `1.3.2-dev0`; it might be a bit unclear what `dev0` means, but:
- we could be more strict and recommend increasing this `dev0` at every merge to `main` in the Zimit scraper (at least until we use the Docker `dev` tag in production)
- even if we are not careful and forget, at least we would know that it is something between `1.3.1` and `1.3.2` that created the ZIM
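The ordering argument can be illustrated with a small sort key. This is a sketch that assumes version strings of the form `X.Y.Z` or `X.Y.Z-devN` only; `version_key` is a hypothetical helper, not part of Zimit or warc2zim.

```python
import re
from typing import Tuple


def version_key(version: str) -> Tuple[Tuple[int, ...], int, int]:
    """Sort key placing `X.Y.Z-devN` before the final `X.Y.Z` release."""
    match = re.fullmatch(r"(\d+(?:\.\d+)*)(?:-dev(\d+))?", version)
    if match is None:
        raise ValueError(f"unsupported version string: {version!r}")
    release = tuple(int(part) for part in match.group(1).split("."))
    if match.group(2) is None:
        # Final release: sorts after every dev build of the same release.
        return (release, 1, 0)
    # Dev build: ordered by its dev counter, before the final release.
    return (release, 0, int(match.group(2)))
```

With this key, `1.3.2-dev0` sorts after `1.3.1` and before `1.3.2`, and bumping the counter to `dev1` at the next merge narrows the window further.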
This can be passed as a --scraper param as you suggest but the value would include both version
I would prefer to rename the CLI arg to `--scraper-suffix` so that it is clearer that it will be combined with the warc2zim value.
(even including browsertrix's one would be better IMO)
OK
Already done in 2.0.0 with https://github.com/openzim/warc2zim/pull/168