warc2zim
Use best icon possible as ZIM illustration
Fix #352
Changes in logic around finding the ZIM illustration:
- build a sorted list of potential icons to use (instead of just a "random" one)
- prefer the best icon in this list, taken either from the WARC (preferred, if available) or from a download (if available)
- if both fail for the best icon, process the next icon in the list (and so forth, until one works)
- if all fail, fall back to the default scraperlib illustration
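The fallback order described above can be sketched roughly as follows. This is only an illustration, not the code from this PR: `IconCandidate`, `pick_illustration`, and the two fetcher callables are hypothetical names, and sorting by declared size is an assumption about what "best" means.

```python
from dataclasses import dataclass
from typing import Callable, Iterable, Optional

# Hypothetical fetcher type: returns the icon bytes, or None if unavailable.
Fetcher = Callable[[str], Optional[bytes]]


@dataclass(frozen=True)
class IconCandidate:
    url: str
    size: int  # declared edge length in pixels (assumption: 0 when unknown)


def pick_illustration(
    candidates: Iterable[IconCandidate],
    from_warc: Fetcher,
    from_download: Fetcher,
    default: bytes,
) -> bytes:
    """Return the content of the best usable icon, or the default one."""
    # Best candidate first (assumption: bigger declared size is better).
    ordered = sorted(candidates, key=lambda c: c.size, reverse=True)
    for candidate in ordered:
        # For each candidate, try the WARC first, then a download.
        for fetch in (from_warc, from_download):
            try:
                content = fetch(candidate.url)
            except Exception:
                # Any failure just means "try the next source or candidate".
                content = None
            if content:
                return content
    # Nothing worked: fall back to the default illustration.
    return default
```

Passing the two sources as callables keeps the ordering logic independent of how the WARC is read or how the download is performed.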
Searching inside the WARC is still preferred to download because it is expected to be quite fast: we already know the list of expected items and the icon is usually at the very beginning of the WARC (when present).
Note that this change significantly contradicts what was discussed and decided in https://github.com/openzim/warc2zim/issues/202, since there is now a significant chance we will download the illustration.
In #202, we said that there is probably no situation where the best icon is not already present inside the WARC and should be downloaded. This is wrong.
This change is grounded in a real use case: https://womenshistory.si.edu/. In this use case, only two icons were fetched by the crawler and are present in the WARC:
- https://womenshistory.si.edu//sites/default/themes/si_sawhm/favicons/favicon.ico (which is 16x16)
- https://womenshistory.si.edu//sites/default/themes/si_sawhm/favicons/favicon-32x32.png (which is 32x32)
Both are too small for a ZIM illustration and would need to be upscaled. This is not appropriate, because the scraper can know that the best icon available is https://womenshistory.si.edu//sites/default/themes/si_sawhm/favicons/android-chrome-192x192.png, and this file is available for download. With the change in this PR, the scraper now downloads and uses that icon.
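To make the use case concrete, here is a small sketch that ranks the three icons mentioned above. The 48-pixel target is an assumption about the illustration size, and `needs_upscaling` is a hypothetical helper, not a warc2zim function.

```python
# Assumption: the ZIM illustration is a 48x48 image.
TARGET_SIZE = 48

BASE = "https://womenshistory.si.edu//sites/default/themes/si_sawhm/favicons/"

# Icon URL -> edge length in pixels, as stated in the description above.
icons = {
    BASE + "favicon.ico": 16,
    BASE + "favicon-32x32.png": 32,
    BASE + "android-chrome-192x192.png": 192,
}


def needs_upscaling(size: int, target: int = TARGET_SIZE) -> bool:
    """True when the icon is smaller than the illustration size."""
    return size < target


# Largest icon first, so the first entry is the preferred candidate.
ranked = sorted(icons, key=icons.get, reverse=True)
best = ranked[0]

for url in ranked:
    print(url, icons[url], "upscale" if needs_upscaling(icons[url]) else "ok")
```

Only the 192x192 icon avoids upscaling here, and it is the one that is absent from the WARC, which is why a download is needed.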