zimit
Assets inside `<noscript>` tags are not fetched when running a zimit scrape
I was testing the Docker image of openzim/zimit on a Wikivoyage page (one that happens to be missing from our Wikivoyage ZIM, but that is immaterial). Because of the image lazy-loading script used by Wikimedia websites, all images other than the first one on a page are enclosed in `<noscript>...</noscript>` tags. In addition, a placeholder `<span>` tag is used to reserve space for each image.
The Zimit scraper implemented here does not scrape assets enclosed in the `<noscript>` tags. And if the lazy-load script runs inside the ZIM, it attempts to load the image from the online server instead of from the ZIM (but the asset is simply not in the ZIM in any case).
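To check which assets a page hides this way, something like the following sketch could help. It lists the `src` of every `<img>` nested inside a `<noscript>` element, using only the Python standard library. The sample markup, the class name on the placeholder, and the image paths are illustrative assumptions, not copied from a real Wikivoyage page, and this is not how zimit itself processes pages.

```python
from html.parser import HTMLParser


class NoscriptImageCollector(HTMLParser):
    """Collect the src of every <img> found inside a <noscript> element."""

    def __init__(self):
        super().__init__()
        self._noscript_depth = 0  # > 0 while we are inside <noscript>
        self.sources = []

    def handle_starttag(self, tag, attrs):
        if tag == "noscript":
            self._noscript_depth += 1
        elif tag == "img" and self._noscript_depth > 0:
            src = dict(attrs).get("src")
            if src:
                self.sources.append(src)

    def handle_endtag(self, tag):
        if tag == "noscript" and self._noscript_depth > 0:
            self._noscript_depth -= 1


def noscript_image_sources(html):
    """Return the image URLs that are only reachable via <noscript> fallbacks."""
    parser = NoscriptImageCollector()
    parser.feed(html)
    parser.close()
    return parser.sources


# Hypothetical markup imitating the lazy-load pattern described above.
SAMPLE_HTML = """
<img src="/images/first.jpg">
<noscript><img src="/images/second.jpg"></noscript>
<span class="placeholder" data-src="/images/second.jpg"></span>
"""

print(noscript_image_sources(SAMPLE_HTML))
```

Comparing this list against the entries actually present in the ZIM would show which fallback images the scrape missed.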
I think it's immaterial that we have proper ZIMs for Wikimedia sites, as many other sites use these lazy loading techniques, and they seem to defeat the scraper used by Zimit.
Limitation: I ran the scrape locally with a limit of 100 pages, but I don't think that affects the pulling of assets from a particular page.
On further testing, I'm not sure this is 100% true. Some of these images appear to have been scraped, but their `src` as given in the ZIM does not appear to correspond to the file's title (after transforming absolute references to ZIM URLs). This needs further investigation.
I've determined that this issue is still valid. The scraped ZIM contains files with very similar names, which were scraped from the Wikimedia Commons file entry for the image (since each image in Wikimedia ZIMs is hyperlinked to a File:[name] entry on Wikimedia Commons). But those files are different sizes of the image. The images inside `<noscript>` blocks are not themselves scraped UNLESS there is a corresponding image size entry on the Wikimedia Commons File:[name] page.
A further complication arises with the way filenames are encoded, but that is a separate issue.
@Jaifroid Do you have an exact command to run as example?
Yes, I used the following (output path needs tweaking for user's context):
docker run -v C:\Users\jaifroid\Source\Repos\zimit\:/output -w /output --cap-add=SYS_ADMIN --cap-add=NET_ADMIN --shm-size=1gb openzim/zimit zimit --url https://en.m.wikivoyage.org/wiki/Cambridge --name myzimfile --workers 2 --limit 100 --scroll
Only the first two main images of the landing page are in the ZIM. The ZIM contains closely related (but inexactly named) image files for (some of) the missing images because it has scraped the corresponding Wikimedia Commons 'File:...' page for each image.
I think the zimit `--scroll` option is designed to overcome precisely this issue, but it doesn't seem to work. The documentation on its use is ambiguous. Running `zimit --help` we get:
--scroll If set, will autoscroll to bottom of the page
But the readme on GitHub says:
--scroll [N] - if set, will activate a simple auto-scroll behavior on each page to scroll for upto N seconds
Unfortunately, passing a number as N (either as `--scroll 5` or as `--scroll [5]`, though I presume the brackets just mean the value is optional) results in a warc2zim error: `FileNotFoundError: [Errno 2] No such file or directory: '5'`. Clearly zimit is erroneously passing the optional value to warc2zim, which doesn't know what to do with it.
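One way this kind of error can arise is sketched below with Python's `argparse`. This is an assumption about the mechanism, not a copy of zimit's code: the parser, the `split_args` helper, and the `prog` name are hypothetical. If a wrapper declares `--scroll` as a plain boolean flag and forwards any unrecognized arguments to a downstream tool, the stray `5` ends up in the forwarded list, where the downstream tool may take it for a file path. Declaring the flag with an optional value keeps the number in the wrapper.

```python
import argparse


def split_args(argv, scroll_takes_value):
    """Parse wrapper arguments; return (known options, leftover arguments).

    The leftover list stands in for whatever a wrapper would forward
    to a downstream tool.
    """
    parser = argparse.ArgumentParser(prog="wrapper")
    parser.add_argument("--url")
    if scroll_takes_value:
        # Optional value: a bare "--scroll" yields 0, "--scroll 5" yields 5.
        parser.add_argument("--scroll", nargs="?", const=0, type=int)
    else:
        # Boolean flag: a following "5" is not consumed by --scroll.
        parser.add_argument("--scroll", action="store_true")
    return parser.parse_known_args(argv)


argv = ["--url", "https://example.org", "--scroll", "5"]

flag_known, flag_leftover = split_args(argv, scroll_takes_value=False)
print(flag_known.scroll, flag_leftover)    # the "5" is left over

value_known, value_leftover = split_args(argv, scroll_takes_value=True)
print(value_known.scroll, value_leftover)  # the "5" is consumed as the value
```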
zimit is not using the latest version of the crawler. Maybe this option changed? We're waiting for a new pylibzim release to update zimit to the latest crawler and replayer.
This issue has been automatically marked as stale because it has not had recent activity. It will now be reviewed manually. Thank you for your contributions.
The `--scroll` option has been removed.
So I guess we need to test again whether the issue persists.