zimit Assets inside `<noscript>` tags are not fetched when running a zimit scrape


Open Jaifroid opened this issue 2 years ago • 9 comments

I was testing the Docker image of openzim/zimit on a Wikivoyage page (one that happens to be missing from our Wikivoyage ZIM, but that is immaterial). Due to the image lazy-loading script employed by Wikimedia web sites, all images other than the first one on a page are enclosed in <noscript>...</noscript> tags. Furthermore, there is a placeholder <span> tag used to reserve space for the image.

The Zimit scraper implemented here does not scrape assets enclosed in the <noscript> tags. And if the lazy load script runs inside the ZIM, it attempts to load the image from the online server instead of from the ZIM (but the asset is simply not in the ZIM in any case).
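To make the pattern concrete, here is a minimal sketch of how one might list the images that exist only inside <noscript> blocks on a page. It uses only Python's standard library; the sample markup, the `NoscriptImageFinder` class and the `find_images` helper are hypothetical and are not part of zimit or its crawler.

```python
from html.parser import HTMLParser


class NoscriptImageFinder(HTMLParser):
    """Collect <img> sources, separating those nested inside <noscript>."""

    def __init__(self):
        super().__init__()
        self.noscript_depth = 0
        self.visible_images = []   # <img> tags outside any <noscript>
        self.noscript_images = []  # <img> tags nested inside <noscript>

    def handle_starttag(self, tag, attrs):
        if tag == "noscript":
            self.noscript_depth += 1
        elif tag == "img":
            src = dict(attrs).get("src")
            if src:
                if self.noscript_depth:
                    self.noscript_images.append(src)
                else:
                    self.visible_images.append(src)

    def handle_endtag(self, tag):
        if tag == "noscript" and self.noscript_depth:
            self.noscript_depth -= 1


# Hypothetical markup imitating the lazy-loading pattern described above:
# a fallback <img> inside <noscript>, plus a placeholder <span>.
SAMPLE_HTML = """
<img src="/images/first.jpg">
<noscript><img src="/images/second.jpg"></noscript>
<span class="placeholder" data-src="/images/second.jpg"></span>
"""


def find_images(html):
    """Return (images outside <noscript>, images inside <noscript>)."""
    finder = NoscriptImageFinder()
    finder.feed(html)
    finder.close()
    return finder.visible_images, finder.noscript_images


visible, hidden = find_images(SAMPLE_HTML)
print("plain <img> tags:", visible)
print("only inside <noscript>:", hidden)
```

Comparing the second list against the entries of the resulting ZIM would show which assets the scrape missed.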

I think it's immaterial that we have proper ZIMs for Wikimedia sites, as many other sites use these lazy loading techniques, and they seem to defeat the scraper used by Zimit.

Limitation: I ran the scrape locally with a limit of 100 pages, but I don't think that affects the pulling of assets from a particular page.

May 07 '22 15:05 Jaifroid

On further testing, I'm not sure this is 100% true. Some of these images appear to have been scraped, but their src as given in the ZIM does not appear to correspond to the file's title (after transforming absolute references to ZIM URLs). This needs further investigation.

May 07 '22 16:05 Jaifroid

I've determined that this issue is still valid. The scraped ZIM contains files with very similar names, but these are scraped from the Wikimedia Commons File entry for the image (since each image in Wikimedia ZIMs is hyperlinked to a File:[name] entry on Wikimedia Commons), and they are different sizes of the same image. The images that are inside <noscript> blocks are not themselves scraped UNLESS the same image size happens to be listed on the Wikimedia Commons File:[name] page.
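A rough way to check for this kind of near-match, assuming the thumbnails follow the usual Wikimedia naming scheme in which the last path segment is the original filename prefixed with a pixel width (for example `320px-Name.jpg`). The URLs and the helper names `thumbnail_key` and `same_image_different_size` are hypothetical, for illustration only.

```python
import re
from urllib.parse import unquote, urlsplit

# Assumption: thumbnail filenames start with "<width>px-".
_WIDTH_PREFIX = re.compile(r"^\d+px-")


def thumbnail_key(url):
    """Return the last path segment, percent-decoded, without the width prefix."""
    name = unquote(urlsplit(url).path.rsplit("/", 1)[-1])
    return _WIDTH_PREFIX.sub("", name)


def same_image_different_size(url_a, url_b):
    """True when two distinct URLs appear to be different sizes of one image."""
    return url_a != url_b and thumbnail_key(url_a) == thumbnail_key(url_b)


# Hypothetical URLs: one as referenced by the page, one as found in the ZIM.
in_page = "https://upload.example.org/thumb/a/ab/Kings_College.jpg/320px-Kings_College.jpg"
in_zim = "https://upload.example.org/thumb/a/ab/Kings_College.jpg/640px-Kings_College.jpg"

print(thumbnail_key(in_page))
print(same_image_different_size(in_page, in_zim))
```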

May 07 '22 17:05 Jaifroid

A further complication arises with the way filenames are encoded, but that is a separate issue.

May 07 '22 17:05 Jaifroid

@Jaifroid Do you have an exact command to run as example?

May 07 '22 20:05 kelson42

Yes, I used the following (output path needs tweaking for user's context):

docker run -v C:\Users\jaifroid\Source\Repos\zimit\:/output -w /output --cap-add=SYS_ADMIN --cap-add=NET_ADMIN --shm-size=1gb openzim/zimit zimit --url https://en.m.wikivoyage.org/wiki/Cambridge --name myzimfile --workers 2 --limit 100 --scroll

Only the first two main images of the landing page are in the ZIM. The ZIM contains closely related (but inexactly named) image files for (some of) the missing images because it has scraped the corresponding Wikimedia Commons 'File:...' page for each image.

I think the zimit --scroll is designed to overcome precisely this issue, but it doesn't seem to work. The documentation on its use is ambiguous. Running zimit --help we get:

--scroll If set, will autoscroll to bottom of the page

But the readme on GitHub says:

--scroll [N] - if set, will activate a simple auto-scroll behavior on each page to scroll for upto N seconds

Unfortunately, passing a number as N (either as --scroll 5 or as --scroll [5], though I presume the brackets just mean the value is optional) results in a warc2zim error FileNotFoundError: [Errno 2] No such file or directory: '5'. Clearly zimit is erroneously passing the optional value to warc2zim, which doesn't know what to do with it.
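The following sketch shows one way such an error could arise. It assumes, without confirmation from zimit's source, that the wrapper declares `--scroll` as a bare flag and forwards every argument it does not recognize to the downstream tool. Both parsers here are hypothetical stand-ins built with Python's `argparse`.

```python
import argparse


def build_flag_parser():
    """Hypothetical wrapper parser: --scroll is a bare on/off flag."""
    parser = argparse.ArgumentParser(add_help=False)
    parser.add_argument("--url")
    parser.add_argument("--scroll", action="store_true")
    return parser


def build_optional_value_parser():
    """Alternative: --scroll accepts an optional integer, as in '--scroll [N]'."""
    parser = argparse.ArgumentParser(add_help=False)
    parser.add_argument("--url")
    # const is used when --scroll is given without a value; 0 is an arbitrary choice.
    parser.add_argument("--scroll", nargs="?", type=int, const=0, default=None)
    return parser


argv = ["--url", "https://example.org", "--scroll", "5"]

# With the bare flag, "5" is not consumed and ends up among the leftovers,
# which a wrapper might then pass on to another program as a positional path.
flag_ns, leftover = build_flag_parser().parse_known_args(argv)
print(flag_ns.scroll, leftover)

# With an optional value, "5" is consumed by --scroll and nothing leaks.
value_ns, leftover_value = build_optional_value_parser().parse_known_args(argv)
print(value_ns.scroll, leftover_value)
```

If the real wrapper behaves like the first parser, the stray `5` would reach the downstream tool as a filename, which is consistent with the error message above.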

May 08 '22 07:05 Jaifroid

zimit is not using the latest version of the crawler. Maybe this option changed? We're waiting for a new pylibzim release to update zimit to the latest crawler and replayer.

May 08 '22 11:05 rgaudin

This issue has been automatically marked as stale because it has not had recent activity. It will now be reviewed manually. Thank you for your contributions.

Jul 10 '22 14:07 stale[bot]

--scroll option has been removed.

Aug 08 '22 07:08 rgaudin

So I guess we need to test again whether the issue persists.

Aug 08 '22 08:08 Jaifroid
