Video on kiwix.org homepage is not retrieved

Open benoit74 opened this issue 2 years ago • 7 comments

Zimit version: 1.6.2 (not yet released, just to have the fix for --depth 0 + crawler 0.12.2)

When creating a ZIM of https://kiwix.org, the YouTube video on the home page is not included in the ZIM.

How to reproduce:

zimit --url="https://kiwix.org/fr/" --depth 0 --keep --name kiwix_org 

Activating all behaviors does not help:

zimit --url="https://kiwix.org/fr/" --depth 0 --keep --name kiwix_org --behaviors autoscroll,autoplay,autofetch,siteSpecific 

I had a look at the WARC contents and no request to YouTube was made.

Running only the crawler with the official 0.12.2 image does not help (the YouTube video is still not in the WARC):

crawl --depth 0 --url https://kiwix.org/fr/ --cwd /output/.tmph919m5n3

I'm going to open an upstream ticket

benoit74 avatar Nov 15 '23 14:11 benoit74

Is the scope correctly set? Because that video is from YouTube rather than kiwix.org, navigation to it might be blocked.

Jaifroid avatar Nov 22 '23 11:11 Jaifroid

Might be because this is not a regular <video /> embed but an <iframe />. I believe browsertrix considers those resources (and thus not subject to scoping) but it's worth checking if there's no request to YT.

rgaudin avatar Nov 22 '23 11:11 rgaudin

I tried many scopes, including a custom one with both youtube.com and kiwix.org domains included. The <iframe /> might be the issue, you are right. Or the fact that one has to click on the button to make the iframe appear and load it into the DOM (before that, the video URL is only in the data-video attribute of an img). How do you want to check whether there is a request to YT? I already checked in the WARCs and there is no request to YT.

benoit74 avatar Nov 23 '23 07:11 benoit74

I didn't realize a click was needed to create the iframe on DOM. That's definitely the issue. This is not standard YT behavior and certainly not handled in browsertrix. We need to emulate that click…

As for network requests, all requests go through pywb (set as proxy). Maybe there's a flag/env for pywb to print requests?

If that's useful enough for debugging, we could also imagine embedding a script in zimit that just conditionally prints/records requests and forwards them to pywb. We'd set it as the proxy.

rgaudin avatar Nov 23 '23 07:11 rgaudin

> We need to emulate that click…

How do you do that? With a custom behavior?

> Maybe there's a flag/env for pywb to print requests?

How would that be different from WARCs content? It is already quite straightforward to display all requests stored in WARC files, so if there is no difference I would rather add a warc2zim flag to display all requests found in WARCs while processing them.

benoit74 avatar Nov 23 '23 08:11 benoit74

> How do you do that? With a custom behavior?

I don't know.

rgaudin avatar Nov 23 '23 08:11 rgaudin

> > How do you do that? With a custom behavior?
>
> I don't know.

Yeah this looks like a pretty classic case for a custom behavior! We have a new Tutorial on how to create them, it'd be great to see if it's useful and get any feedback on it :) https://github.com/webrecorder/browsertrix-behaviors/blob/main/docs/TUTORIAL.md

tw4l avatar Nov 29 '23 16:11 tw4l

This is not a scraper issue, so I'm closing this. We have to develop the custom behavior if we really want the video to make it into the ZIM; that's a "content team" issue then.

benoit74 avatar May 28 '24 12:05 benoit74

I'm a bit puzzled, what is "special" about the kiwix.org website? Standard CMS + standard video platform!

kelson42 avatar May 28 '24 14:05 kelson42

I reopen the issue just to be sure I get it right.

kelson42 avatar May 28 '24 14:05 kelson42

It's not standard, the video is not on the page, it's an iframe that is injected in a popup that is displayed upon click on a button

rgaudin avatar May 28 '24 14:05 rgaudin

Hence the need for a custom behavior to simulate the user click on the button.

Shall we close this again? (There is nothing to do on the scraper side. It is just a customization needed for this particular site, which can be done without modifying the scraper at all; custom behaviors are just a JS file that can be injected on the CLI.)
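
For reference, here is a rough sketch of what such a custom behavior could look like. It is untested against the real site and makes several assumptions: the `[data-video]` selector (based only on the data-video attribute mentioned earlier in this thread), the idea that clicking that element triggers the popup, the 5-second wait, and the class and helper names, which are all hypothetical. The class shape (`static id`, `isMatch`, `init`, `async* run`) follows my reading of the browsertrix-behaviors tutorial linked above and should be checked against it, including what `run` is expected to yield.

```javascript
// Assumed selector: elements that carry the video URL in a data-video attribute.
const PLACEHOLDER_SELECTOR = "[data-video]";

// Click every placeholder found under `root` and return the video URLs seen.
// `root` only needs a querySelectorAll() method, so the helper can be
// exercised with a fake object outside a browser.
function clickVideoPlaceholders(root) {
  const urls = [];
  for (const el of root.querySelectorAll(PLACEHOLDER_SELECTOR)) {
    urls.push(el.getAttribute("data-video"));
    // Let the page's own click handler inject the <iframe> into the DOM.
    el.click();
  }
  return urls;
}

// Hypothetical behavior class; name and id are made up for this sketch.
class ClickToLoadVideoBehavior {
  static id = "ClickToLoadVideo";

  // Only run on the site that needs the extra click.
  static isMatch() {
    return window.location.hostname === "kiwix.org";
  }

  static init() {
    return {};
  }

  async* run(ctx) {
    const urls = clickVideoPlaceholders(document);
    // Give the injected iframe some time to start loading (delay is a guess).
    await new Promise((resolve) => setTimeout(resolve, 5000));
    yield { msg: `Clicked ${urls.length} video placeholder(s)`, urls };
  }
}
```

The helper is kept separate from the class so the click logic can be tried with a stub before running a full crawl. How the file is passed to the crawler depends on the zimit / browsertrix-crawler version in use, so check the CLI help for the exact option.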

benoit74 avatar May 28 '24 14:05 benoit74

Actually it's still closed!

kelson42 avatar May 28 '24 14:05 kelson42