zimit How to derive the `dirEntry.url` from video links or IDs?

Open Jaifroid opened this issue 2 years ago • 13 comments

I'm sorry for an enquiry that sounds non-generic, but I think the answer would in fact be generic to the Zimit format (which unfortunately doesn't have a documented specification)... @rgaudin , I think you might be able to help here, or at least set me in the right direction.

As you know, I have built experimental support for Zimit ZIMs into Kiwix JS Windows/Linux. For the most part this works really well, particularly so with recent Zimit-created ZIMs. It is done on the basis of transforming URLs, which can mostly be transformed predictably from absolute, semi-absolute, or relative URLs to an asset in the ZIM. I realize this isn't how Zimit is supposed to work, but it is not possible to run the Replay engine given its dependency on Service Workers.

However, one thing is stumping me. In the www.ready.gov_en_2022-05.zim, I can get almost every asset in the ZIM from a hyperlinked URL except MP4 videos. These seem to bear no relationship to the hyperlink that supposedly links to them (mostly a youtube hyperlink). Below is one (crazy) URL of a video asset in the ZIM (this is from the dirEntry.url field). These assets can be accessed from the search bar, by searching with regex syntax .*mp4. The assets are in the ZIM, and can be played by clicking on the title entry link, but are only accessible with this workaround. The user would have no idea what this URL refers to, as there is no title field. All the MP4 URLs look like this (this is just one URL, percent-encoded):

rr1---sn-4g5ednse.googlevideo.com/videoplayback?expire=1651677341&ei=PURyYqCPMJThgAeKwZeICw&ip=147.229.8.218&id=o-APh5EtBL0NTjfcHb0m-4-PGOh2tuiZ5hjHonmnOoGPne&itag=18&source=youtube&requiressl=yes&spc=4ocVC-hXiBIt3-K1mf4kjPFso1Qi&vprv=1&mime=video%2Fmp4&ns=q7pH103rpw6EX1oEKuRqWq0G&cnr=14&ratebypass=yes&dur=116.540&lmt=1636015825107598&fexp=24001373,24007246&c=WEB_EMBEDDED_PLAYER&txp=6218224&n=hgs_eQWhXl-o8w&sparams=expire%2Cei%2Cip%2Cid%2Citag%2Csource%2Crequiressl%2Cspc%2Cvprv%2Cmime%2Cns%2Ccnr%2Cratebypass%2Cdur%2Clmt&sig=AOq0QJ8wRQIhAK7i67uMNCFp5hgbbkaj7pUHB_rAWPVVecgxVOtGZgD-AiAzqfYFjUbcNI-76WK3QHMKFy0xKaF4HHlSIQceHcjjhw%3D%3D&cpn=0aiH2ERXk8nszRIg&cver=1.20220501.00.00&ptk=youtube_none&pltype=contentugc&cm2rm=sn-cgovpm-cg9l7e,sn-2gbed7d&req_id=1db024a72602a3ee&redirect_counter=2&cms_redirect=yes&cmsv=e&mh=lj&mm=34&mn=sn-4g5ednse&ms=ltu&mt=1651655556&mv=m&mvi=1&pl=17&lsparams=mh,mm,mn,ms,mv,mvi,pl&lsig=AG3C_xAwRAIgRho-GTBNwwomCIzyzH3fczTRllNBk_ns4MwRfS4vis8CIFnb9RNAfGgkjLzhfB0Fy-WGCto7ww6QR1cZRkYE7Jxy

So, finally, to the question: how do I derive this from the simple YouTube video hyperlinks in the ZIM? I understand this has something to do with the Headers stored under namespace H, but with no specification, it is very hard to reverse engineer this... Any guidance would be greatly appreciated!

May 10 '22 19:05 Jaifroid

@Jaifroid, I think it's important at this stage that we lay out the different pieces at play:

browsertrix-crawler is a crawler that creates WARC archives and is maintained by browsertrix
warc2zim is a tool that produces ZIM files from a collection of WARC files.
zimit is just a docker shell to launch a crawl followed by a warc2zim call in a single run.

There is thus no zimit format. There is a WARC format that is clearly documented and then there's the warc2zim format that indeed is not documented but the source code is small and understandable.

""" warc2zim conversion utility
This utility provides a conversion from WARC records to ZIM files.
The WARCs are converted in a 'lossless' way, no data from WARC records is lost.
Each WARC record results in two ZIM items:
- The WARC payload is stored under /A/<url>
- The WARC headers + HTTP headers are stored under the /H/<url>
Given a WARC response record for 'https://example.com/',
two ZIM items are created /A/example.com/ and /H/example.com/ are created.
Only WARC response and resource records are stored.
If the WARC contains multiple entries for the same URL, only the first entry is added,
and later entries are ignored. A warning is printed as well.
"""

In addition, a WARC replayer from webrecorder is bundled. AFAIK, there is only one replayer, with Kiwix-specific switches.

Now, obviously, this format is in conflict with ZIM promises as we know them. It's a hack to store a WARC archiving-replaying system into a ZIM file.
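
The A/ and H/ layout described in the docstring quoted above can be sketched as follows. This is only an illustration, assuming that the ZIM path is the target URL with its scheme removed; warc2zim may normalize URLs further, and `zim_paths` is a hypothetical helper name, not part of warc2zim.

```python
from urllib.parse import urlsplit


def zim_paths(url: str) -> tuple[str, str]:
    """Return the (payload, headers) ZIM paths for a WARC target URL.

    Assumption: only the scheme is dropped; host, path and query are kept.
    """
    parts = urlsplit(url)
    rest = parts.netloc + parts.path
    if parts.query:
        rest += "?" + parts.query
    return "A/" + rest, "H/" + rest


payload_path, headers_path = zim_paths("https://example.com/")
print(payload_path, headers_path)
```

For the docstring's own example, 'https://example.com/', this gives A/example.com/ for the payload and H/example.com/ for the headers.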

Regarding replaying and URLs, I believe all of your questions can be answered in either wabac.js or replayweb.page. I have zero knowledge of how the replaying works though.

I'm closing this but please reopen should you have additional concerns.

May 11 '22 16:05 rgaudin

@rgaudin Thank you very much, this is incredibly helpful. I somehow hadn't seen that (clear) information about how the information is split between H and A namespaces. I was aware (from comments) that "WARC" and "http" headers are stored in H namespace, but could not find the clear explanation above (due to looking in the wrong place). Thank you also for pointing out where the source code is that I could use to understand the mechanism. It's the missing piece of the puzzle. Please note, none of what I wrote was intended as criticism, it was just setting out the problem and asking for help (info) to resolve it.

May 11 '22 17:05 Jaifroid

@rgaudin Would it be worth opening this issue (with a different title) on the warc2zim repo? The underlying issue (the impossibility of associating YouTube videos that are in the ZIM with the corresponding HTML links) seems very much a live issue to me.

May 31 '22 11:05 Jaifroid

I don't think we can do anything at warc2zim level about that. YT uses crazy links because there is no direct link to the MP4; playback goes through its JS player. The replayer has many tricks to make it work. One of them relies on fuzzy matching against extra entries that are built using https://github.com/openzim/warc2zim/blob/master/src/warc2zim/main.py#L82

IMO, it is completely normal (from Google perspective) that you can't match those files with an entry.

May 31 '22 12:05 rgaudin

@rgaudin But it's not working in Kiwix Serve or in the Android app currently, at least in the latest ZIM you sent or in the ready.gov ZIM, which has a page of video resources and links. So the Replay tricks are not working either. I have looked at the fuzzy entries in the H/ namespace, but I can't see any way of associating them either. But I'll look at the source code in case I've missed something.

I'll give you an example of the problem. This is not just a problem for me, but for the other apps too.

On the Ready Gov "Preparedness Videos" page, there is a link to this video on preparing for disaster for people with disabilities:

A/youtu.be/ZLLMDOScE4g

This is not found in the ZIM. There is no entry for anything in the A or H namespaces that has the only identifying string here (ZLLMDOScE4g) - I have an algorithm in Kiwix JS Windows for matching any url or title in any namespace against a regular expression like H/.*ZLLMDOScE4g.
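
A rough sketch of that kind of regex lookup, written in Python for illustration (the real implementation lives in Kiwix JS and is not shown here). The `ENTRY_URLS` list is a hypothetical, heavily abbreviated stand-in for the ZIM's URL index, and `find_entries` is an invented helper name.

```python
import re

# Hypothetical, abbreviated stand-in for the ZIM's URL index.
ENTRY_URLS = [
    "A/www.ready.gov/videos",
    "H/youtube.fuzzy.replayweb.page/videoplayback"
    "?id=o-APh5EtBL0NTjfcHb0m-4-PGOh2tuiZ5hjHonmnOoGPne&itag=18",
    "A/rr1---sn-4g5ednse.googlevideo.com/videoplayback"
    "?id=o-APh5EtBL0NTjfcHb0m-4-PGOh2tuiZ5hjHonmnOoGPne&itag=18&mime=video%2Fmp4",
]


def find_entries(urls, pattern):
    """Return every URL in which the regular expression matches somewhere."""
    rx = re.compile(pattern)
    return [url for url in urls if rx.search(url)]


# The YouTube page ID from the hyperlink matches nothing...
print(find_entries(ENTRY_URLS, r"H/.*ZLLMDOScE4g"))
# ...while a blunt search for "mp4" does find the video asset.
print(find_entries(ENTRY_URLS, r".*mp4"))
```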

However, the video exists in the ZIM. By trial and error I found it, and it has the following fuzzy entry in the H/ namespace:

WARC/1.0
WARC-Type: revisit
WARC-Record-ID: 
WARC-Target-URI: https://youtube.fuzzy.replayweb.page/videoplayback?id=o-APh5EtBL0NTjfcHb0m-4-PGOh2tuiZ5hjHonmnOoGPne&itag=18
WARC-Date: 2022-05-04T10:32:22Z
WARC-Profile: http://netpreserve.org/warc/1.0/revisit/identical-payload-digest
WARC-Refers-To-Target-URI: https://rr1---sn-cgovpm-cg9l.googlevideo.com/videoplayback?expire=1651677341&ei=PURyYqCPMJThgAeKwZeICw&ip=147.229.8.218&id=o-APh5EtBL0NTjfcHb0m-4-PGOh2tuiZ5hjHonmnOoGPne&itag=18&source=youtube&requiressl=yes&mh=lj&mm=31%2C29&mn=sn-cgovpm-cg9l%2Csn-2gb7sn7k&ms=au%2Crdu&mv=m&mvi=1&pl=17&initcwndbps=616250&spc=4ocVC-hXiBIt3-K1mf4kjPFso1Qi&vprv=1&mime=video%2Fmp4&ns=q7pH103rpw6EX1oEKuRqWq0G&cnr=14&ratebypass=yes&dur=116.540&lmt=1636015825107598&mt=1651655572&fvip=6&fexp=24001373%2C24007246&c=WEB_EMBEDDED_PLAYER&txp=6218224&n=hgs_eQWhXl-o8w&sparams=expire%2Cei%2Cip%2Cid%2Citag%2Csource%2Crequiressl%2Cspc%2Cvprv%2Cmime%2Cns%2Ccnr%2Cratebypass%2Cdur%2Clmt&sig=AOq0QJ8wRQIhAK7i67uMNCFp5hgbbkaj7pUHB_rAWPVVecgxVOtGZgD-AiAzqfYFjUbcNI-76WK3QHMKFy0xKaF4HHlSIQceHcjjhw%3D%3D&lsparams=mh%2Cmm%2Cmn%2Cms%2Cmv%2Cmvi%2Cpl%2Cinitcwndbps&lsig=AG3C_xAwRQIgEGxdVyWHEc7_FLh84g4qfC4z5ryJ7GgtMdgi4kytBu4CIQCBnMBzfRdmlxes9FNOjC1kOlZzFU_gXFFjbJswoJ2bXA%3D%3D&cpn=0aiH2ERXk8nszRIg&cver=1.20220501.00.00&ptk=youtube_none&pltype=contentugc
WARC-Refers-To-Date: 2022-05-04T10:32:22.414739
WARC-Payload-Digest: 3I42H3S6NNFQ2MSVX7XZKYAYSCX5QBYJ

302 Redirect
Location: https://rr1---sn-cgovpm-cg9l.googlevideo.com/videoplayback?expire=1651677341&ei=PURyYqCPMJThgAeKwZeICw&ip=147.229.8.218&id=o-APh5EtBL0NTjfcHb0m-4-PGOh2tuiZ5hjHonmnOoGPne&itag=18&source=youtube&requiressl=yes&mh=lj&mm=31%2C29&mn=sn-cgovpm-cg9l%2Csn-2gb7sn7k&ms=au%2Crdu&mv=m&mvi=1&pl=17&initcwndbps=616250&spc=4ocVC-hXiBIt3-K1mf4kjPFso1Qi&vprv=1&mime=video%2Fmp4&ns=q7pH103rpw6EX1oEKuRqWq0G&cnr=14&ratebypass=yes&dur=116.540&lmt=1636015825107598&mt=1651655572&fvip=6&fexp=24001373%2C24007246&c=WEB_EMBEDDED_PLAYER&txp=6218224&n=hgs_eQWhXl-o8w&sparams=expire%2Cei%2Cip%2Cid%2Citag%2Csource%2Crequiressl%2Cspc%2Cvprv%2Cmime%2Cns%2Ccnr%2Cratebypass%2Cdur%2Clmt&sig=AOq0QJ8wRQIhAK7i67uMNCFp5hgbbkaj7pUHB_rAWPVVecgxVOtGZgD-AiAzqfYFjUbcNI-76WK3QHMKFy0xKaF4HHlSIQceHcjjhw%3D%3D&lsparams=mh%2Cmm%2Cmn%2Cms%2Cmv%2Cmvi%2Cpl%2Cinitcwndbps&lsig=AG3C_xAwRQIgEGxdVyWHEc7_FLh84g4qfC4z5ryJ7GgtMdgi4kytBu4CIQCBnMBzfRdmlxes9FNOjC1kOlZzFU_gXFFjbJswoJ2bXA%3D%3D&cpn=0aiH2ERXk8nszRIg&cver=1.20220501.00.00&ptk=youtube_none&pltype=contentugc

Although this gives us the URL of the asset, you can see for yourself (use Ctrl-F on this page!) that there is absolutely nothing that matches ZLLMDOScE4g either in this entry or in the URL of this entry, which is H/youtube.fuzzy.replayweb.page/videoplayback?id=o-APh5EtBL0NTjfcHb0m-4-PGOh2tuiZ5hjHonmnOoGPne&itag=18.
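
To make the comparison concrete, here is a minimal sketch that parses such a header entry and pulls out the fields the two URLs have in common. It assumes the H/ entry is plain text with one `Name: value` pair per line, as in the dump above; `SAMPLE` is an abbreviated copy of that dump (most query parameters removed), and both function names are invented for this example.

```python
from urllib.parse import parse_qs, urlsplit

# Abbreviated copy of the header entry shown above.
SAMPLE = """WARC/1.0
WARC-Type: revisit
WARC-Target-URI: https://youtube.fuzzy.replayweb.page/videoplayback?id=o-APh5EtBL0NTjfcHb0m-4-PGOh2tuiZ5hjHonmnOoGPne&itag=18
WARC-Refers-To-Target-URI: https://rr1---sn-cgovpm-cg9l.googlevideo.com/videoplayback?expire=1651677341&id=o-APh5EtBL0NTjfcHb0m-4-PGOh2tuiZ5hjHonmnOoGPne&itag=18
"""


def parse_warc_headers(text):
    """Collect the 'WARC-*: value' lines of a header entry into a dict."""
    headers = {}
    for line in text.splitlines():
        name, sep, value = line.partition(": ")
        if sep and name.startswith("WARC-"):
            headers[name] = value.strip()
    return headers


def video_keys(uri):
    """Extract the id and itag query fields, when present."""
    query = parse_qs(urlsplit(uri).query)
    return {k: query[k][0] for k in ("id", "itag") if k in query}


headers = parse_warc_headers(SAMPLE)
print(video_keys(headers["WARC-Target-URI"]))
print(video_keys(headers["WARC-Refers-To-Target-URI"]))
```

Both calls return the same `id` and `itag`, which is what ties the fuzzy entry to the real googlevideo.com URL; neither value is the YouTube page ID ZLLMDOScE4g.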

Clearly something crucial has been left out in the URL when storing it in the ZIM. Is there any way to tell whether warc2zim or an upstream issue is the problem here?

May 31 '22 12:05 Jaifroid

https://youtu.be/ZLLMDOScE4g is not a resource; it's a webpage. It should not point to an MP4 resource. You are saying there is a link. What kind of link is that? Which page is it on? I can't say for sure, but I can imagine that, depending on how the video is embedded, there can be a piece of JS that converts the human ID URL to another one.

May 31 '22 15:05 rgaudin

@rgaudin See this page:

https://library.kiwix.org/www.ready.gov_en_2022-05/A/www.ready.gov/videos

There are loads of links to videos (OK, to video webpages!) there. There is also one embedded video that doesn't play. All that the Replay system offers to do is to open the linked videos as external links. However, all the videos are included in the ZIM. I can extract them and play them no problem from title search (simple search for .*mp4 finds them), but I don't know how to get from these video webpage links to the asset that is in the ZIM. Nothing in the URL of the asset in the ZIM tells us what it is. There is no human-readable title either. The URL is a mass of code, and none of it matches the video ID we have from the link.

At the time the page was scraped, the videos were obviously extracted and stored, but the information needed to find the asset was not recorded (seemingly), not even a way to find the fuzzy URL (in H/ namespace) that does point to the actual URL of the video.

I think that something very simple has occurred. A fuzzy URL was created as a bridging link/redirect, but its ID was not then added to the video/page URL that was stored in the ZIM (a simple querystring would have been enough). It's like one crucial bit of information got stripped from the URL, or was never added to it. It might be quite an easy fix, since everything else seems to be in place for videos to work.

Anyway, I hope this information is helpful.

May 31 '22 16:05 Jaifroid

However, all the videos are included in the ZIM. I can extract them and play them no problem from title search (simple search for .*mp4 finds them)

That looks like a bug. Those are just links, so the videos should not be included in the ZIM. Unless those are embedded elsewhere in the website (which I doubt!), I suppose the crawler incorrectly thought those links were embeds and applied the youtube-specific scraping rules.

It's like one crucial bit of information got stripped from the URL, or was never added to it.

I am pretty sure it wasn't stripped out, but it wasn't added either, because it's information from a different source (probably available to the crawler at that time) that was not needed by the replay system to work.

Anyway, I would advise not to use this page for your work as you are obviously relying on a side effect of a bug in the crawler (which might have been fixed in a following release) as those videos should not be present: those are external links.

Jun 01 '22 09:06 rgaudin

OK, thanks for the explanation. I guess what happens here is that the scraper has youtube as a whitelisted domain for the purposes of getting videos, so when it sees links to youtube video (pages) it follows them and gets the video.

It's kind of arguable if that's a bug or a feature. Clearly the implementation is buggy, or is a side-effect. But many organizations now use YouTube to host videos for their Web sites -- videos that are in fact content intended for users of their Web sites. Here it would actually make a lot of sense for those videos to be available. If you look at the titles, you will see that they are integral to the content of the ready.gov Web Site, and are all videos produced by them.

Should the scraper actually differentiate between a video that is included in a player on a page, and a video that is a link, given that both are legitimate ways of directing the user to video content? They are not fundamentally different things (an embedded video is actually a link to external content that is played in a div or a frame), and there may be accessibility reasons why a link is preferred to an embedded video for some sites.

Just some food for thought!

Jun 01 '22 10:06 Jaifroid

This issue has been automatically marked as stale because it has not had recent activity. It will now be reviewed manually. Thank you for your contributions.

Aug 13 '22 10:08 stale[bot]

@mgautierfr wrote in https://github.com/openzim/warc2zim/issues/99: The problem is no one knows (or wants to explain) what the 20% situation is 😉

@mgautierfr Your comment on the Pareto principle was about when the H-prefixed Headers need to be used in a Zimit archive. One 20% situation concerns video and POST requests (which is why I'm posting this on this video-related issue, as a way of documenting this situation). I ran into this while trying to fix YouTube video rendering for Zimit archives in Kiwix JS Windows/Linux:

As you can see, the embedded JavaScript video player (base.js) tries to send POST requests to a YouTube URL, which here has been translated to a ZIM URL, so the Service Worker can intercept it.

But of course a POST request won't work against a non-dynamic, non-media server, so what has to happen under the hood is that the Service Worker must intercept this request, check the H-header entry for this POST request, and return the correct Response from the ZIM. Furthermore, because we can't control the exact format of the request emitted by a complex JavaScript file, there is a concept of "fuzzy matching" for some of these requests.

It turns out @ikreymer documented precisely this situation here: https://github.com/openzim/warc2zim/issues/80.

Sep 09 '22 17:09 Jaifroid

By way of documentation, it is worth explaining what is required to derive the URL of the video BLOB from the POST requests that the embedded video player makes.

Unfortunately, the embedded video player emits a POST request which does not correspond to any URL in the ZIM. The only distinguishing information in the emitted URL is (in the case of YouTube videos) a key field in the querystring, e.g.:

C/A/www.youtube.com/youtubei/v1/player?key=AIzaSyAO_FJ2SlqU8Q4STEHLGCilw_Y9_11qcW8&prettyPrint=false

This URL is not in the ZIM, but if we look up the key, we find several very long URLs that contain it:

Unfortunately, each of these (many) URLs containing the key corresponds to a different video in the ZIM. To find out which one we want, we need to find a URL that also contains the video's videoId, as given in the direct YouTube fallback link in the HTML of the player, for example videoId=WZHIk0k4Gpg.

However, this still doesn't give us the video BLOB. To get that, we need to get another field from the URL. In the case of YouTube videos, this field is a cpn field, e.g. cpn=u7IObzfkqJ_Y4YmW. There is also an ei field that links the records, e.g. ei=OswYY8_mLqGA6dsP982YuAU. When we look for these data in the ZIM's URL index, finally we can identify the URL of the video BLOB, which we can return to the calling function:
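
The chain of lookups described above (key, then videoId, then cpn and ei) could be sketched like this. It is a rough illustration only: the index is modelled as a plain list of URLs, the entry layouts and the `rr1---sn-example` hostname in `INDEX` are invented and abbreviated, and I am assuming the BLOB entry is a `videoplayback` URL carrying the same `cpn` (and `ei`) values.

```python
from urllib.parse import parse_qs, urlsplit


def query_field(url, name):
    """Return the first value of a query-string field, or None if absent."""
    values = parse_qs(urlsplit(url).query).get(name)
    return values[0] if values else None


def find_video_blob(post_url, video_id, index):
    """Follow key -> videoId -> cpn/ei to the URL of the video BLOB."""
    key = query_field(post_url, "key")
    if key is None:
        return None
    for candidate in index:
        # Many entries share the key; only one also carries our videoId.
        if query_field(candidate, "key") != key:
            continue
        if query_field(candidate, "videoId") != video_id:
            continue
        cpn = query_field(candidate, "cpn")
        ei = query_field(candidate, "ei")
        if cpn is None:
            continue
        for url in index:
            if "videoplayback" not in url or query_field(url, "cpn") != cpn:
                continue
            # ei is treated as a secondary check when both records have it.
            if ei is None or query_field(url, "ei") in (None, ei):
                return url
    return None


POST_URL = ("C/A/www.youtube.com/youtubei/v1/player"
            "?key=AIzaSyAO_FJ2SlqU8Q4STEHLGCilw_Y9_11qcW8&prettyPrint=false")

# Hypothetical, abbreviated index entries (real ones are far longer).
INDEX = [
    "A/www.youtube.com/youtubei/v1/player?key=AIzaSyAO_FJ2SlqU8Q4STEHLGCilw_Y9_11qcW8"
    "&videoId=OTHERVIDEO&cpn=aaaaaaaaaaaaaaaa&ei=bbbbbbbb",
    "A/www.youtube.com/youtubei/v1/player?key=AIzaSyAO_FJ2SlqU8Q4STEHLGCilw_Y9_11qcW8"
    "&videoId=WZHIk0k4Gpg&cpn=u7IObzfkqJ_Y4YmW&ei=OswYY8_mLqGA6dsP982YuAU",
    "A/rr1---sn-example.googlevideo.com/videoplayback?itag=18&mime=video%2Fmp4"
    "&cpn=aaaaaaaaaaaaaaaa&ei=bbbbbbbb",
    "A/rr1---sn-example.googlevideo.com/videoplayback?itag=18&mime=video%2Fmp4"
    "&cpn=u7IObzfkqJ_Y4YmW&ei=OswYY8_mLqGA6dsP982YuAU",
]

print(find_video_blob(POST_URL, "WZHIk0k4Gpg", INDEX))
```

A linear scan like this is only for readability; a real reader would query the ZIM's URL index rather than hold every URL in memory.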

The "fuzzy" part seems to refer to partial string/URL matching, since we only have part of the URL we are looking for. This is automated in the Replay system by looping through the query string and scoring matches, presumably to make it generic, though in fact a lot of specific regexes are needed to deal with specific cases such as YouTube, Vimeo, the Washington Post, etc.: see https://github.com/webrecorder/wabac.js/blob/main/src/fuzzymatcher.js.
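
As a very rough illustration of "looping through the query string and scoring matches", here is a toy scorer. It is not the algorithm in fuzzymatcher.js: the scoring rule (one point per identical query field, and only when the last path segment agrees) is my own simplification, and the URLs in `CANDIDATES` are invented.

```python
from urllib.parse import parse_qsl, urlsplit


def score(requested, candidate):
    """Count the query fields that have identical values in both URLs."""
    req, cand = urlsplit(requested), urlsplit(candidate)
    # Only compare URLs that end in the same path segment (e.g. videoplayback).
    if req.path.rsplit("/", 1)[-1] != cand.path.rsplit("/", 1)[-1]:
        return 0
    req_params = dict(parse_qsl(req.query, keep_blank_values=True))
    cand_params = dict(parse_qsl(cand.query, keep_blank_values=True))
    return sum(1 for k, v in req_params.items() if cand_params.get(k) == v)


def best_match(requested, candidates):
    """Return the highest-scoring candidate, or None if nothing matches."""
    scored = [(score(requested, c), c) for c in candidates]
    scored = [(s, c) for s, c in scored if s > 0]
    if not scored:
        return None
    return max(scored, key=lambda pair: pair[0])[1]


# Invented candidates for illustration.
CANDIDATES = [
    "A/rr1.googlevideo.com/videoplayback?id=abc&itag=22&mime=video%2Fmp4",
    "A/rr2.googlevideo.com/videoplayback?id=abc&itag=18&mime=video%2Fmp4",
    "A/www.ready.gov/videos",
]

print(best_match("A/youtube.fuzzy.replayweb.page/videoplayback?id=abc&itag=18",
                 CANDIDATES))
```

Here the second candidate wins because it agrees on both `id` and `itag`; a generic rule like this quickly breaks down on real sites, which is presumably why the site-specific regexes exist.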

Some helper, or intermediate, URLs are stored in the ZIM in the form of H-prefixed Headers with fuzzy in the URL name. However, in practice these do not appear as helpful as I was hoping they would be, and they often do not contain any more information than is already contained in the A-prefixed URLs in the ZIM. The Headers can be used to cause the browser to redirect to a new URL in server / Service Worker contexts.

Sep 11 '22 13:09 Jaifroid

Very helpful; thank you!

Sep 11 '22 14:09 rgaudin
