Unable to Save Live Webpage: 404 Error & 'Could not retrieve target document' Error

Open hairycactus opened this issue 2 years ago • 6 comments

Monolith is unable to save the following webpage, i.e. it produces no output HTML file at all.

Sample URL: https://edition.cnn.com/interactive/2021/03/cnnix-steership

The webpage itself is accessible via a browser, & remains totally functional even when offline as long as the browser tab is open.

Commands Tried:

 monolith.exe "https://edition.cnn.com/interactive/2021/03/cnnix-steership" -o "Test.html"

OR: Using the -j option:

 monolith.exe -j "https://edition.cnn.com/interactive/2021/03/cnnix-steership" -o "Test.html"

OR: Using the -b option as suggested in this comment at Issue #19:

monolith.exe "https://edition.cnn.com/interactive/2021/03/cnnix-steership" -b "https://edition.cnn.com/interactive/2021/03/cnnix-steership" -o "Test.html"

All of the above permutations give this error:

ESC[31mhttps://edition.cnn.com/interactive/2021/03/cnnix-steership (404 Not Found)ESC[0m  
Could not retrieve target document

Affected: Monolith v2.6.1 (04 Jul 2021)
OS: Win 10 x64

Thanks in advance for any possible advice !

hairycactus, Oct 22 '21 10:10

Hi there!

curl "https://edition.cnn.com/interactive/2021/03/cnnix-steership" returns this result:

<html>
<head><title>302 Moved Temporarily</title></head>
<body>
<h1>302 Moved Temporarily</h1>
<ul>
<li>Code: Found</li>
<li>Message: Resource Found</li>
<li>RequestId: AVFJQ4CAJWD8D3JC</li>
<li>HostId: VYRQTrArU/TJQB+FM6MbFX6oMeB4m1igsvWjT+huhrlu5fX5EPFhKkseJuFaSH0Qa2+U5hcyO9M=</li>
</ul>
<hr/>
</body>
</html>

It looks like that resource is gone for good, but still can be found at: http://web.archive.org/web/20210331022955/https://edition.cnn.com/interactive/2021/03/cnnix-steership/

monolith -F -b http://web.archive.org/web/20210331022955/https://edition.cnn.com/interactive/2021/03/cnnix-steership/ http://web.archive.org/web/20210331022955/https://edition.cnn.com/interactive/2021/03/cnnix-steership/ -o cnnix-steership.html seems to be saving the page.

It has a bunch of dynamically loaded JS there, so -I is not going to work, hence the need for -b. Once you open the page, give the assets a couple of seconds to load (web.archive can be a bit slow), and don't hit "Start" until all of the JS is loaded. It's a bug in that web app's code: it lets the mini game start before all of the required JS is there.

https://github.com/rhysd/monolith-of-web could help saving that page to be truly offline, but hard to say 100%, depends how they load all those JS modules.

snshn, Oct 22 '21 19:10

> It looks like that resource is gone for good

Oops, I forgot to include the trailing slash at the end of the target URL.

https://edition.cnn.com/interactive/2021/03/cnnix-steership/
This is accessible in my web browser, & loads fully instantly. I don't really have to wait before clicking "Start" -- the next screen (the game proper) is fully functional in the browser, even when offline.
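Since the request without the trailing slash failed while the one with it succeeded, a small guard in a wrapper script could help avoid the mistake. The sketch below is only an illustration: `ensure_trailing_slash` is a hypothetical helper, not part of monolith, and it assumes that a URL whose last path segment has no dot, query string, or fragment refers to a directory-style page. That assumption will not hold for every site.

```shell
#!/bin/sh
# Hypothetical helper (not part of monolith): append a trailing slash to
# directory-style URLs. Assumption: a last path segment without a dot,
# query string, or fragment is a directory-style page.
ensure_trailing_slash() {
    url=$1
    case $url in
        */) printf '%s\n' "$url" ;;            # already ends with a slash
        *\?*|*\#*) printf '%s\n' "$url" ;;     # query or fragment: leave as-is
        *)
            last=${url##*/}                    # text after the final slash
            case $last in
                *.*) printf '%s\n' "$url" ;;   # has a dot (file or bare host): leave as-is
                *)   printf '%s/\n' "$url" ;;  # directory-style path: add the slash
            esac
            ;;
    esac
}

# Prints the URL from this issue with the slash added.
ensure_trailing_slash "https://edition.cnn.com/interactive/2021/03/cnnix-steership"
```

The result could then be passed to monolith as the target URL, and also as the -b value if that option is used.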

Using Monolith, I tried the -F & -b options as advised, & there is only 1 os error 11004 right at the end. The same error occurs as well when I add the -I option.

Console Output:

monolith.exe -F -b "https://edition.cnn.com/interactive/2021/03/cnnix-steership/" "https://edition.cnn.com/interactive/2021/03/cnnix-steership/" -o "Test.html"  
https://edition.cnn.com/interactive/2021/03/cnnix-steership/  
 https://ix.cnn.io/static/fonts/fonts.css  
 https://cdn.cnn.com/cnn/interactive/2021/03/cnnix-steership/css/mainv3.css?v=1.2  
  https://cdn.cnn.com/cnn/interactive/2021/03/cnnix-steership/images/mapv2.png  
  https://cdn.cnn.com/cnn/interactive/2021/03/cnnix-steership/images/mapv2.png (from cache)  
  https://cdn.cnn.com/cnn/interactive/2021/03/cnnix-steership/images/inset-map.jpg  
  https://cdn.cnn.com/cnn/interactive/2021/03/cnnix-steership/images/kbIcons.png  
  https://cdn.cnn.com/cnn/interactive/2021/03/cnnix-steership/images/arrow.png  
 https://z.cdn.turner.com/cnn/.element/ssi/www/misc/4.0/static/js/jquery.min.js  
 https://i.cdn.turner.com/ads/adfuel/ais/2.1/cnni-ais.js  
 https://i.cdn.turner.com/ads/adfuel/adfuel-2.1.51.min.js  
 https://cdn.cnn.com/cnn/interactive/2021/03/cnnix-steership/images/northcompass.png  
 https://cdn.cnn.com/cnn/interactive/2021/03/cnnix-steership/js/boatV3.js?v=1.2  
ESC[31mhttps://lightning.cnn.com/launch/7be62238e4c3/97fa00444124/launch-2878c87af5e3.min.js  
(error sending request for url (https://lightning.cnn.com/launch/7be62238e4c3/97fa00444124/launch-2878c87af5e3.min.js):  
error trying to connect: dns error: The requested name is valid, but no data of the requested type was found. (os error 11004))ESC[0m  
https://edition.cnn.com/favicon.ico
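The ESC[31m and ESC[0m markers in the console output are ANSI color codes (red text and reset) that show up as raw text when the terminal does not interpret them. If the log is captured to a file, a filter along these lines can remove them. This is a minimal sketch: `strip_ansi` is a hypothetical helper name, and it only handles color sequences of the form ESC [ ... m.

```shell
#!/bin/sh
# Hypothetical helper: remove ANSI color sequences (ESC [ ... m) from stdin.
strip_ansi() {
    esc=$(printf '\033')                 # the literal escape byte
    sed "s/${esc}\[[0-9;]*m//g"
}

# Demo with a made-up log line wrapped in red/reset codes.
printf '\033[31mhttps://example.com/missing.js\033[0m\n' | strip_ansi
```

Assuming monolith writes its progress log to stderr, something like `monolith ... 2>&1 | strip_ansi` would apply the filter to a live run.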

The output HTML (2.2 MB filesize) loads the "Start" landing page properly. But clicking the "Start" button leads to an immediate "Collision. Try again" overlay.

The up/down arrow keys (engine power) & left/right arrow keys (rudder angle) are functional, although the ship is not moving behind the overlay. Clicking "Try again" button resets the engine power & rudder angle back to default, but the overlay remains.

Screenshot of Output HTML File:

[Image: Monolith Output File Error]

hairycactus, Oct 25 '21 13:10

Hi there!

Oh, it's that elusive trailing slash...

I was able to reproduce the issue. monolith -b https://edition.cnn.com/interactive/2021/03/cnnix-steership/ -F https://edition.cnn.com/interactive/2021/03/cnnix-steership/ -o /tmp/test.html results in that ship being dead in the water, and the same command without the base (-b) option just ends up in a document that says "Collision" even before letting the user take control of the ship.

You could try using this tool https://github.com/gildas-lormeau/SingleFile — there's also a browser extension available, so it should save it with those additionally-loaded JS files. Please let me know if it works.

snshn, Oct 29 '21 00:10

> You could try using this tool https://github.com/gildas-lormeau/SingleFile — there's also a browser extension available, so it should save it with those additionally-loaded JS files.

Thanks for the suggestion of the alternative SingleFile tool.

I don't use browser WebExtensions, & also generally prefer standalone solutions.

Meanwhile, the so-called CLI tool supplied in the master ZIP lacks a standalone compiled EXE binary that can be run conventionally from the command line. And as an IT layperson, I have no idea how to compile an executable file from the source code.

This noob got lost at the start of the CLI tool's README.MD. Dragging the supplied JS files into the browser does nothing, so I guess I shall pass.

> SingleFile can be launched from the command line by running it into a (headless) browser. It runs through Node.js as a standalone script injected into the web page.

I will continue to follow the Monolith project, in hopes of a future update that solves issues like this one.

Thanks !

hairycactus, Nov 28 '21 02:11

I understand what you mean by standalone solutions, and I completely agree. When I originally wrote Monolith as a POC project about 3-4 years ago, it was in JS, too, and somebody on IRC suggested rewriting it in something else, anything else. That's the short story of how it became a binary program written in Rust. Then a couple of awesome contributors helped to get it working for most sites. It does require openssl and certificates for working with https:// sources, but so does curl, so it's as portable as a modern networking tool can be.

It looks like even if I save https://edition.cnn.com/interactive/2021/03/cnnix-steership/ using Firefox's "Save page as...", it still doesn't let the ship move, the same issue as when saved with Monolith. I tried both HTML and MHTML, in both Chromium and Firefox. It could be that just one script fails to load, or perhaps there's an internal protection mechanism inside their apps at the JS level: if the domain is not *.cnn.com, the application intentionally refuses to work.
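One rough way to probe the domain-check idea is to search a downloaded copy of a page script for code that reads the page's hostname. The sketch below is an assumption-heavy diagnostic, not a proof: `count_host_checks` is a hypothetical helper, the sample script line is made up for the demo, and minified or obfuscated code could hide a check that this pattern would miss. `location.hostname`, `location.host`, and `document.domain` are standard browser properties.

```shell
#!/bin/sh
# Hypothetical helper: count the lines in a file that read the page's
# hostname, which a domain lock would need to do somewhere.
count_host_checks() {
    grep -c -E 'location\.host(name)?|document\.domain' "$1"
}

# Demo on a made-up script line (not taken from the CNN page).
sample=$(mktemp)
printf '%s\n' 'if (location.hostname.indexOf("cnn.com") < 0) { stop(); }' > "$sample"
count_host_checks "$sample"   # prints 1
rm -f "$sample"
```

It would make sense to run this on a script fetched separately, for example the boatV3.js file listed in the console output earlier in this thread, rather than on the saved single-file HTML. A non-zero count only shows that the script looks at the hostname, not what it does with it.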

I'll try to see what else could be done. Maybe it's time to make another CLI tool that pre-processes JS-heavy websites with asynchronously loading script assets and then feeds the result into monolith to make it fully accessible while offline. I'll think something up.

Oh, and just tried to save that page with SingleFile — even worse, even the controls don't work, not just the ship.

snshn, Dec 06 '21 07:12

Hey there,

Controls work, but the ship is still motionless. This is the best result I've managed to get so far, using Chromium, here's the command:

chromium --headless --disable-gpu --dump-dom https://edition.cnn.com/interactive/2021/03/cnnix-steership/ | monolith - -b https://edition.cnn.com/interactive/2021/03/cnnix-steership/ -o steership.html

Chrome should work as well, not sure about Brave.
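The pipeline above can be wrapped in a small function so the browser binary is easy to swap. This is a sketch only: `save_rendered` and the `BROWSER_BIN` variable are hypothetical names, and the flags are exactly the ones from the command above with nothing added.

```shell
#!/bin/sh
# Hypothetical wrapper: dump the rendered DOM with a headless browser and
# pipe it into monolith, using the same URL as the base (-b).
# BROWSER_BIN is an assumed variable name; it defaults to "chromium".
save_rendered() {
    url=$1
    out=$2
    browser=${BROWSER_BIN:-chromium}
    "$browser" --headless --disable-gpu --dump-dom "$url" |
        monolith - -b "$url" -o "$out"
}

# Example call (requires the browser and monolith to be installed):
#   save_rendered "https://edition.cnn.com/interactive/2021/03/cnnix-steership/" steership.html
```

Setting `BROWSER_BIN` to another Chromium-based browser's executable would select it instead, provided it accepts the same flags.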

I am 99% convinced at this point that it's done on purpose by developers, to prevent it from getting pirated... no pun intended.

snshn, Feb 18 '22 08:02