List of error cases

Open gwern opened this issue 2 years ago • 13 comments

While using the SingleFile-CLI to archive gwern.net external links, I review them manually and note when they don't seem to work either in Chrome SingleFile-CLI or FF SingleFile manually, and exclude the domain/URL. It might be helpful for SingleFile development to have a list of ones I exclude because they didn't work.

YouTube embeds are a particular culprit:

  • http://www.avclub.com/the-100-best-worst-and-weirdest-things-we-saw-on-the-1839566367 -- too much media
  • https://animetudes.com/2021/04/25/artist-spotlight-shoichi-masuo -- low quality (media)
  • https://medium.com/mindsoft/rats-in-doom-eb6c52c73aca -- video embed
  • https://danijar.com/project/apd -- video embed
  • https://brainwindows.wordpress.com/2009/10/14/playing-quake-with-a-real-mouse -- video embed
  • https://xbpeng.github.io/projects/VDB/index.html -- video embed
  • https://metarationality.com/rational-pcr -- video embed
  • http://lispm.de/symbolics-lisp-machine-ergonomics -- video embed
  • https://ashish-kmr.github.io/rma-legged-robots -- video embed
  • https://laion.ai/laion-400-open-dataset -- video embed
  • https://universome.github.io/stylegan-v -- video embed
  • https://ali-design.github.io/gan_steerability -- video embed
  • https://podcasts.google.com/feed/aHR0cHM6Ly9yc3MuYWNhc3QuY29tL2Rhbm55aW50aGV2YWxsZXk/episode/MDI4NDI4ODMtZmE3YS00MzA2LTk1ZGItZjgzZDdlMzAwZThk -- audio embed
  • https://fanfox.net/manga/oyasumi_punpun/v08/c084/15.html -- blocks mirroring
  • https://archiveofourown.org/works/2372021/chapters/5238359 -- blocks archiving
  • https://www.rte.ie/archives/2018/0322/949314-donegal-victorian-romantics -- video
  • https://old.reddit.com/r/MediaSynthesis/comments/c6axmr/close_the_world_txen_eht_nepo/ -- low-quality due to Imgur/image embeds
  • https://www.reddit.com/r/MediaSynthesis/comments/ssvv5e/ml_generated_pixel_art_portrait_of_david_bowie/ -- low-quality due to Imgur/image embeds
  • https://sebastianrisi.com/self_assembling_ai -- video embeds
  • https://www.janelia.org/project-team/flyem/hemibrain -- video embeds
  • https://mru.org/development-economics -- video embeds
  • https://trixter.oldskool.org -- low quality (YT embed breaks)
  • http://www.michaelburge.us/2019/05/21/marai-agent.html -- low quality (YT embed breaks)
  • http://linusakesson.net/scene/a-mind-is-born -- low quality (YT embed breaks)
  • https://ml.berkeley.edu/blog/posts/clip-art -- low-quality - archived version somehow self-redirects to an empty page?
  • https://ieeexplore.ieee.org/abstract/document/9150552 -- bad quality
  • http://magenta.tensorflow.org/maestro-wave2midi2wave -- low quality
  • https://www.nap.edu/catalog/25762/reflecting-sunlight-recommendations-for-solar-geoengineering-research-and-research-governance -- low quality
  • https://wellcomecollection.org/articles/XV_E7BEAACIAo9Vz -- low quality
  • https://www.themoneyillusion.com/why-china-should-celebrate-columbus-day -- low-quality
  • http://flashgamehistory.com -- low quality
  • https://www.jetbrains.com/lp/mono -- low quality
  • http://magenta.tensorflow.org/music-transformer -- low quality
  • http://magenta.tensorflow.org/piano-transformer -- low quality
  • https://wellcomecollection.org/works/d63gy9b7 -- low-quality
  • http://course.fast.ai/videos -- low quality
  • https://bwc.thelab.dc.gov -- low quality
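The exclusion step described above (skip a whole domain or a single URL once it is known not to archive well) could be sketched roughly as follows. This is a minimal illustration, not the script actually used for gwern.net; the names `EXCLUDED_DOMAINS`, `EXCLUDED_URLS` and `is_excluded` are hypothetical, and the sample entries are simply taken from the list.

```python
from urllib.parse import urlsplit

# Hypothetical exclusion lists; the entries are copied from the list above.
EXCLUDED_DOMAINS = {"fanfox.net", "archiveofourown.org"}
EXCLUDED_URLS = {"https://danijar.com/project/apd"}


def is_excluded(url: str) -> bool:
    """Return True if the URL should be skipped instead of archived."""
    # Exact-URL exclusions, ignoring a trailing slash.
    if url.rstrip("/") in EXCLUDED_URLS:
        return True
    host = (urlsplit(url).hostname or "").lower()
    # Domain exclusions match the domain itself and any of its subdomains.
    return any(host == d or host.endswith("." + d) for d in EXCLUDED_DOMAINS)
```

Matching on the parsed hostname rather than on a substring of the URL avoids accidentally excluding unrelated domains that merely contain an excluded name.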

gwern avatar Feb 17 '22 15:02 gwern

Thank you, I'll test them. I did some quick tests from your list and saved pages were OK on my side. Did you change any option in SingleFile?

gildas-lormeau avatar Feb 17 '22 21:02 gildas-lormeau

I've collected them over a long time so SF or the site may have changed. I have a lot of options set on the CLI but not on my FF install, so I can double-check ones which work for you.

gwern avatar Feb 17 '22 22:02 gwern

I often see a "video embed" cited as the cause of the issue. Note that SingleFile cannot save streamed videos like the ones on YouTube, Vimeo etc. because this is technically quite complex.

gildas-lormeau avatar Feb 17 '22 23:02 gildas-lormeau

I wouldn't expect it to save the video (that would be quite expensive in disk space, I'm not sure I want to spend it), I'd just like it to work at all when clicked on. As awful as videos usually are, these are cases where the videos are valuable eg. the demo scene pages.

gwern avatar Feb 17 '22 23:02 gwern

Thank you for the clarification. I'll do the tests with the default settings then.

gildas-lormeau avatar Feb 17 '22 23:02 gildas-lormeau

Ok, I just did some tests and I understand the problem for the video/audio contents that can't be played. The problem is that SingleFile removes the source of the media tags (<video>/<audio>). Worse, it adds a <meta> tag to define a CSP preventing the page from making any request to the internet. So, the solution I could propose in this case would be that SingleFile injects a link to the video/audio content. By the way, for YouTube, Vimeo, etc., normally the link in the top left corner of the video already allows this. The evolution I propose would be quite similar, it requires some work and some tests...
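To check whether a given saved page shows the two symptoms described here (media tags without a source, and a CSP declared in the page itself), something like the following could be run over the saved HTML. This is only a sketch using Python's standard `html.parser`; the class and function names are hypothetical, and it makes no assumption about the exact policy string SingleFile writes.

```python
from html.parser import HTMLParser


class SavedPageAudit(HTMLParser):
    """Collect CSP <meta> tags and media tags that lack a src attribute."""

    def __init__(self):
        super().__init__()
        self.csp_policies = []        # content of each CSP <meta> tag found
        self.media_without_src = []   # tag names, e.g. ["video", "audio"]

    def handle_starttag(self, tag, attrs):
        attr = dict(attrs)
        if tag == "meta":
            http_equiv = (attr.get("http-equiv") or "").lower()
            if http_equiv == "content-security-policy":
                self.csp_policies.append(attr.get("content") or "")
        elif tag in ("video", "audio"):
            # Only the tag's own src attribute is checked; nested <source>
            # children are not considered in this sketch.
            if not attr.get("src"):
                self.media_without_src.append(tag)


def audit_saved_page(html_text):
    """Parse saved HTML and return the populated audit object."""
    audit = SavedPageAudit()
    audit.feed(html_text)
    audit.close()
    return audit
```

A page for which `csp_policies` is non-empty and `media_without_src` lists a `video` or `audio` tag would be a candidate for the "embed does not play" category.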

gildas-lormeau avatar Feb 18 '22 00:02 gildas-lormeau

I would appreciate that. It keeps coming up and I keep thinking to myself, "it works on the original page, and embeds in theory can be plopped into any page on the Web and work in-place, so why doesn't it work in the SingleFile version? Maybe Gildas just doesn't know? I should report that one of these days..."

gwern avatar Feb 18 '22 00:02 gwern

You did right :) I may have to create a separate issue for this evolution. Another remark, I tested the following URLs with SingleFile CLI:

  • https://www.avclub.com/the-100-best-worst-and-weirdest-things-we-saw-on-the-1839566367
  • https://fanfox.net/manga/oyasumi_punpun/v08/c084/15.html
  • https://archiveofourown.org/works/2372021/chapters/5238359

Here the problem is that each site implements a particular technique to block the access to the content. AFAIK, there is no generic technical solution to this kind of problem. I guess a solution would be to inject as a user script a script that specifically handles this kind of problem (e.g. the code of "I don't care about cookies", https://github.com/PolishRoboDogHouse/IDCAC_uo_mirror/tree/main/src/). An alternative would be to pass your cookies to SingleFile CLI which is not great (and it would not work for data stored in the localStorage)... The simplest solution today is maybe to save them manually with the extension.

gildas-lormeau avatar Feb 18 '22 00:02 gildas-lormeau

Per-site techniques are an inevitable evil in any web archiving or browser plugin aspiring to not just punt every problem to the user, yeah.

> An alternative would be to pass your cookies to SingleFile CLI which is not great (and it would not work for data stored in the localStorage)...

Does SingleFile-CLI not already make use of cookies inside the browser profile it uses? I thought when I invoked SF-CLI, even if I had to specify the list of extensions etc, it was using the default profile and so picked up any cookies or other site-specific things. That seems desirable, regardless. (These are 'public' pages which work fine logged out and so a cookie can't really be necessary, but there are lots of URLs which will require logins or require that for any usable copy. For example, at this point, Twitter.com is basically unusable if you are logged out, and I assume that Reddit at some point will 'accidentally' break old.reddit.com for logged-out users. Hm... and how does this work with extensions, anyway? If it's ignoring my profile, does that mean it ignores all of the custom uBlock rules I've set up?)

gwern avatar Feb 18 '22 23:02 gwern

By default, SingleFile does not use the user profile. But you can pass the location of your user profile folder to Chrome as an argument with the browser-args option, e.g. --browser-args '["--user-data-dir=/path/to/your/user/data/folder"]' (the value is a JSON array, so it needs to be quoted in a shell). Concerning the extensions support, they do not work in headless mode. So you also have to pass the option --browser-headless=false. By the way, someone posted today the settings he uses to support extensions here: https://news.ycombinator.com/item?id=30416968
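Since the value of browser-args is a JSON array, one way to avoid quoting mistakes is to build the command as an argument list and let a JSON encoder produce the array. The sketch below only assembles the arguments and does not run anything; the function name and its parameters are hypothetical, while the options themselves are the ones mentioned in this comment.

```python
import json


def build_single_file_command(url, output_path, user_data_dir,
                              single_file="single-file"):
    """Return the argv list for a non-headless run with a user profile."""
    # Arguments forwarded to the browser, encoded below as a JSON array.
    browser_args = [f"--user-data-dir={user_data_dir}"]
    return [
        single_file,
        "--browser-headless=false",   # extensions do not work in headless mode
        "--browser-args", json.dumps(browser_args),
        url,
        output_path,
    ]
```

The resulting list can be handed to `subprocess.run(cmd, check=True)` as-is, which bypasses the shell entirely, or printed with `shlex.join(cmd)` to get a correctly quoted shell command line.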

gildas-lormeau avatar Feb 21 '22 23:02 gildas-lormeau

I have spent a while trying that, and failed completely. I reinstalled HEAD to make sure any fixes got in but no matter what combination of user-data-dir and/or profile-directory I use, it fails. (The weird JSON errors don't help.) All the obvious paths fail, and when I extract the ground truth from chrome://version per the docs, that makes little difference. I can successfully open up Chromium directly with a command like /usr/bin/chromium-browser --profile-directory="Default" --user-data-dir=/home/gwern/snap/chromium/common/chromium/ https://old.reddit.com/r/mlscaling/comments/szfey4/do_you_use_cloud_gpu_platforms/ and it will be as expected, but it just never ever works through the CLI. I deleted all other profiles, I renamed the profile to 'Default' to be sure, I nuked all old Chrome .config folders to be sure I was using the one I thought... It is extremely frustrating. The fact that nothing returns any errors anywhere makes debugging impossible. When I look at your comment and that HN comment, and I set -x to make sure I'm not being lied to by the shell and I see a command like

/home/gwern/src/SingleFile/cli/single-file --browser-executable-path /usr/bin/chromium-browser --browser-args '["--profile-directory=Default", "--user-data-dir=/home/gwern/snap/chromium/common/chromium/", "--load-extension=/home/gwern/snap/chromium/common/chromium/Default/Extensions/cjpalhdlnbpafiamejdnhcphjbkeiagm/1.41.4_1/", "--load-extension=/home/gwern/snap/chromium/common/chromium/Default/Extensions/dmghijelimhndkbmpgbldicpogfkceaj/0.4.2_0/", "--load-extension=/home/gwern/snap/chromium/common/chromium/Default/Extensions/doojmbjmlfjjnbmnoijecmcbfeoakpjm/11.3.3_0/", "--load-extension=/home/gwern/snap/chromium/common/chromium/Default/Extensions/kkdpmhnladdopljabkgpacgpliggeeaf/1.12.2_0/", "--load-extension=/home/gwern/snap/chromium/common/chromium/Default/Extensions/mpiodijhokgodhhofbcjdecpffjipkle/1.19.30_0/", "--load-extension=/home/gwern/snap/chromium/common/chromium/Default/Extensions/nkbihfbeogaeaoehlefnkodbefgpgknn/10.9.3_0/", "--load-extension=/home/gwern/snap/chromium/common/chromium/Default/Extensions/oolchklbojaobiipbmcnlgacfgficiig/1.3.4_0/", "--load-extension=/home/gwern/snap/chromium/common/chromium/Default/Extensions/padekgcemlokbadohgkifijomclgjgif/2.5.21_0/", "--load-extension=/home/gwern/snap/chromium/common/chromium/Default/Extensions/pioclpoplcdbaefihamjohnefbikjilc/7.19.0_0/"]' https://old.reddit.com/r/mlscaling/comments/szfey4/do_you_use_cloud_gpu_platforms/ /tmp/e0aef0cfb35180d0eec99dca850be3dfd880608c.html

It looks perfect. It specifies the right profile, and the right user-data-dir. It specifies all of the extensions via load-extension. And so on. And every time it pops up what looks like a completely clean temporary profile. I'm at my wit's end here. Are you sure SingleFile-CLI is not ignoring the browser-args or string-munging them in a way that leads to (completely silent) errors in Chromium? Have you verified you can make SF-CLI snapshots of a complex logged-in website like Reddit with your own Chromium account? That HN comment only talks about extensions and doesn't claim to be using the user config.
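One detail that may be worth ruling out separately: Chromium documents --load-extension as taking a comma-separated list of paths, and it is an assumption here (not verified anywhere in this thread) that repeating the switch keeps only one of the values. This would not explain the profile being ignored, but the repeated switches could be collapsed into one as sketched below; `merge_load_extension` is a hypothetical helper.

```python
def merge_load_extension(args):
    """Collapse repeated --load-extension=... switches into a single one.

    All other arguments are kept in their original order; the merged
    switch, if any, is appended at the end.
    """
    prefix = "--load-extension="
    paths = []
    merged = []
    for arg in args:
        if arg.startswith(prefix):
            # A switch may itself already hold a comma-separated list.
            paths.extend(p for p in arg[len(prefix):].split(",") if p)
        else:
            merged.append(arg)
    if paths:
        merged.append(prefix + ",".join(paths))
    return merged
```

The returned list can then be JSON-encoded and passed as the value of --browser-args.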

gwern avatar Feb 23 '22 17:02 gwern

@gwern sorry for the late answer, I was testing some of your URLs and it looks like most of the video embed issues are related to the fact that IFRAMEs would not be included in the saved page. Can you confirm that you remove frames from saved pages (i.e. pass --remove-frames)? I need some time to reproduce and debug the issue you described. I remember testing browser-args in the past and it was working as expected. It was a long time ago though.

gildas-lormeau avatar Mar 07 '22 18:03 gildas-lormeau

I confirm that I cannot reproduce the "video embed" issue with the default settings on all pages that embed YouTube videos in the list (i.e. almost all of them). The link at the top left of the video allows you to go to YouTube to view the video in question. However, these URLs include videos that are not hosted on YouTube:

  • https://animetudes.com/2021/04/25/artist-spotlight-shoichi-masuo/
  • https://ashish-kmr.github.io/rma-legged-robots/
  • https://sebastianrisi.com/self_assembling_ai/

In the next version of SingleFile (or the master branch today), an icon will be displayed at the top left of the video to open the URL of the video.

This will not work when the source of the video is a blob: URI though, e.g.:

  • https://www.rte.ie/archives/2018/0322/949314-donegal-victorian-romantics/
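The distinction between a video source that can be linked to and a blob: URI can be illustrated with a small check. This is a sketch of the idea only, not SingleFile's implementation; `media_link` is a hypothetical helper.

```python
from urllib.parse import urljoin, urlsplit


def media_link(src, page_url):
    """Return an absolute http(s) URL for a media source, or None.

    blob: URIs only exist inside the browser session that created them,
    so there is nothing useful to link to from a saved page.
    """
    # Resolve relative sources against the page URL; other schemes such as
    # blob: are returned unchanged by urljoin.
    absolute = urljoin(page_url, src)
    if urlsplit(absolute).scheme.lower() in ("http", "https"):
        return absolute
    return None
```

For a relative source such as `clip.mp4` this yields a URL on the original site, while any `blob:` source yields `None`.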

gildas-lormeau avatar Mar 12 '22 00:03 gildas-lormeau