Support: singlefile & readability fail to work

Open ghost opened this issue 1 year ago • 9 comments

For every snapshot I try, singlefile and readability fail. I assume readability may fail due to lack of the singlefile.html.

Error for singlefile:

SingleFile was not able to archive the page

If I run the command it tries to run in terminal I get:

bash: syntax error near unexpected token `('

The raw command I copied from the log section and ran to get the above:

/mnt/media/ArchiveBox/node_modules/single-file-cli/single-file --browser-executable-path=chromium --browser-args=[\"--headless=new\", \"--user-agent=Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/118.0.0.0 Safari/537.36 ArchiveBox/0.7.2 (+https://github.com/ArchiveBox/ArchiveBox/)\", \"--window-size=1440,2000\"] https://web.archive.org/web/20240301170542/https://www.roadandtrack.com/car-culture/a46975496/behind-f1-velvet-curtain/ singlefile.html

This error occurs for every article I try to archive.

Readability error is:

Readability was not able to archive the page (invalid JSON)

But, again, since I see a reference to singlefile.html in the command I expect solving the above will solve this.

The output of archivebox version is:

0.7.2
ArchiveBox v0.7.2 BUILD_TIME=2024-02-28 11:27:51 1709137671
IN_DOCKER=False IN_QEMU=False ARCH=x86_64 OS=Linux PLATFORM=Linux-5.15.0-101-generic-x86_64-with-glibc2.35 PYTHON=Cpython
FS_ATOMIC=True FS_REMOTE=False FS_USER=1000:1000 FS_PERMS=755
DEBUG=False IS_TTY=True TZ=UTC SEARCH_BACKEND=ripgrep LDAP=False

[i] Dependency versions:
 √  PYTHON_BINARY         v3.10.12        valid     /usr/bin/python3.10                                                         
 √  SQLITE_BINARY         v2.6.0          valid     /usr/lib/python3.10/sqlite3/dbapi2.py                                       
 √  DJANGO_BINARY         v3.1.14         valid     /home/ahermitforhire/.local/lib/python3.10/site-packages/django/__init__.py 
 √  ARCHIVEBOX_BINARY     v0.7.2          valid     /home/ahermitforhire/.local/bin/archivebox                                  

 √  CURL_BINARY           v7.81.0         valid     /usr/bin/curl                                                               
 √  WGET_BINARY           v1.21.2         valid     /usr/bin/wget                                                               
 √  NODE_BINARY           v12.22.9        valid     /usr/bin/node                                                               
 √  SINGLEFILE_BINARY     v1.1.54         valid     ./node_modules/single-file-cli/single-file                                  
 √  READABILITY_BINARY    v0.0.11         valid     ./node_modules/readability-extractor/readability-extractor                  
 √  MERCURY_BINARY        v1.0.0          valid     ./node_modules/@postlight/parser/cli.js                                     
 √  GIT_BINARY            v2.34.1         valid     /usr/bin/git                                                                
 √  YOUTUBEDL_BINARY      v2023.12.30     valid     /home/ahermitforhire/.local/bin/yt-dlp                                      
 √  CHROME_BINARY         v122.0.6261.94  valid     /usr/bin/chromium                                                           
 √  RIPGREP_BINARY        v13.0.0         valid     /usr/bin/rg                                                                 

[i] Source-code locations:
 √  PACKAGE_DIR           23 files        valid     /home/ahermitforhire/.local/lib/python3.10/site-packages/archivebox         
 √  TEMPLATES_DIR         3 files         valid     /home/ahermitforhire/.local/lib/python3.10/site-packages/archivebox/templates
 -  CUSTOM_TEMPLATES_DIR  -               disabled  None                                                                        

[i] Secrets locations:
 -  CHROME_USER_DATA_DIR  -               disabled  None                                                                        
 -  COOKIES_FILE          -               disabled  None                                                                        

[i] Data locations:
 √  OUTPUT_DIR            8 files         valid     /mnt/media/ArchiveBox                                                       
 √  SOURCES_DIR           11 files        valid     ./sources                                                                   
 √  LOGS_DIR              1 files         valid     ./logs                                                                      
 √  ARCHIVE_DIR           5 files         valid     ./archive                                                                   
 √  CONFIG_FILE           238.0 Bytes     valid     ./ArchiveBox.conf                                                           
 √  SQL_INDEX             328.0 KB        valid     ./index.sqlite3  

ghost avatar Mar 25 '24 14:03 ghost

Try running this:

/mnt/media/ArchiveBox/node_modules/single-file-cli/single-file --browser-executable-path=chromium 'https://web.archive.org/web/20240301170542/https://www.roadandtrack.com/car-culture/a46975496/behind-f1-velvet-curtain/' singlefile.html 

Also, are you archiving a URL that's already on the Internet Archive? You can try it, but we don't support that very well. If you do that a lot, you may want to follow this issue: https://github.com/ArchiveBox/ArchiveBox/issues/160

pirate avatar Mar 25 '24 14:03 pirate

If I do that in terminal I get:

Unexpected token '?'

Note: the error I described happens on ANY URL I try to add as mentioned in my initial post, not just archive.org links. For example:

/mnt/media/ArchiveBox/node_modules/single-file-cli/single-file --browser-executable-path=chromium --browser-args=[\"--headless=new\", \"--user-agent=Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/118.0.0.0 Safari/537.36 ArchiveBox/0.7.2 (+https://github.com/ArchiveBox/ArchiveBox/)\", \"--window-size=1440,2000\"] https://www.theguardian.com/us-news/2016/aug/30/us-national-parks-fire-lookout-forest-wildfire singlefile.html

Gets:

bash: syntax error near unexpected token `('

(I noticed that the code block in my initial post dropped the symbol before the parenthesis, and I have edited the post to restore it.)

Also, I don't plan on using the terminal over the web interface to add new snapshots. The only reason I ran the command in terminal was to get more details of the error, so I'd like to see what can be done to solve this to enable the use of the web UI. Thanks!

ghost avatar Mar 25 '24 14:03 ghost

Can you screenshot the terminal running the command and producing this error: Unexpected token '?'

(Manually remove the user agent args when running that copy-pasted command; the quote escaping is what's causing several of the errors you're seeing, such as: syntax error near unexpected token `(')
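
One possible cause of the Unexpected token '?' error, offered as a guess and not confirmed anywhere in this thread: the version output above lists NODE_BINARY v12.22.9, and Node.js only parses the ?. and ?? operators from version 14 onward. If single-file-cli uses that syntax (an assumption), Node 12 would reject it with this kind of message. A minimal sketch of the check follows; node_major is a hypothetical helper, and the version string is hard-coded from the report rather than read from a live system:

```shell
#!/bin/sh
# node_major is a hypothetical helper: it extracts the major version
# from a string shaped like "v12.22.9".
node_major() {
  v="${1#v}"          # drop the leading "v"
  echo "${v%%.*}"     # keep everything before the first dot
}

# Hard-coded from the version output in this thread; on a real system
# use: reported="$(node --version)"
reported="v12.22.9"
major="$(node_major "$reported")"

# Optional chaining (?.) and nullish coalescing (??) need Node.js 14+.
if [ "$major" -lt 14 ]; then
  verdict="too old for ?. and ?? syntax"
else
  verdict="new enough for ?. and ?? syntax"
fi
echo "Node.js $reported: $verdict"
```

If the check reports an old Node.js, upgrading it and re-running the command without the user agent args would show whether this was the cause.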

pirate avatar Mar 25 '24 20:03 pirate

I encountered the same issue while running in Docker. I used the command from the web logs:

/app/node_modules/single-file-cli/single-file --browser-executable-path=chromium-browser --browser-args=[\"--headless=new\", \"--no-sandbox\", \"--no-zygote\", \"--disable-dev-shm-usage\", \"--disable-software-rasterizer\", \"--run-all-compositor-stages-before-draw\", \"--hide-scrollbars\", \"--window-size=1440,2000\", \"--autoplay-policy=no-user-gesture-required\", \"--no-first-run\", \"--use-fake-ui-for-media-stream\", \"--use-fake-device-for-media-stream\", \"--disable-sync\", \"--user-agent=Mozilla/5.0 (Linux; Android 6.0; Nexus 5 Build/MRA58N) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/126.0.0.0 Mobile Safari/537.36 Edg/126.0.0.0\", \"--window-size=1440,2000\", \"--user-data-dir=/tmp/test\"] https://google.com singlefile.html

The result was:

bash: syntax error near unexpected token `('

When I modified the command to:

/app/node_modules/single-file-cli/single-file --browser-executable-path=chromium-browser --browser-args="[\"--headless=new\", \"--no-sandbox\", \"--no-zygote\", \"--disable-dev-shm-usage\", \"--disable-software-rasterizer\", \"--run-all-compositor-stages-before-draw\", \"--hide-scrollbars\", \"--window-size=1440,2000\", \"--autoplay-policy=no-user-gesture-required\", \"--no-first-run\", \"--use-fake-ui-for-media-stream\", \"--use-fake-device-for-media-stream\", \"--disable-sync\", \"--user-agent=Mozilla/5.0 (Linux; Android 6.0; Nexus 5 Build/MRA58N) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/126.0.0.0 Mobile Safari/537.36 Edg/126.0.0.0\", \"--window-size=1440,2000\", \"--user-data-dir=/tmp/test\"]" https://google.com singlefile.html

The program ran correctly. The difference lies in whether double quotes are used for the --browser-args parameter.
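
The difference can be reproduced without single-file at all. The sketch below uses a hypothetical count_args function and a shortened argument list to show the two effects of leaving the value unquoted: the space after each comma splits it into separate words, and a bare ( is a shell syntax error.

```shell
#!/usr/bin/env bash
# count_args is a hypothetical stand-in for single-file: it only reports
# how many arguments the shell passed to it.
count_args() { echo "$#"; }

# Unquoted: the space after the comma splits the value into two words.
unquoted=$(count_args --browser-args=[\"--headless=new\", \"--window-size=1440,2000\"])

# Quoted: the whole JSON array arrives as a single argument.
quoted=$(count_args --browser-args="[\"--headless=new\", \"--window-size=1440,2000\"]")

echo "unquoted: $unquoted word(s), quoted: $quoted word(s)"

# An unquoted parenthesis is rejected by the parser before anything runs.
if bash -c 'echo --user-agent=Mozilla/5.0 (Linux; Android 6.0)' 2>/dev/null; then
  paren="accepted"
else
  paren="syntax error"
fi
echo "unquoted parenthesis: $paren"
```

Wrapping the whole --browser-args value in single quotes instead would also work and would make the backslashes before the inner double quotes unnecessary.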

RuiQui avatar Jul 25 '24 13:07 RuiQui

That's a quirk of how the helper terminal commands printed for debugging differ between Docker and non-Docker setups.

I believe those helper commands currently assume the user is running them in Docker, but as you encountered, there is an extra layer of shell escaping that needs to be added or removed to handle quotes properly outside Docker.

I might improve it in a future release by changing the helper commands depending on whether IN_DOCKER is True or False, but to be honest it's a low priority for me for now.
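
As an illustration of what a copy-paste-safe helper command involves (a sketch only, not ArchiveBox's code), each argument can be wrapped in single quotes before it is printed. shell_quote is a hypothetical function, and the argument is shortened from the commands above:

```shell
#!/bin/sh
# shell_quote is a hypothetical helper: it wraps its argument in single
# quotes and rewrites each embedded ' as '\'' so the result can be pasted
# into a POSIX shell as exactly one word.
shell_quote() {
  printf "'%s'" "$(printf '%s' "$1" | sed "s/'/'\\\\''/g")"
}

arg='--browser-args=["--headless=new", "--user-agent=Mozilla/5.0 (Linux; Android 6.0)"]'
quoted=$(shell_quote "$arg")
printf '%s\n' "$quoted"

# Round trip: re-parse the quoted text the way a pasted command would be.
eval "set -- $quoted"
if [ "$#" -eq 1 ] && [ "$1" = "$arg" ]; then
  roundtrip="ok"
else
  roundtrip="mismatch"
fi
echo "round trip: $roundtrip"
```

In Python, which ArchiveBox is written in, shlex.quote serves the same purpose.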

Given that single-file ran correctly for you when executed directly, but not when it's run within ArchiveBox, that points to the issue being either in ArchiveBox's code or in your dependency configuration.

Can you share the full output of the following commands so I can investigate further:

cd /mnt/media/ArchiveBox
archivebox config

cat package.json

which -a node
node --version

which -a single-file
/app/node_modules/single-file-cli/single-file --version
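
To gather all of that output in one file for pasting into a reply, a small wrapper along these lines may help. It is a sketch: run, DATA_DIR, SINGLEFILE, and diagnostics.txt are names made up for this example, and the paths should be adjusted to your install.

```shell
#!/bin/sh
# Hypothetical settings; adjust for your setup.
DATA_DIR="${DATA_DIR:-.}"                # e.g. /mnt/media/ArchiveBox
SINGLEFILE="${SINGLEFILE:-single-file}"  # e.g. /app/node_modules/single-file-cli/single-file
report="$PWD/diagnostics.txt"

# run prints the command line, then its stdout and stderr, and records a
# non-zero exit status instead of aborting the script.
run() {
  printf '$ %s\n' "$*"
  "$@" 2>&1 || printf '(exit status %s)\n' "$?"
  echo
}

cd "$DATA_DIR" || exit 1
{
  run archivebox config
  run cat package.json
  run which -a node
  run node --version
  run which -a single-file
  run "$SINGLEFILE" --version
} > "$report"

echo "wrote $report"
```

Commands that are missing show up in the file as errors instead of stopping the script, which is itself useful information.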

pirate avatar Aug 12 '24 21:08 pirate