ArchiveBox
Support: singlefile & readability fail to work
For every snapshot I try, singlefile and readability fail. I assume readability may fail due to lack of the singlefile.html.
Error for singlefile:
SingleFile was not able to archive the page
If I run the command it is trying to run in a terminal myself, I get:
bash: syntax error near unexpected token `('
The raw command I copied from the log section and ran to get the above:
/mnt/media/ArchiveBox/node_modules/single-file-cli/single-file --browser-executable-path=chromium --browser-args=[\"--headless=new\", \"--user-agent=Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/118.0.0.0 Safari/537.36 ArchiveBox/0.7.2 (+https://github.com/ArchiveBox/ArchiveBox/)\", \"--window-size=1440,2000\"] https://web.archive.org/web/20240301170542/https://www.roadandtrack.com/car-culture/a46975496/behind-f1-velvet-curtain/ singlefile.html
This error occurs for every article I try to archive.
Readability error is:
Readability was not able to archive the page (invalid JSON)
But, again, since I see a reference to singlefile.html in the command I expect solving the above will solve this.
The output of archivebox version is:
0.7.2
ArchiveBox v0.7.2 BUILD_TIME=2024-02-28 11:27:51 1709137671
IN_DOCKER=False IN_QEMU=False ARCH=x86_64 OS=Linux PLATFORM=Linux-5.15.0-101-generic-x86_64-with-glibc2.35 PYTHON=Cpython
FS_ATOMIC=True FS_REMOTE=False FS_USER=1000:1000 FS_PERMS=755
DEBUG=False IS_TTY=True TZ=UTC SEARCH_BACKEND=ripgrep LDAP=False
[i] Dependency versions:
√ PYTHON_BINARY v3.10.12 valid /usr/bin/python3.10
√ SQLITE_BINARY v2.6.0 valid /usr/lib/python3.10/sqlite3/dbapi2.py
√ DJANGO_BINARY v3.1.14 valid /home/ahermitforhire/.local/lib/python3.10/site-packages/django/__init__.py
√ ARCHIVEBOX_BINARY v0.7.2 valid /home/ahermitforhire/.local/bin/archivebox
√ CURL_BINARY v7.81.0 valid /usr/bin/curl
√ WGET_BINARY v1.21.2 valid /usr/bin/wget
√ NODE_BINARY v12.22.9 valid /usr/bin/node
√ SINGLEFILE_BINARY v1.1.54 valid ./node_modules/single-file-cli/single-file
√ READABILITY_BINARY v0.0.11 valid ./node_modules/readability-extractor/readability-extractor
√ MERCURY_BINARY v1.0.0 valid ./node_modules/@postlight/parser/cli.js
√ GIT_BINARY v2.34.1 valid /usr/bin/git
√ YOUTUBEDL_BINARY v2023.12.30 valid /home/ahermitforhire/.local/bin/yt-dlp
√ CHROME_BINARY v122.0.6261.94 valid /usr/bin/chromium
√ RIPGREP_BINARY v13.0.0 valid /usr/bin/rg
[i] Source-code locations:
√ PACKAGE_DIR 23 files valid /home/ahermitforhire/.local/lib/python3.10/site-packages/archivebox
√ TEMPLATES_DIR 3 files valid /home/ahermitforhire/.local/lib/python3.10/site-packages/archivebox/templates
- CUSTOM_TEMPLATES_DIR - disabled None
[i] Secrets locations:
- CHROME_USER_DATA_DIR - disabled None
- COOKIES_FILE - disabled None
[i] Data locations:
√ OUTPUT_DIR 8 files valid /mnt/media/ArchiveBox
√ SOURCES_DIR 11 files valid ./sources
√ LOGS_DIR 1 files valid ./logs
√ ARCHIVE_DIR 5 files valid ./archive
√ CONFIG_FILE 238.0 Bytes valid ./ArchiveBox.conf
√ SQL_INDEX 328.0 KB valid ./index.sqlite3
Try running this:
/mnt/media/ArchiveBox/node_modules/single-file-cli/single-file --browser-executable-path=chromium 'https://web.archive.org/web/20240301170542/https://www.roadandtrack.com/car-culture/a46975496/behind-f1-velvet-curtain/' singlefile.html
Also, you are archiving a URL that's already on the Internet Archive? You can try it, but we don't really support that very well. You may want to follow this issue if you do that a lot: https://github.com/ArchiveBox/ArchiveBox/issues/160
If I do that in terminal I get:
Unexpected token '?'
Note: the error I described happens on ANY URL I try to add as mentioned in my initial post, not just archive.org links. For example:
/mnt/media/ArchiveBox/node_modules/single-file-cli/single-file --browser-executable-path=chromium --browser-args=[\"--headless=new\", \"--user-agent=Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/118.0.0.0 Safari/537.36 ArchiveBox/0.7.2 (+https://github.com/ArchiveBox/ArchiveBox/)\", \"--window-size=1440,2000\"] https://www.theguardian.com/us-news/2016/aug/30/us-national-parks-fire-lookout-forest-wildfire singlefile.html
Gets:
bash: syntax error near unexpected token `('
(I noticed in my initial post the code block removed the symbol before the parenthesis and I have edited to reflect that)
Also, I don't plan on using the terminal over the web interface to add new snapshots. The only reason I ran the command in terminal was to get more details of the error, so I'd like to see what can be done to solve this to enable the use of the web UI. Thanks!
Can you screenshot the terminal running the command and getting this error: Unexpected token '?'
(Manually remove the user agent args when running that copy-pasted command, as the quote escaping is what's causing a bunch of the errors you're seeing, e.g. syntax error near unexpected token `(')
I encountered the same issue while running in Docker. I used the command from the web logs:
/app/node_modules/single-file-cli/single-file --browser-executable-path=chromium-browser --browser-args=[\"--headless=new\", \"--no-sandbox\", \"--no-zygote\", \"--disable-dev-shm-usage\", \"--disable-software-rasterizer\", \"--run-all-compositor-stages-before-draw\", \"--hide-scrollbars\", \"--window-size=1440,2000\", \"--autoplay-policy=no-user-gesture-required\", \"--no-first-run\", \"--use-fake-ui-for-media-stream\", \"--use-fake-device-for-media-stream\", \"--disable-sync\", \"--user-agent=Mozilla/5.0 (Linux; Android 6.0; Nexus 5 Build/MRA58N) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/126.0.0.0 Mobile Safari/537.36 Edg/126.0.0.0\", \"--window-size=1440,2000\", \"--user-data-dir=/tmp/test\"] https://google.com singlefile.html
The result was:
bash: syntax error near unexpected token `('
When I modified the command to:
/app/node_modules/single-file-cli/single-file --browser-executable-path=chromium-browser --browser-args="[\"--headless=new\", \"--no-sandbox\", \"--no-zygote\", \"--disable-dev-shm-usage\", \"--disable-software-rasterizer\", \"--run-all-compositor-stages-before-draw\", \"--hide-scrollbars\", \"--window-size=1440,2000\", \"--autoplay-policy=no-user-gesture-required\", \"--no-first-run\", \"--use-fake-ui-for-media-stream\", \"--use-fake-device-for-media-stream\", \"--disable-sync\", \"--user-agent=Mozilla/5.0 (Linux; Android 6.0; Nexus 5 Build/MRA58N) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/126.0.0.0 Mobile Safari/537.36 Edg/126.0.0.0\", \"--window-size=1440,2000\", \"--user-data-dir=/tmp/test\"]" https://google.com singlefile.html
The program ran correctly. The difference lies in whether double quotes are used for the --browser-args parameter.
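To illustrate why the quoting matters, here is a minimal Python sketch that turns an argument list into a string that is safe to paste into bash. It uses only the standard library's shlex module. The argument list is a shortened, hypothetical version of the command above, and this is not a claim about how ArchiveBox builds its helper commands.

```python
import shlex

# Hypothetical, shortened argument list modeled on the logged command above.
# The --browser-args value is a single argument that contains spaces,
# double quotes, brackets, and parentheses.
browser_args = (
    '["--headless=new", '
    '"--user-agent=Mozilla/5.0 (Linux; Android 6.0)", '
    '"--window-size=1440,2000"]'
)
argv = [
    "/app/node_modules/single-file-cli/single-file",
    "--browser-executable-path=chromium-browser",
    f"--browser-args={browser_args}",
    "https://google.com",
    "singlefile.html",
]

# shlex.join quotes every argument that contains shell-special characters,
# so bash hands each one to the program unchanged.
command = shlex.join(argv)
print(command)

# Round trip: splitting the string the way a POSIX shell would
# yields the original argument list again.
assert shlex.split(command) == argv
```

shlex.join wraps the whole --browser-args value in single quotes, so bash never sees a bare parenthesis. The manually added double quotes in the working command above achieve the same thing.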
That's a quirk of how the helper terminal commands outputted for debugging differ between docker/non-docker setups.
I believe currently those helper commands assume the user is running the command in docker, but as you encountered there is an extra layer of shell escaping that needs to be added/removed to handle quotes properly outside Docker.
I might improve it in a future release by changing the helper commands depending on whether IN_DOCKER=True/False, but to be honest it's a low priority for me for now.
Given that single-file ran correctly for you when executed directly, but not when it's run within ArchiveBox, that points to the issue being either in ArchiveBox's code or in your dependency configuration.
Can you share the full output of the following commands so I can investigate further:
cd /mnt/media/ArchiveBox
archivebox config
cat package.json
which -a node
node --version
which -a single-file
/app/node_modules/single-file-cli/single-file --version