ArchiveBox
Bug: If document's title tag is empty, title extractor sets the snapshot title to "</title"
Describe the bug
I saved a webpage which is terribly coded by hand, and has an empty title tag. (Literally: `
TBH I'm not sure if you should care, since we may not care if horribly invalid documents create errors. But on the off chance that it's easy to check for and change this in the code, I'm filing a bug report. (Perhaps such snapshots could be named "No document title found".)
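To illustrate the kind of check being asked for, here is a minimal sketch using only Python's standard library. It is not ArchiveBox's actual extractor code; `TitleParser`, `extract_title`, and `FALLBACK_TITLE` are hypothetical names chosen for this example, and the fallback string is just the wording suggested above.

```python
from html.parser import HTMLParser

# Hypothetical fallback text, taken from the suggestion in this report.
FALLBACK_TITLE = "No document title found"


class TitleParser(HTMLParser):
    """Collect the text inside the first <title> element, if any."""

    def __init__(self):
        super().__init__()
        self.in_title = False
        self.done = False
        self.chunks = []

    def handle_starttag(self, tag, attrs):
        # Only the first <title> element is considered.
        if tag == "title" and not self.done:
            self.in_title = True

    def handle_endtag(self, tag):
        if tag == "title" and self.in_title:
            self.in_title = False
            self.done = True

    def handle_data(self, data):
        if self.in_title:
            self.chunks.append(data)


def extract_title(html: str) -> str:
    """Return the document title, or a fallback if it is missing or empty."""
    parser = TitleParser()
    parser.feed(html)
    parser.close()
    # Collapse runs of whitespace so a whitespace-only title counts as empty.
    title = " ".join("".join(parser.chunks).split())
    return title or FALLBACK_TITLE
```

With a check like this, a page whose title tag is empty (or absent) would get the fallback string instead of a stray fragment of markup.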
Steps to reproduce
- Saved this page to ArchiveBox: http://wildwestcycle.com/f_oiltempdegradation.html
- Snapshot title is `</title`
Screenshots or log output
ArchiveBox version
ArchiveBox v0.6.2
Cpython Linux Linux-4.4.302+-x86_64-with-glibc2.28 x86_64
IN_DOCKER=True DEBUG=False IS_TTY=True TZ=UTC SEARCH_BACKEND_ENGINE=ripgrep
[i] Dependency versions:
√ ARCHIVEBOX_BINARY v0.6.2 valid /usr/local/bin/archivebox
√ PYTHON_BINARY v3.9.5 valid /usr/local/bin/python3.9
√ DJANGO_BINARY v3.1.10 valid /usr/local/lib/python3.9/site-packages/django/bin/django-admin.py
√ CURL_BINARY v7.64.0 valid /usr/bin/curl
√ WGET_BINARY v1.20.1 valid /usr/bin/wget
√ NODE_BINARY v15.14.0 valid /usr/bin/node
√ SINGLEFILE_BINARY v0.3.16 valid /node/node_modules/single-file/cli/single-file
√ READABILITY_BINARY v0.0.2 valid /node/node_modules/readability-extractor/readability-extractor
√ MERCURY_BINARY v1.0.0 valid /node/node_modules/@postlight/mercury-parser/cli.js
- GIT_BINARY - disabled /usr/bin/git
- YOUTUBEDL_BINARY - disabled /usr/local/bin/youtube-dl
√ CHROME_BINARY v90.0.4430.93 valid /usr/bin/chromium
√ RIPGREP_BINARY v0.10.0 valid /usr/bin/rg
[i] Source-code locations:
√ PACKAGE_DIR 22 files valid /app/archivebox
√ TEMPLATES_DIR 3 files valid /app/archivebox/templates
- CUSTOM_TEMPLATES_DIR - disabled
[i] Secrets locations:
- CHROME_USER_DATA_DIR - disabled
- COOKIES_FILE - disabled
[i] Data locations:
√ OUTPUT_DIR 5 files valid /data
√ SOURCES_DIR 136 files valid ./sources
√ LOGS_DIR 1 files valid ./logs
√ ARCHIVE_DIR 141 files valid ./archive
√ CONFIG_FILE 81.0 Bytes valid ./ArchiveBox.conf
√ SQL_INDEX 1.1 MB valid ./index.sqlite3
Hey, was this resolved?
Yeah, it should be. Try the latest dev build (https://github.com/ArchiveBox/ArchiveBox#install-and-run-a-specific-github-branch) or v0.7.2.
Comment back if it's still happening and I'll re-open it.