
Bug: If document's title tag is empty, title extractor sets the snapshot title to "</title"

Open rmohns opened this issue 2 years ago • 2 comments

Describe the bug

I saved a webpage which is terribly coded by hand and has an empty title tag. (Literally: `<title></title>`.) The resulting snapshot is named "</title". Easy to change but odd.

TBH I'm not sure if you should care, since it may not matter if horribly invalid documents create errors. But on the off chance that it's easy to check for and change this in the code, I'm filing a bug report. (Perhaps such snapshots could be named "No document title found".)
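For illustration, here is a minimal sketch of how this symptom could arise and how an empty title could be handled. It is not taken from the ArchiveBox source: `NAIVE_TITLE_REGEX`, `TitleParser`, and `extract_title` are hypothetical names, and the regex is only an assumed example of a pattern whose capture group can swallow the `<` of the closing tag.

```python
import re
from html.parser import HTMLParser
from typing import Optional

# Hypothetical pattern (assumption, for illustration only): the leading "."
# in the capture group can match "<", so an empty <title></title> yields "</title".
NAIVE_TITLE_REGEX = re.compile(r'<title.*?>(.[^<>]+)', re.IGNORECASE | re.DOTALL)


class TitleParser(HTMLParser):
    """Collect the text inside the first <title> element."""

    def __init__(self) -> None:
        super().__init__()
        self.in_title = False
        self.done = False
        self.parts = []

    def handle_starttag(self, tag, attrs):
        # Only the first <title> element is considered.
        if tag == "title" and not self.done:
            self.in_title = True

    def handle_data(self, data):
        if self.in_title:
            self.parts.append(data)

    def handle_endtag(self, tag):
        if tag == "title" and self.in_title:
            self.in_title = False
            self.done = True


def extract_title(html: str, fallback: Optional[str] = None) -> Optional[str]:
    """Return the whitespace-normalized title, or `fallback` if it is empty or missing."""
    parser = TitleParser()
    parser.feed(html)
    parser.close()
    title = " ".join("".join(parser.parts).split())
    return title or fallback


page = "<html><head><title></title></head><body></body></html>"
naive = NAIVE_TITLE_REGEX.search(page).group(1)  # reproduces the odd "</title" value
safe = extract_title(page, fallback="No document title found")
```

Using a real HTML parser rather than a regex means an empty or whitespace-only title comes back as an empty string, which can then be swapped for a placeholder such as the "No document title found" suggested above.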

Steps to reproduce

  1. Saved this page to ArchiveBox: http://wildwestcycle.com/f_oiltempdegradation.html
  2. Snapshot title is </title

Screenshots or log output

[Image: Screenshot 2023-08-29 at 5 05 12 PM]

ArchiveBox version

ArchiveBox v0.6.2
Cpython Linux Linux-4.4.302+-x86_64-with-glibc2.28 x86_64
IN_DOCKER=True DEBUG=False IS_TTY=True TZ=UTC SEARCH_BACKEND_ENGINE=ripgrep

[i] Dependency versions:
 √  ARCHIVEBOX_BINARY     v0.6.2          valid     /usr/local/bin/archivebox                                                   
 √  PYTHON_BINARY         v3.9.5          valid     /usr/local/bin/python3.9                                                    
 √  DJANGO_BINARY         v3.1.10         valid     /usr/local/lib/python3.9/site-packages/django/bin/django-admin.py           
 √  CURL_BINARY           v7.64.0         valid     /usr/bin/curl                                                               
 √  WGET_BINARY           v1.20.1         valid     /usr/bin/wget                                                               
 √  NODE_BINARY           v15.14.0        valid     /usr/bin/node                                                               
 √  SINGLEFILE_BINARY     v0.3.16         valid     /node/node_modules/single-file/cli/single-file                              
 √  READABILITY_BINARY    v0.0.2          valid     /node/node_modules/readability-extractor/readability-extractor              
 √  MERCURY_BINARY        v1.0.0          valid     /node/node_modules/@postlight/mercury-parser/cli.js                         
 -  GIT_BINARY            -               disabled  /usr/bin/git                                                                
 -  YOUTUBEDL_BINARY      -               disabled  /usr/local/bin/youtube-dl                                                   
 √  CHROME_BINARY         v90.0.4430.93   valid     /usr/bin/chromium                                                           
 √  RIPGREP_BINARY        v0.10.0         valid     /usr/bin/rg                                                                 

[i] Source-code locations:
 √  PACKAGE_DIR           22 files        valid     /app/archivebox                                                             
 √  TEMPLATES_DIR         3 files         valid     /app/archivebox/templates                                                   
 -  CUSTOM_TEMPLATES_DIR  -               disabled                                                                              

[i] Secrets locations:
 -  CHROME_USER_DATA_DIR  -               disabled                                                                              
 -  COOKIES_FILE          -               disabled                                                                              

[i] Data locations:
 √  OUTPUT_DIR            5 files         valid     /data                                                                       
 √  SOURCES_DIR           136 files       valid     ./sources                                                                   
 √  LOGS_DIR              1 files         valid     ./logs                                                                      
 √  ARCHIVE_DIR           141 files       valid     ./archive                                                                   
 √  CONFIG_FILE           81.0 Bytes      valid     ./ArchiveBox.conf                                                           
 √  SQL_INDEX             1.1 MB          valid     ./index.sqlite3                                                             

rmohns, Aug 29 '23 21:08

Hey, was this resolved?

i-am-pluto, Oct 24 '23 09:10

Yeah should be, try the latest dev build https://github.com/ArchiveBox/ArchiveBox#install-and-run-a-specific-github-branch or v0.7.2

comment back if it's still happening and I'll re-open it

pirate, Oct 25 '23 21:10