
Full URL in WORKER log

KathrynN opened this issue 4 years ago • 1 comment

Background: When crawling a website, it is not uncommon to see output like

  #0 WORK http://resource.history.org.ua/cgi-bin/eiu/history.exe?&I21DBN=EJRN
  #1 WORK http://resource.history.org.ua/cgi-bin/eiu/history.exe?&I21DBN=EJRN
  #2 WORK http://resource.history.org.ua/cgi-bin/eiu/history.exe?&I21DBN=EJRN
  #3 WORK http://resource.history.org.ua/cgi-bin/eiu/history.exe?&I21DBN=EJRN
  #4 WORK http://resource.history.org.ua/cgi-bin/eiu/history.exe?&I21DBN=EJRN

The URLs appear to be truncated, which makes it hard to tell whether they point to genuinely different content, or whether they are all the same page with different ?param=1 query fields that redirect to the same place and could be excluded with a simple tweak to the config.yaml.

DoD: Implement an argument that allows the full URL to be printed, even if it wraps onto multiple lines.
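Until such an option exists, one quick way to check whether the logged URLs really are duplicates is to count the full URLs recorded in pages.jsonl outside the crawler. A minimal sketch, assuming pages.jsonl holds one {"url": ...} object per line (the sample file and URLs below are invented for illustration):

```shell
# Invented sample of the crawler's pages.jsonl (assumed format: one JSON
# object per line with a "url" field holding the full, untruncated URL).
cat > pages.jsonl <<'EOF'
{"url":"http://example.com/history.exe?I21DBN=EJRN&P21DBN=EJRN"}
{"url":"http://example.com/history.exe?I21DBN=EJRN&P21DBN=EJRN"}
{"url":"http://example.com/history.exe?I21DBN=OTHER"}
EOF

# Pull out the url field with sed (no jq required) and count how often
# each full URL appears; repeats surface at the top of the list.
sed -n 's/.*"url":"\([^"]*\)".*/\1/p' pages.jsonl | sort | uniq -c | sort -rn
```

If the top counts are greater than 1, the seemingly identical WORK lines really are revisits of the same URL rather than distinct pages hidden by truncation.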

(By the way, excellent, excellent work on this easy-to-use Docker image!)

KathrynN, Mar 05 '22 17:03

+1, but full untruncated URLs might make following the output rather difficult as the lines wrap and jump around.

As an alternative while this is considered, you can tail the collections/<collection>/pages/pages.jsonl file to see the URLs as they're written. If the JSON output is distracting, you can pipe it to jq .url or jq -r .url. Finally, at the cost of a bit more complexity, you can also decode URL-encoded characters as they are written to the screen, for greater intelligibility. Full example:

tail -f pages.jsonl | stdbuf -oL jq .url | { while read i; do echo -e "${i//\%/\\x}"; done; }
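As a side note, the ${i//\%/\\x} plus echo -e trick relies on echo's escape handling, which varies between shells. A slightly more defensive variant of the same idea, sketched with bash's printf '%b' instead of echo -e (the percent-encoded URL below is an invented example, not taken from the crawl above):

```shell
#!/usr/bin/env bash
# Invented percent-encoded URL of the kind the crawler logs.
url='http://example.com/%D1%96%D1%81%D1%82%D0%BE%D1%80%D1%96%D1%8F'

# Rewrite every %XX as \xXX, then let bash's printf '%b' expand the byte
# escapes; this is the same substitution as the one-liner above, minus
# the portability quirks of echo -e.
decoded=$(printf '%b' "${url//\%/\\x}")
printf '%s\n' "$decoded"
```

The same substitution can be dropped into the while-read loop above in place of the echo -e call.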

simonwiles, Mar 17 '22 23:03