browsertrix-crawler
Parsing pages.jsonl can be slow/difficult due to presence of extracted full text
First pointed out in: https://github.com/webrecorder/browsertrix-crawler/issues/74#issuecomment-1087661811
The change in https://github.com/webrecorder/browsertrix-crawler/pull/28 writes extracted full text into pages.jsonl, which makes that file quite large and difficult to parse. We may want to rethink where the extracted text is stored to alleviate this problem.
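
Until the storage question is settled, a consumer-side workaround might look like the sketch below: read pages.jsonl one line at a time and drop the full-text field from each record, so only the small metadata is kept in memory. This is an illustration, not part of browsertrix-crawler. The function name `iter_pages_without_text` is hypothetical, and the field name `text` and the sample records are assumptions about the file layout. Each line is still fully parsed, so this lowers memory use but not the per-line parsing cost.

```python
import json
from typing import Iterable, Iterator


def iter_pages_without_text(
    lines: Iterable[str], text_field: str = "text"
) -> Iterator[dict]:
    """Yield one record per non-empty JSONL line, minus the full-text field.

    `text_field` is an assumption about the key that holds the extracted text.
    """
    for line in lines:
        line = line.strip()
        if not line:
            continue  # tolerate blank lines
        record = json.loads(line)
        record.pop(text_field, None)  # discard the large field if present
        yield record


# Hypothetical sample data standing in for a real pages.jsonl file.
sample = [
    '{"format": "json-pages-1.0", "id": "pages", "title": "All Pages"}',
    '{"id": "1", "url": "https://example.com/", "title": "Example", '
    '"text": "long extracted text ..."}',
]

pages = list(iter_pages_without_text(sample))
```

Because an open file object is itself an iterable of lines, the same function can be applied to a file on disk with `with open("pages.jsonl", encoding="utf-8") as f:` followed by a loop over `iter_pages_without_text(f)`, without loading the whole file at once.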