browsertrix-crawler

Parsing pages.jsonl can be slow/difficult due to presence of extracted full text

Open tw4l opened this issue 2 years ago • 0 comments

First pointed out in: https://github.com/webrecorder/browsertrix-crawler/issues/74#issuecomment-1087661811

https://github.com/webrecorder/browsertrix-crawler/pull/28 writes the extracted full text into pages.jsonl, which makes the file quite large and slow to parse. We may want to rethink where the extracted text is stored to alleviate this problem.
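
Until the storage question is settled, one possible workaround is to stream pages.jsonl line by line and write a slimmed copy without the full text, so later tooling only reads the small file. This is a rough sketch, not anything the crawler provides: it assumes the extracted text lives in a top-level `text` field on each record, and the helper names `iter_pages_without_text` and `write_slim_copy` are hypothetical.

```python
import json
from typing import Iterator, TextIO


def iter_pages_without_text(lines: TextIO, text_field: str = "text") -> Iterator[dict]:
    """Yield one record per JSONL line, with the full-text field dropped."""
    for line in lines:
        line = line.strip()
        if not line:
            continue  # skip blank lines
        record = json.loads(line)
        # Assumption: the extracted text is stored under `text_field`.
        # pop() with a default is a no-op for records that lack it.
        record.pop(text_field, None)
        yield record


def write_slim_copy(src_path: str, dst_path: str, text_field: str = "text") -> int:
    """Stream src_path to dst_path without the text field; return the record count."""
    count = 0
    with open(src_path, encoding="utf-8") as src, \
            open(dst_path, "w", encoding="utf-8") as dst:
        for record in iter_pages_without_text(src, text_field):
            dst.write(json.dumps(record, ensure_ascii=False) + "\n")
            count += 1
    return count
```

Note that each line is still parsed in full once, so this only keeps memory flat during the pass and makes subsequent reads cheap. It does not remove the underlying cost of having the text in the file in the first place.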

tw4l, Jan 19 '23 19:01