browsertrix-crawler Parsing pages.jsonl can be slow/difficult due to presence of extracted full text

Open tw4l opened this issue 2 years ago • 0 comments

First pointed out in: https://github.com/webrecorder/browsertrix-crawler/issues/74#issuecomment-1087661811

https://github.com/webrecorder/browsertrix-crawler/pull/28 writes extracted full text into pages.jsonl, which makes that file large and slow to parse. We may want to rethink where the extracted text is stored to alleviate this problem.

Jan 19 '23 19:01 tw4l
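
Until the storage question is settled, one possible consumer-side stopgap is to stream pages.jsonl one line at a time and write a slimmed copy with the extracted text removed, so later passes work on a much smaller file. The sketch below is only an illustration, not part of browsertrix-crawler: the helper names are hypothetical, and it assumes the extracted text lives under a `text` key in each JSON line. Note that each line is still fully parsed once, so this reduces memory use and the cost of repeated reads, not the cost of the first pass.

```python
import json
from typing import Iterable, Iterator


def iter_pages_without_text(lines: Iterable[str], text_key: str = "text") -> Iterator[dict]:
    """Yield one record per non-empty JSONL line, with the full-text field dropped.

    `text_key` is an assumption about where the extracted text is stored.
    """
    for line in lines:
        line = line.strip()
        if not line:
            continue  # tolerate blank lines
        record = json.loads(line)
        record.pop(text_key, None)  # no-op for lines without extracted text
        yield record


def write_slim_copy(src_path: str, dst_path: str, text_key: str = "text") -> int:
    """Write a copy of a pages.jsonl file without the full text; return the record count."""
    count = 0
    with open(src_path, encoding="utf-8") as src, open(dst_path, "w", encoding="utf-8") as dst:
        # Iterating the file object reads one line at a time, so memory use
        # stays bounded by the largest single record.
        for record in iter_pages_without_text(src, text_key):
            dst.write(json.dumps(record) + "\n")
            count += 1
    return count
```

For example, `write_slim_copy("pages.jsonl", "pages-slim.jsonl")` (both paths are placeholders) would leave the original file untouched and produce a copy that keeps every other field.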
