Use browser-generated text as IndexData

Open rgaudin opened this issue 2 years ago • 3 comments

WACZ includes a pages.jsonl file that contains a text property for every page (~HTML entries) that is extracted from the fully rendered DOM.

Using this as the source for getIndexData() can be a huge boost in quality for dynamic websites (those that build the DOM in JS), compared with the current situation, in which the text is extracted solely from the HTML source code.

This is controlled by the --text option of the crawler.
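
To illustrate the idea, here is a minimal sketch of reading the rendered text out of pages.jsonl. It assumes each line is a standalone JSON object, that page records carry `url` and `text` keys, and that the first line is a header record with no `url`. The function name `load_page_texts` and the sample data are hypothetical and not part of warc2zim.

```python
import io
import json
from typing import Dict, Iterable


def load_page_texts(lines: Iterable[str]) -> Dict[str, str]:
    """Map page URL -> browser-rendered text, from pages.jsonl lines."""
    texts: Dict[str, str] = {}
    for line in lines:
        line = line.strip()
        if not line:
            continue
        record = json.loads(line)
        # Assumption: the header record has no "url", and pages crawled
        # without text extraction have no "text"; both are skipped here.
        url = record.get("url")
        text = record.get("text")
        if url and text:
            texts[url] = text
    return texts


# Hypothetical sample standing in for a real pages.jsonl file.
sample = io.StringIO(
    '{"format": "json-pages-1.0", "id": "pages", "title": "All Pages"}\n'
    '{"id": "1", "url": "https://example.com/", "title": "Home", '
    '"text": "Hello rendered world"}\n'
    '{"id": "2", "url": "https://example.com/empty", "title": "Empty"}\n'
)
page_texts = load_page_texts(sample)
```

Pages without a `text` property are simply left out of the mapping, so a caller can tell "no rendered text available" apart from an empty string.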

From: https://github.com/openzim/warc2zim/issues/81

rgaudin avatar May 31 '23 15:05 rgaudin

Is this really mandatory for 2.0?

benoit74 avatar May 28 '24 12:05 benoit74

We must still keep a fallback to indexing the HTML source code, since we cannot expect pages.jsonl to always be available. warc2zim must work from only a WARC file; pages.jsonl is only available when warc2zim is used in conjunction with Browsertrix Crawler, e.g. in the zimit scraper.

benoit74 avatar Jun 18 '24 09:06 benoit74

I believe this is transparent: if you have index data in pages.jsonl, then you provide it through getIndexData(); if you don't, nothing is set and libzim will index as it currently does.
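
A rough sketch of that fallback, under stated assumptions: `RenderedIndexData` and `index_data_for` are hypothetical names used for illustration only and are not the libzim classes. Returning `None` stands for "do not set index data", leaving the default indexing of the HTML source in place.

```python
from dataclasses import dataclass
from typing import Dict, Optional


@dataclass
class RenderedIndexData:
    """Hypothetical stand-in for an index-data object (not the libzim class)."""

    title: str
    content: str

    def wordcount(self) -> int:
        # Naive whitespace-based word count, good enough for a sketch.
        return len(self.content.split())


def index_data_for(
    url: str, title: str, page_texts: Dict[str, str]
) -> Optional[RenderedIndexData]:
    """Return index data built from rendered text, or None to fall back."""
    text = page_texts.get(url)
    if not text:
        # No pages.jsonl entry (e.g. plain WARC input): leave index data
        # unset so the existing HTML-based indexing applies.
        return None
    return RenderedIndexData(title=title, content=text)
```

With this shape, the caller needs no special case for WARC-only input: an empty mapping makes every lookup return `None`.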

rgaudin avatar Jun 18 '24 10:06 rgaudin