warc2zim

Performance issue linked to new "extensible" HTML rewriting rules

Open benoit74 opened this issue 6 months ago • 0 comments

For a very small WARC like https://github.com/openzim/warc2zim/blob/main/tests/data-special/qsl.net-encoding-alias.warc.gz, it takes more than 2 minutes to build the ZIM.

A flamegraph shows that most of the time is spent in rewrite_html (expected, since the HTML page in this WARC is huge), but within it most of the time is spent in the inspect.signature function.

qsl_flame

This signature information should in fact be cached since it is not going to change during a warc2zim execution.

A quick change (tbc in a PR) confirms that caching this information brings timings back to a reasonable level (less than 20 secs, with a lot of the time spent parsing the HTML, which is expected since the HTML is huge).

qsl_flame_cached
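
For illustration only, here is a minimal sketch of the caching idea, not the actual warc2zim change. It assumes rules are plain functions that are called with only the keyword arguments their signature declares; `cached_signature`, `call_with_supported_args` and `drop_empty_href` are hypothetical names invented for this example. Only `inspect.signature` and `functools.lru_cache` are real standard-library calls.

```python
import functools
import inspect


@functools.lru_cache(maxsize=None)
def cached_signature(func):
    """Return inspect.signature(func), computed once per function object."""
    return inspect.signature(func)


def call_with_supported_args(func, **available):
    """Call func with only the keyword arguments its signature declares."""
    params = cached_signature(func).parameters
    selected = {name: value for name, value in available.items() if name in params}
    return func(**selected)


# Hypothetical rewriting rule: it only asks for the arguments it needs.
def drop_empty_href(attr_name, attr_value):
    if attr_name == "href" and not attr_value:
        return None  # signal that the attribute should be dropped
    return (attr_name, attr_value)


# The dispatcher offers more arguments than the rule declares; extras are ignored.
result = call_with_supported_args(
    drop_empty_href, tag="a", attr_name="href", attr_value="", base_href=None
)
```

With this approach the signature of each rule is inspected once and then served from the cache, which matters when the same rules run for every tag and attribute of a huge HTML page. `cached_signature.cache_info()` can be used to check that repeated calls are hitting the cache.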

benoit74 avatar Aug 05 '24 12:08 benoit74