warc2zim
Performance issue linked to new "extensible" HTML rewriting rules
For a very small WARC like https://github.com/openzim/warc2zim/blob/main/tests/data-special/qsl.net-encoding-alias.warc.gz, it takes more than 2 minutes to build the ZIM.
A flamegraph shows that most of the time is spent in the rewrite_html function (expected, since the HTML page in this WARC is huge), but within it most of the time is spent in the inspect.signature function.
This signature information should in fact be cached, since it is not going to change during a warc2zim execution.
A quick change (tbc in a PR) confirms that caching this information brings the build back to reasonable timings (less than 20 seconds, with a lot of that time spent parsing the HTML, which is expected since the HTML is huge).
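
To illustrate the idea, here is a minimal sketch of caching signature lookups with functools.lru_cache. It is not the actual warc2zim code: the names cached_parameter_names, call_rule and rewrite_href are hypothetical, and it assumes rules are plain functions called with only the keyword arguments they declare.

```python
import inspect
from functools import lru_cache


@lru_cache(maxsize=None)
def cached_parameter_names(func):
    """Return the parameter names of func, inspecting its signature only once."""
    return frozenset(inspect.signature(func).parameters)


def call_rule(func, **available):
    """Call func with only the keyword arguments it declares (hypothetical helper)."""
    wanted = cached_parameter_names(func)
    return func(**{k: v for k, v in available.items() if k in wanted})


# Hypothetical rewriting rule, for illustration only.
def rewrite_href(attr_name, attr_value):
    return f"{attr_name}={attr_value}"
```

With this approach, inspect.signature runs once per rule function instead of once per HTML tag or attribute, which is where the repeated cost would otherwise come from on a huge page.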