Speed up import by pre-generating punctuation list
While profiling the startup performance of a project that depends on Mistletoe, I noticed that mistletoe.core_tokens took over 100ms to import. On further inspection, that's because it iterates over every Unicode code point just to find all the punctuation characters.
That's simple and obviously correct, but it is needlessly slow. That a scan over more than a million code points finishes in roughly 100ms at all is a testament to the performance of modern computers. I replaced it with an uglier, but much faster to load, hardcoded set of characters.
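For readers who want to reproduce the cost, here is a minimal sketch of the kind of scan described above, together with one way to turn its result into a literal that can be pasted into a module. It is not mistletoe's actual code: the names scan_punctuation and PUNCTUATION are hypothetical, and it assumes "punctuation" means every code point whose Unicode general category starts with "P".

```python
import sys
import time
import unicodedata


def scan_punctuation():
    """Collect every character whose Unicode general category starts with 'P'."""
    return {
        chr(i)
        for i in range(sys.maxunicode + 1)
        if unicodedata.category(chr(i)).startswith("P")
    }


# Time the full scan; this is the work that would otherwise run on every import.
start = time.perf_counter()
punctuation = scan_punctuation()
elapsed_ms = (time.perf_counter() - start) * 1000
print(f"found {len(punctuation)} punctuation characters in {elapsed_ms:.0f} ms")

# Emit a one-line literal that could be pasted into a module instead of the scan.
# PUNCTUATION is a hypothetical name chosen for this sketch.
literal = "PUNCTUATION = frozenset(" + repr("".join(sorted(punctuation))) + ")"
```

Loading the pasted literal only requires building a set from one string constant, which is where the import-time saving would come from.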
@dvdkon, thanks for your contribution.
In theory, provided that parsing and executing a longer piece of Python code is faster than running the original code, this PR could really make mistletoe faster. Yet when I run python test/benchmark.py mistletoe on my Windows machine, I can see no performance gain. So maybe the optimization takes effect only under specific circumstances?
Another problem I see is that if the Unicode "punctuation" category changes (I don't know how often that happens), we would also need to re-generate the hard-coded list.
> It looks like theoretically, provided that parsing and executing a longer Python code is faster than doing the same with the original code, this PR could really make mistletoe faster. Yet, when I run python test/benchmark.py mistletoe on my Windows machine, I can see no performance gain. So maybe the optimization possibly takes effect only under specific circumstances?
I took a look at benchmark.py and I think you don't see any benefit because the import of mistletoe only happens once, outside the measured region.
The simplest way to see the benefit is to measure how long import mistletoe takes, with maybe time python3 -c 'import mistletoe' on Linux or Measure-Command { python3 -c "import mistletoe" } in PowerShell on Windows (untested, taken from SO :) ).
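The same measurement can be done from Python itself, which avoids shell differences between Linux and Windows. This is only a sketch: time_import is a hypothetical helper, the figure it returns includes interpreter startup, and "json" is used as a stand-in so the snippet runs without mistletoe installed.

```python
import subprocess
import sys
import time


def time_import(module, runs=5):
    """Return the best wall-clock time in seconds to start Python and import `module`.

    Each run uses a fresh interpreter, so nothing is cached in sys.modules.
    The result includes interpreter startup time, not just the import.
    """
    best = float("inf")
    for _ in range(runs):
        start = time.perf_counter()
        subprocess.run([sys.executable, "-c", f"import {module}"], check=True)
        best = min(best, time.perf_counter() - start)
    return best


# "json" is a stand-in; pass "mistletoe" to measure the import discussed here.
print(f"import json: {time_import('json') * 1000:.0f} ms")
```

For a per-module breakdown, python -X importtime -c "import mistletoe" (Python 3.7 and later) prints the time spent in each imported module to stderr.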
@dvdkon, of course, I've overlooked that. :) So it really is much faster on my computer - before: ~350ms, after: ~140ms.
I wonder what other users of the mistletoe library think about this performance benefit. Is it relevant to their use cases?
And another question remains - whether it is "safe" to have pre-generated Unicode punctuation. How do the alternative solutions approach this? Possibly at least add a visible note to the markdown documentation?
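One possible safeguard, sketched here as a suggestion rather than anything this PR contains, is a test that compares the hard-coded set against a fresh scan with the unicodedata module of the running Python. The function names are hypothetical, and the "stale" set below is fabricated only to show what a mismatch looks like.

```python
import sys
import unicodedata


def scan_punctuation():
    """Recompute the punctuation set from the running interpreter's Unicode data."""
    return {
        chr(i)
        for i in range(sys.maxunicode + 1)
        if unicodedata.category(chr(i)).startswith("P")
    }


def diff_against_unicodedata(hardcoded):
    """Return (missing, extra) relative to the interpreter's Unicode database."""
    current = scan_punctuation()
    hardcoded = set(hardcoded)
    return current - hardcoded, hardcoded - current


# Simulate an outdated hard-coded list by dropping one character on purpose.
stale = scan_punctuation() - {"!"}
missing, extra = diff_against_unicodedata(stale)
print("Unicode data version:", unicodedata.unidata_version)
print("missing:", sorted(missing), "extra:", sorted(extra))
```

Because the scan would run only in the test suite, it would not affect import time. Note that the result depends on the Unicode version bundled with the interpreter running the test, so different Python versions may disagree about a few characters.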