commonmark-hs
Improve performance
See notes on performance in the README.md.
What I've tried
- rewriting to operate directly on Text instead of tokenizing first
- rewriting to operate directly on Text, using megaparsec instead of parsec, with its fast primitives such as takeWhileP
- rewriting to use ByteString instead of Text in the Toks
None of this achieved any speed improvement over the current version using [Tok]; indeed, in every case performance was worse.
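To make the comparison concrete, here is a minimal sketch of the two styles: tokenizing into a list first, versus working on Text directly. The `Tok` and `TokType` definitions and the functions `tokenize`, `restOfLineToks`, and `restOfLineText` are simplified stand-ins written for illustration, not the library's actual code, and the sketch makes no claim about which style is faster.

```haskell
import Data.Char (isAlphaNum)
import Data.Text (Text)
import qualified Data.Text as T

-- Hypothetical, simplified token kinds (an assumption for illustration).
data TokType = Spaces | LineEnd | WordChars | Symbol Char
  deriving (Show, Eq)

data Tok = Tok { tokType :: TokType, tokContents :: Text }
  deriving (Show, Eq)

-- Split the input into runs of spaces, runs of alphanumeric characters,
-- line endings, and single-character symbols.
tokenize :: Text -> [Tok]
tokenize t = case T.uncons t of
  Nothing -> []
  Just (c, rest)
    | c == ' '     -> run Spaces (== ' ')
    | c == '\n'    -> Tok LineEnd (T.singleton c) : tokenize rest
    | isAlphaNum c -> run WordChars isAlphaNum
    | otherwise    -> Tok (Symbol c) (T.singleton c) : tokenize rest
  where
    -- The first character already satisfies p, so the run is non-empty.
    run ty p = let (xs, ys) = T.span p t in Tok ty xs : tokenize ys

-- Token-based style: take tokens up to and including the next line end.
restOfLineToks :: [Tok] -> ([Tok], [Tok])
restOfLineToks ts = case break ((== LineEnd) . tokType) ts of
  (line, nl : rest) -> (line ++ [nl], rest)
  (line, [])        -> (line, [])

-- Text-based style: the same operation performed directly on Text.
restOfLineText :: Text -> (Text, Text)
restOfLineText t = T.splitAt (T.length (T.takeWhile (/= '\n') t) + 1) t
```

Both `restOfLine` variants return the consumed line (including its line ending, if any) paired with the remaining input, so they can be compared on the same data.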
Profiling reveals that block structure parsing is fast. Most of the time is taken up by tokenize and restOfLine (31%), and by inline parsing.
Instructions for profiling
make prof
Current results (March 12 2020):
1.8 parseChunks
2.1 pDelimChunk
2.2 Commonmark.Blocks.runInlineParser
2.5 blockContinues
2.6 Commonmark.Inlines.processBs
2.9 MAIN
3.9 block_starts
6.6 renderHtml
9.0 pSymbol
11.9 defaultInlineParser
17.5 Commonmark.Tokens.tokenize
32.6 restOfLine
Benchmarks for different extensions, for a 1.4MB file:
| extension | mean |
|---|---|
| -xautolinks | 310.8 ms (309.3 ms .. 311.3 ms) |
| -xpipe_tables | 295.2 ms (293.2 ms .. 296.6 ms) |
| -xstrikethrough | 267.9 ms (265.6 ms .. 269.1 ms) |
| -xsuperscript | 267.8 ms (264.9 ms .. 269.5 ms) |
| -xsubscript | 266.8 ms (263.6 ms .. 267.9 ms) |
| -xsmart | 293.0 ms (292.0 ms .. 294.3 ms) |
| -xmath | 287.4 ms (285.4 ms .. 290.7 ms) |
| -xemoji | 281.6 ms (280.3 ms .. 282.8 ms) |
| -xfootnotes | 291.3 ms (286.1 ms .. 293.3 ms) |
| -xdefinition_lists | 272.6 ms (271.0 ms .. 275.4 ms) |
| -xfancy_lists | 271.2 ms (269.3 ms .. 273.8 ms) |
| -xattributes | 284.2 ms (283.4 ms .. 285.7 ms) |
| -xraw_attribute | 280.7 ms (279.6 ms .. 281.6 ms) |
| -xbracketed_spans | 268.5 ms (267.0 ms .. 269.4 ms) |
| -xfenced_divs | 269.6 ms (267.5 ms .. 271.6 ms) |
| -xauto_identifiers | 274.9 ms (273.0 ms .. 277.8 ms) |
| -ximplicit_heading_references | 269.8 ms (268.2 ms .. 272.8 ms) |
| -xall | 520.4 ms (515.5 ms .. 523.6 ms) |
One idea to explore: use ShortText from text-short package instead of Text in Tok.
The public API could still use Text.
This should reduce the memory used by the tokens.
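A rough sketch of how this could look, assuming the text-short package is available. The `Tok` and `TokType` definitions and the helpers `mkTok`, `tokText`, and `untokenize` are hypothetical and simplified for illustration. They are not the library's definitions. Only `fromText` and `toText` come from `Data.Text.Short`. Tokens store ShortText internally, and conversion to and from Text happens only at the boundary.

```haskell
import Data.Text (Text)
import qualified Data.Text as T
import Data.Text.Short (ShortText)
import qualified Data.Text.Short as TS

-- Hypothetical, simplified token kinds (an assumption for illustration).
data TokType = Spaces | LineEnd | WordChars | Symbol Char
  deriving (Show, Eq)

-- Token whose contents are stored as ShortText rather than Text.
data Tok = Tok { tokType :: !TokType, tokContents :: !ShortText }
  deriving (Show, Eq)

-- Build a token from Text; fromText copies the contents into a ShortText.
mkTok :: TokType -> Text -> Tok
mkTok ty = Tok ty . TS.fromText

-- Expose a token's contents as Text again at the API boundary.
tokText :: Tok -> Text
tokText = TS.toText . tokContents

-- Reassemble the source text from a token list.
untokenize :: [Tok] -> Text
untokenize = T.concat . map tokText
```

One trade-off to measure: a Text value can be a slice that shares the input buffer, whereas each ShortText holds its own copy of the bytes, so any saving would come from the smaller per-value overhead and would need to be confirmed by profiling.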