elm-format
Improve performance
Currently formatting 400 files takes ~2 seconds on a 2015 MBP. Can that time be cut in half?
Are the files used in this "benchmark" available anywhere to look at? (Maybe there's an outlier.)
Also, how does it happen that you format 400 files at once? Is it just for stress test purposes, or is there a workflow that formats the whole project at once? (I always assumed it's one file, on save.)
Nope, the files aren't available. So maybe the first task is to assemble a test suite. (Maybe bulk formatting the existing test suite would do?)
On the set of files I have, there is no single file that seems slow.
This happens when bulk-formatting all files in a directory, but also when running elm-format --validate, either on CI or in scripts. I think this is also a valid use case for teams that want to run elm-format as a commit hook instead of on-save in their editors.
As indicated in https://github.com/avh4/elm-format/issues/408#issuecomment-478240826 I have a particular case with a somewhat large Elm file that is taking elm-format around 13 seconds to format.
I've tested this from the terminal using the following command:
> measure-command { npx elm-format .\src\Pages\PersonalInfoPage.elm --yes }
This yields the following result:
...
TotalSeconds : 13.8024885
...
Running the same command on a different Elm source file that is only 55 lines long takes 767 milliseconds. While that is still slow, it's at least in the realm of feasibility for an on-save editor experience.
What I'm wondering is whether there has been a performance regression on Windows specifically, or something along those lines, in recent versions. I say this because I remember running elm-format in the past with no problems, so either this is the first time I'm working with Elm source files this large, or the performance of elm-format has regressed since the earlier versions I've used. The results above are from elm-format 0.8.1. Note that Evan has recently been advising people to start with a single file instead of splitting up files prematurely, so it's possible that other developers (perhaps only a subset, on specific platforms) will begin running into these issues more.
This is the profiling result of elm-format on elm-spa-example:
elm-format +RTS -p -RTS src/ src/Api/Endpoint.elm src/Article/Body.elm src/Article/Comment.elm src/Article/Feed.elm src/Article/Slug.elm src/Article/Tag.elm src/Page/Article.elm src/Page/Article/Editor.elm src/Page/Blank.elm src/Page/Home.elm src/Page/Login.elm src/Page/NotFound.elm src/Page/Profile.elm src/Page/Register.elm src/Page/Settings.elm --yes
total time = 2.85 secs (2847 ticks @ 1000 us, 1 processor)
total alloc = 3,382,876,296 bytes (excludes profiling overheads)
COST CENTRE MODULE SRC %time %alloc
string Text.Parsec.Char src/Text/Parsec/Char.hs:153:1-51 17.0 28.1
>>= Text.Parsec.Prim src/Text/Parsec/Prim.hs:272:5-28 15.2 15.4
mplus Text.Parsec.Prim src/Text/Parsec/Prim.hs:364:5-34 14.2 5.6
satisfy Text.Parsec.Char src/Text/Parsec/Char.hs:(142,1)-(144,71) 6.0 8.3
parsecMap.\ Text.Parsec.Prim src/Text/Parsec/Prim.hs:258:7-48 5.9 6.1
uncons Text.Parsec.Prim src/Text/Parsec/Prim.hs:(461,5)-(462,40) 4.6 7.5
labels.\ Text.Parsec.Prim src/Text/Parsec/Prim.hs:(433,5)-(438,39) 3.9 3.2
*> Text.Parsec.Prim src/Text/Parsec/Prim.hs:263:5-39 3.7 5.0
parserReturn.\ Text.Parsec.Prim src/Text/Parsec/Prim.hs:309:7-30 2.3 3.1
updateParserState.\ Text.Parsec.Prim src/Text/Parsec/Prim.hs:(814,5)-(815,34) 2.2 2.2
manyAccum.\.walk Text.Parsec.Prim src/Text/Parsec/Prim.hs:(683,9)-(688,41) 1.4 2.9
try.\ Text.Parsec.Prim src/Text/Parsec/Prim.hs:552:5-34 1.2 0.0
labels Text.Parsec.Prim src/Text/Parsec/Prim.hs:(431,1)-(445,48) 1.2 0.0
parsecMap Text.Parsec.Prim src/Text/Parsec/Prim.hs:(256,1)-(258,48) 1.1 0.0
labels.\.eerr' Text.Parsec.Prim src/Text/Parsec/Prim.hs:436:9-51 0.9 2.0
manyAccum.\ Text.Parsec.Prim src/Text/Parsec/Prim.hs:(683,5)-(689,61) 0.9 1.5
The most costly functions (both in terms of time and memory) are basic parsing primitives. Not a big surprise, but that means all of elm-format's formatting logic is actually fast (enough). I also tested this for some other projects. The pattern is consistent (parsing takes most time/memory).
My theory for why parsing is slow is that Parsec does exactly what the new Elm parser was built to avoid: copying from the parsed buffer. For instance, a floating point number with a sign and exponent is effectively parsed as read $ concat [ "-", "123", ".", "32", "e", "10" ], so many small strings are allocated, then joined and handed to read. The Elm parser works directly on a ByteString and does some unsafe indexing to read/parse the floating point value in place, avoiding those allocations.
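To make that concrete, here's a minimal sketch (not elm-format's or the compiler's actual code) contrasting the two styles: the Parsec-style parser allocates several short-lived Strings per literal before handing them to read, while the ByteString-style scanner only walks indices and accumulates the value in place.

```haskell
import qualified Data.ByteString.Char8 as BS
import Text.Parsec
import Text.Parsec.String (Parser)

-- Parsec style: each piece becomes its own String, the pieces are
-- concatenated, and the result is handed to 'read' -- several small
-- allocations per literal. (Exponent signs etc. are omitted for brevity.)
float :: Parser Double
float = do
  sign  <- option "" (string "-")
  whole <- many1 digit
  frac  <- option "" ((:) <$> char '.' <*> many1 digit)
  expo  <- option "" ((:) <$> oneOf "eE" <*> many1 digit)
  return (read (concat [sign, whole, frac, expo]))

-- ByteString style (roughly the idea behind Elm's parser, but using the
-- safe API here): walk the buffer by index and accumulate the value
-- directly, without building intermediate strings.
digitsToInt :: BS.ByteString -> Int -> Int -> Int
digitsToInt src start end = go start 0
  where
    go i acc
      | i >= end  = acc
      | otherwise = go (i + 1) (acc * 10 + (fromEnum (BS.index src i) - fromEnum '0'))
```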
So from that it seems like a good idea to pull in elm's parser code. I'd be happy to take a look at this but it's a big project and requires coordination.
> So from that it seems like a good idea to pull in elm's parser code. I'd be happy to take a look at this but it's a big project and requires coordination.
Yep, I definitely agree. An attempt was started to switch to the new parser; the plan was to recreate Parsec's primitives on top of the new parser first, so that switching to the new primitives could then be done incrementally. That work was started here: https://github.com/avh4/elm-format/tree/parser19
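To illustrate the idea of that compatibility layer (purely hypothetical names here, not the actual code on the parser19 branch): Parsec's primitives get re-expressed in terms of one low-level, cursor-based primitive, so existing call sites can stay unchanged while the backend is swapped out underneath them.

```haskell
-- Hypothetical sketch only: 'P' and 'takeChar' stand in for the new
-- parser's core type and primitive; they are not elm/compiler names.
newtype P a = P { runP :: (Int, String) -> Maybe (a, (Int, String)) }

-- the single low-level primitive: look at the next character and advance
takeChar :: P Char
takeChar = P $ \(offset, input) ->
  case input of
    c : rest -> Just (c, (offset + 1, rest))
    []       -> Nothing

-- Parsec-compatible combinators layered on top of 'takeChar', so code
-- written against the Parsec API keeps compiling during the migration
satisfy :: (Char -> Bool) -> P Char
satisfy p = P $ \s ->
  case runP takeChar s of
    Just (c, s') | p c -> Just (c, s')
    _                  -> Nothing

char :: Char -> P Char
char c = satisfy (== c)
```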
However, there's currently a huge refactoring going on on the upgrade-tool branch, so I would recommend waiting for that to get merged (I'm hoping to finish it in the next 2 weeks) before trying to continue with the parser upgrade.
Also, compared to Elm's parser, elm-format has to retain comments and also accepts a few syntax errors that it will automatically correct. It's very likely that some of those parser changes have introduced unnecessary backtracking, as I haven't been as rigorous as Evan about optimizing parser performance, so even after migrating away from Parsec there will likely still be improvements that can be made to the parsing.
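As a hedged illustration of the kind of backtracking this can introduce (not code taken from elm-format): a try over a whole keyword re-scans the shared prefix whenever it fails, whereas factoring the prefix out commits after a single character.

```haskell
import Text.Parsec
import Text.Parsec.String (Parser)

-- Backtracking version: if "import" fails partway through, 'try' rewinds
-- and the shared "i" gets scanned again for the next alternative.
keywordBacktracking :: Parser String
keywordBacktracking = try (string "import") <|> string "in"

-- Factored version: the shared prefix is consumed exactly once, and the
-- parser commits to a branch as soon as the next character is seen.
keywordFactored :: Parser String
keywordFactored = do
  _    <- char 'i'
  rest <- string "mport" <|> string "n"
  return ('i' : rest)
```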
Another route would be to use tree-sitter-elm with the Haskell bindings.
Pro:
- Tokens for comments
- Tokens for significant whitespace
- Performance optimized
- Resilient (will produce an AST even if some leaves contain invalid code)
- Should be easy to use from Haskell with the provided bindings
Contra:
- Does not track every piece of whitespace (but that info should still be recoverable from the parsed nodes by looking at node starts and ends)
- Will be a big refactor, as the API is probably significantly different from the earlier compiler's
Regarding tree-sitter (or other parsing methods), the AST data types should provide an interface that allows alternate parser implementations to be written and tested without having to refactor the existing parser or the formatting code. (Though at the moment those AST data types are the subject of the current refactor on the upgrade-tool branch.) The upgrade-tool refactor is also making the AST types more general so that, for instance, the same types and functions will be able to be used both with the standard AST and with modified ASTs (for example, the tree-sitter case where there might be unparsable chunks of text at certain locations in the AST).
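To sketch what such an interface could look like (the type and function names below are placeholders, not the actual types on the upgrade-tool branch): the formatting code only depends on a "source text to AST" function, so a Parsec, native, or tree-sitter backend could be plugged in behind it.

```haskell
import Data.Text (Text)

-- Placeholder stand-ins for the real AST and error types.
data Module     = Module
data ParseError = ParseError

-- Any parser backend only has to provide a function of this shape.
type ParseModule = Text -> Either ParseError Module

-- The formatter is written against the interface, not a concrete parser.
format :: ParseModule -> Text -> Either ParseError Text
format parse source =
  case parse source of
    Left err  -> Left err
    Right ast -> Right (render ast)
  where
    -- placeholder for the real pretty-printer
    render :: Module -> Text
    render _ = mempty
```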
Another contra for tree-sitter I did not think about before is that I only wrote the parser to handle 0.19.0. If we wanted to use tree-sitter for 0.18 as well, we would either need to relax the parser or write an additional 0.18.0 parser.