rustfmt icon indicating copy to clipboard operation
rustfmt copied to clipboard

Parallel formatting to increase speed

Open MarcusGrass opened this issue 1 year ago • 1 comments

Multithreaded parallelism

Closes https://github.com/rust-lang/rustfmt/issues/6091

This is non-trivial because when internals can run threaded, eagerly printing cannot happen, messages have to be passed back through buffers. Even that would be non-trivial if there wasn't color-term writing happening making things more complicated.

The change became pretty big, even though the only substantive change is in print.rs and bin/main.rs, because the printer has to be propagated deep into the library, about half of the changes are adding a function argument, and reformatting caused by those lines becoming too long. I'm sorry about that.

Changes

Output facilities

In practice there are four printing facilities used:

  1. Regular stdout, pretty easy, replace println with printing into a buffer.
  2. Regular stderr, same as above in all ways that matter.
  3. Term stdout, this happens in the diff-printing part of the code.
  4. Term stderr, this is done by rustc_error and the most complex to integrate.

Additionally, these four facilities can't be separated, since they have to preserve order between each other.

Rename StdoutEmitter

Confusing naming, it doesn't output into stdout, but into the provided buffer.

Pros

This change brings a substantial speedup, especially for large projects, the speedup becoming less and less but not none or negative for smaller projects.

If instant measures as the average reaction time of 250ms, this brings down tokio from the not-instant ~260ms, to the instant 86ms.

Giving space for projects almost 3x tokio's size to have an "instant" formatting experience.

Results locally on tokio with diferent parallelism:

1: Around 260ms, within noise of the master branch.
2: Around 160ms, 39% less time spent than the master branch.
4: Around 110ms, 58% less time spent than the master branch.
32: Around 85ms, 67% less time spent than the master branch.

The dramatic effect with just 2 cores means that this change doesn't just have an effect if you've got a threadripper, I don't have the data, but I would imagine that most users have 2 or more cores and would benefit.

Drawbacks

There are some drawbacks to consider

Late printing memory increase

All messages that would have gone into stdout or stderr and been flushed eagerly now go into an in-mem buffer. This causes a memory-overhead which scales with project size (number of files). For a large project (went from 1700ms to about 500ms with this change in cargo fmt --all -time), this is a difference of 117M on the master version, and 170M on this version (~+45%).

Rustc termcolor vendoring

Another problem is the way that rustc vendors termcolor. It vendors some types but no printing facilities, and the vendored code isn't interoperable with the published termcolor library, which means that either the printing facilities need to be reinvented (no), or some compatibility-code needs to be introduced (yes). Hopefully there's some better solution to this, but it's liveable at the moment see the rustc_compat* functions in print.rs.

Accidentally messing with outputs will cause an output jumble

This change would make any direct interaction with stdout and stderr in the formatting codepath a bug. Printing deep inside a library is usually inadvisable, but now that would be upgraded to a bug.

Some testing

Apart from the usual tests, I wanted to make sure that the rustc_errors printing is the same, pretty easy, just remove a semicolon: image And diffing: image Although there are tests for those already.

Lastly I did a test modifying a bunch of different files in a project and then ran cargo fmt --all --check, the order within-file is preserved, but the order in which diffs are printed is "random".

It's intended, and the reason for that is because output is printed by file in completion-order. It could await the join handles in order, and get the same order as the singe-threaded implementation, but then if there are more files than available parallelism, full thread utilization won't happen, which in practice measurably impacts performance on large projects.

It's possible to do a half-measure and wait in handle-order for the remaining files after all tasks have been scheduled, this would preserve ordering in projects with fewer files than available parallelism, but that doesn't seem very impactful to me, since available parallelism likely differs a lot.

MarcusGrass avatar Feb 26 '24 12:02 MarcusGrass

Thanks for the PR. Going to put this on hold until the team has had a chance to discuss this change https://github.com/rust-lang/rustfmt/issues/6091#issuecomment-1964539411.

ytmimi avatar Feb 26 '24 16:02 ytmimi