zinc icon indicating copy to clipboard operation
zinc copied to clipboard

New analysis format

Open szeiger opened this issue 1 year ago • 5 comments

A new implementation of Zinc's incremental state serialization.

  • Full structural serialization (like the existing protobuf format), no shortcuts with sbinary or Java serialization (like the existing text format).
  • A single implementation that supports an efficient binary format for production use and a text format for development and debugging.
  • Consistent output files: If two compiler runs result in the same internal representation of incremental state (after applying WriteMappers), they produce identical zinc files. This is important for build tools like Bazel where skipping a build entirely when the outputs are identical is much cheaper than having to run Zinc to perform the incremental state analysis.
  • Smaller output files than the existing binary format.
  • Faster serialization and deserialization than the existing binary format.
  • Smaller implementation than either of the existing formats.
  • Optional unsorted output that trades consistency and small file sizes for much faster writing.

Benchmark data based on scala-library + reflect + compiler:

Write time Read time File size
sbt Text 1002 ms 791 ms ~ 7102 kB
sbt Binary 654 ms 277 ms ~ 6182 kB
ConsistentBinary 157 ms 100 ms 3097 kB
ConsistentBinary (unsorted) 79 ms ~ 3796 kB

This PR makes the new format available via the new ConsistentFileAnalysisStore. It does not replace the existing formats (but it should; it's a better choice for almost every use case).

We have been using iterations of this format internally over the last few months for the Bazel build (with our own Zinc-based tooling) of our two main monorepos totaling about 27000 Scala (+ Java/mixed) targets ranging in size from a few LOC to almost 1 million LOC.

szeiger avatar Jan 05 '24 18:01 szeiger

The 3 large binary files clearly don't belong into this repo but I don't know what's the right place for them. They are used by the integration tests and the benchmarks. In our internal repo we keep such files in S3 instead of checking them into git.

szeiger avatar Jan 05 '24 18:01 szeiger

The 3 large binary files clearly don't belong into this repo but I don't know what's the right place for them. They are used by the integration tests and the benchmarks. In our internal repo we keep such files in S3 instead of checking them into git.

Maybe we can store them via Git LFS?

I have never used Git LFS, so ideally people with prior experience with it can judge if this is indeed a good idea.

Friendseeker avatar Jan 05 '24 22:01 Friendseeker

I don't understand why the benchmarks are failing in CI. I added the new class to the existing benchmark project so it should be found. I was not able to reproduce the problem locally. The benchmark commands from CI run just fine on my machine.

szeiger avatar Jan 10 '24 15:01 szeiger

Oh, it's the "benchmark against develop branch" step that fails. This seems to be the usual undercompilation bug of sbt-jmh. It usually needs an explicit clean when you remove a benchmark. Switching to the develop branch after building from this PR would cause it. This problem should go away after merging, but explicitly cleaning the jmh project would be a better fix.

szeiger avatar Jan 10 '24 15:01 szeiger

How about : https://github.com/HebiRobotics/QuickBuffers and https://github.com/google/flatbuffers

For the formats? seems fast too.

He-Pin avatar Jan 18 '24 06:01 He-Pin

bump @eed3si9n, are there any blockers from merging this? We've been using this internally with good success, and expect it would provide a lot of value to the broader community to have this upstreamed for SBT and Mill and other build tools to use

lihaoyi-databricks avatar Mar 18 '24 01:03 lihaoyi-databricks

bump @eed3si9n, are there any blockers from merging this? We've been using this internally with good success, and expect it would provide a lot of value to the broader community to have this upstreamed for SBT and Mill and other build tools to use

Overall I am onboard with more stable / hermetic analysis serializer. My main line of concern was where the speedup gain was coming from (ExecutionContext.global) and if it would end up competing for CPU attention in non-Bazel usage. Now that ExecutionContext is a parameter, hopefully we can tweak how much thread we assign to this vs other tasks.

eed3si9n avatar Mar 18 '24 02:03 eed3si9n

@eed3si9n would you have time to do a 1.10.0-M4 release that includes this PR?

lrytz avatar Apr 05 '24 08:04 lrytz

Yea. I'll try to get something out this weekend.

eed3si9n avatar Apr 05 '24 13:04 eed3si9n

https://github.com/sbt/zinc/releases/tag/v1.10.0-RC1 is on its way to Maven Central.

eed3si9n avatar Apr 08 '24 09:04 eed3si9n