protobuf
protobuf copied to clipboard
Enable the use of [SU]Int32Size and EnumSize templates for AArch64
Hi,
When benchmarking proto_benchmark from fleetbench on an AArch64 target we found that clang is able to vectorize these functions and they offer better performance than the scalar alternative.
I ran //src/google/protobuf:arena_unittest on aarch64-none-linux-gnu. Should I run any other tests? Also protobuf used to have its own set of benchmarks, but I can't find these when I query all targets with bazel. Let me know if you'd like me to run anything else, I couldn't find instructions on what the full test run is.
The benchmarks are now located in https://github.com/google/fleetbench
Rebased this as I thought that might be the cause for the 'Mergeable 1/1 Fail(s): LABEL', but that still seems to be there. Any idea what that's trying to tell me?
And @ckennelly, thanks. I used proto_benchmark off that repo to benchmark protobuf and it's how I found out we weren't using this path for Arm. The reason I asked about benchmarks was because I remembered the protobuf-repo ones were slightly different (I think they had smaller messages).
mergable fail is a safeguard to prevent google engineers from accidentally merging it without doing internal testing
For fleetbench we see 1-1.5% improvement on G2A instances
name old cpu/op new cpu/op delta
BM_Protogen_Arena 19.1ms ± 2% 18.9ms ± 3% -1.48% (p=0.000 n=36+45)
BM_Protogen_NoArena 20.6ms ± 1% 20.4ms ± 3% -1.18% (p=0.000 n=30+44)
For int32 size microbenchmarks we have something like 2x speed up
BM_RepeatedFieldSize/1 2.67ns ± 0% 5.39ns ± 0% +101.67% (p=0.000 n=495+552)
BM_RepeatedFieldSize/8 10.6ns ± 0% 6.5ns ± 1% -38.57% (p=0.000 n=515+581)
BM_RepeatedFieldSize/64 63.8ns ± 0% 30.0ns ± 0% -52.98% (p=0.000 n=550+438)
BM_RepeatedFieldSize/512 490ns ± 0% 216ns ± 0% -55.84% (p=0.000 n=563+547)
BM_RepeatedFieldSize/1k 981ns ± 0% 430ns ± 0% -56.15% (p=0.000 n=545+506)
Godbolt looks great. We managed to get a combination of cmhi and sub with usra when needed for >>31
. This is compared to clz + usra in the previous version. https://gcc.godbolt.org/z/W9bGd81KM
LGTM from the arm code generation side. Thanks a lot!
For int32 size microbenchmarks we have something like 2x speed up
BM_RepeatedFieldSize/1 2.67ns ± 0% 5.39ns ± 0% +101.67% (p=0.000 n=495+552)
How do you run the BM_RepeatedFieldSize Benchmarks?
For int32 size microbenchmarks we have something like 2x speed up
BM_RepeatedFieldSize/1 2.67ns ± 0% 5.39ns ± 0% +101.67% (p=0.000 n=495+552)
How do you run the BM_RepeatedFieldSize Benchmarks?
They are internal to us as we did not want to expose gbench at the time, I guess. Not sure about the current state, we probably should expose them.
The benchmark is simple, it just runs Int32Size on a random repeated field. We located it in protobuf/wire_format_unittest.cc
@avieira-arm could you please rebase on main? Sorry for the trouble, you've caught us in the middle of a migration to GitHub Actions.
I've rebased it but I've not been able to rebuild proto_benchmark due to some bazel stuff, so I'm hoping the CI can check whether it builds fine here.
While I have your attention, maybe you can help me resolve the issue I have with building the proto benchmark in fleetbench.
I build it overriding the com_google_protobuf, com_google_absl and com_google_tcmalloc repository's using local ones (that are unchanged, other than this protobuf patch). I get an error complaining the C++ version is too old, and when I check the build commands with --verbose_failures I see '-std=c++0x' is beeing passed. I can't find this option being added in any of the local repositories so I have to assume it is being added by some bazel rule that is being downloaded. I'm not super experienced with bazel and I've only done a minimal level of looking into this for now, in the past I hacked up the config files in abseil to get me past the errors, but the projects now actually use C++14 stuff so making these changes is no longer viable.
Looks like none of the tests are running because of some missing 'secrets'. I suspect this is an internal thing too?
I keep getting emails about workflows failing to run. I don't think theres much I can do about that though, can someone please confirm.
It would be nice to get this merged or dropped if you don't want it, but the benchmarks seem to indicate that it would be a desirable change :)
Rebased again.
sorry to be a pain but you are synced to a bad point and we need another rebase
Rebased again. Let me know if you still want this.
We definitely want this, this should be a very safe change. LGTM from me
@fowles
Sorry for all the delays on this one!