protobuf Enable the use of [SU]Int32Size and EnumSize templates for AArch64

Hi,

When benchmarking proto_benchmark from fleetbench on an AArch64 target we found that clang is able to vectorize these functions and they offer better performance than the scalar alternative.

I ran //src/google/protobuf:arena_unittest on aarch64-none-linux-gnu. Should I run any other tests? Also protobuf used to have its own set of benchmarks, but I can't find these when I query all targets with bazel. Let me know if you'd like me to run anything else, I couldn't find instructions on what the full test run is.

Nov 30 '22 22:11 avieira-arm

The benchmarks are now located in https://github.com/google/fleetbench

Dec 01 '22 16:12 ckennelly

Rebased this as I thought that might be the cause for the 'Mergeable 1/1 Fail(s): LABEL', but that still seems to be there. Any idea what that's trying to tell me?

And @ckennelly, thanks. I used proto_benchmark off that repo to benchmark protobuf and it's how I found out we weren't using this path for Arm. The reason I asked about benchmarks was because I remembered the protobuf-repo ones were slightly different (I think they had smaller messages).

Jan 06 '23 17:01 avieira-arm

mergable fail is a safeguard to prevent google engineers from accidentally merging it without doing internal testing

Jan 06 '23 17:01 fowles

For fleetbench we see 1-1.5% improvement on G2A instances

name                 old cpu/op  new cpu/op  delta
BM_Protogen_Arena    19.1ms ± 2%  18.9ms ± 3%  -1.48%  (p=0.000 n=36+45)
BM_Protogen_NoArena  20.6ms ± 1%  20.4ms ± 3%  -1.18%  (p=0.000 n=30+44)

For int32 size microbenchmarks we have something like 2x speed up

BM_RepeatedFieldSize/1        2.67ns ± 0%  5.39ns ± 0%  +101.67%      (p=0.000 n=495+552)
BM_RepeatedFieldSize/8        10.6ns ± 0%   6.5ns ± 1%   -38.57%      (p=0.000 n=515+581)
BM_RepeatedFieldSize/64       63.8ns ± 0%  30.0ns ± 0%   -52.98%      (p=0.000 n=550+438)
BM_RepeatedFieldSize/512       490ns ± 0%   216ns ± 0%   -55.84%      (p=0.000 n=563+547)
BM_RepeatedFieldSize/1k        981ns ± 0%   430ns ± 0%   -56.15%      (p=0.000 n=545+506)

Godbolt looks great. We managed to get a combination of cmhi and sub with usra when needed for >>31. This is compared to clz + usra in the previous version. https://gcc.godbolt.org/z/W9bGd81KM

LGTM from the arm code generation side. Thanks a lot!

Jan 08 '23 16:01 danlark1

For int32 size microbenchmarks we have something like 2x speed up
BM_RepeatedFieldSize/1        2.67ns ± 0%  5.39ns ± 0%  +101.67%      (p=0.000 n=495+552)

How do you run the BM_RepeatedFieldSize Benchmarks?

Jan 09 '23 13:01 avieira-arm

For int32 size microbenchmarks we have something like 2x speed up
BM_RepeatedFieldSize/1        2.67ns ± 0%  5.39ns ± 0%  +101.67%      (p=0.000 n=495+552)
How do you run the BM_RepeatedFieldSize Benchmarks?

They are internal to us as we did not want to expose gbench at the time, I guess. Not sure about the current state, we probably should expose them.

The benchmark is simple, it just runs Int32Size on a random repeated field. We located it in protobuf/wire_format_unittest.cc

Jan 09 '23 15:01 danlark1

@avieira-arm could you please rebase on main? Sorry for the trouble, you've caught us in the middle of a migration to GitHub Actions.

Feb 16 '23 21:02 haberman

I've rebased it but I've not been able to rebuild proto_benchmark due to some bazel stuff, so I'm hoping the CI can check whether it builds fine here.

While I have your attention, maybe you can help me resolve the issue I have with building the proto benchmark in fleetbench.

I build it overriding the com_google_protobuf, com_google_absl and com_google_tcmalloc repository's using local ones (that are unchanged, other than this protobuf patch). I get an error complaining the C++ version is too old, and when I check the build commands with --verbose_failures I see '-std=c++0x' is beeing passed. I can't find this option being added in any of the local repositories so I have to assume it is being added by some bazel rule that is being downloaded. I'm not super experienced with bazel and I've only done a minimal level of looking into this for now, in the past I hacked up the config files in abseil to get me past the errors, but the projects now actually use C++14 stuff so making these changes is no longer viable.

Feb 20 '23 14:02 avieira-arm

Looks like none of the tests are running because of some missing 'secrets'. I suspect this is an internal thing too?

Feb 23 '23 10:02 avieira-arm

I keep getting emails about workflows failing to run. I don't think theres much I can do about that though, can someone please confirm.

It would be nice to get this merged or dropped if you don't want it, but the benchmarks seem to indicate that it would be a desirable change :)

Mar 09 '23 11:03 avieira-arm

Rebased again.

Apr 03 '23 15:04 avieira-arm

sorry to be a pain but you are synced to a bad point and we need another rebase

Apr 25 '23 18:04 fowles

Rebased again. Let me know if you still want this.

May 15 '23 18:05 avieira-arm

We definitely want this, this should be a very safe change. LGTM from me

@fowles

May 15 '23 18:05 danlark1

Sorry for all the delays on this one!

May 17 '23 14:05 fowles

protobuf protobuf copied to clipboard

Enable the use of [SU]Int32Size and EnumSize templates for AArch64

protobuf
protobuf copied to clipboard