Optimize any repeated calls to proto unmarshal and get signers in ante handlers
First, I would aggregate via a quick grep where we have unmarshal and GetSigners calls. Then I would probably use a generic cache to cache those values, such that the cache can be passed between ante decorators.
There are other similar cases; for example, in ethermint we have duplicated GetParams calls, which are not shared because they live in different decorators.
A generalized way would be to have an ante-handler "manager" that simply contains the entire chain and a cache. Then any call could look like:
v, ok := cache.Get(getParamsKey)
if ok {
    params, ok = v.(Params)
    if !ok {
        // error
    }
} else {
    params = GetParams()
    // set cache
}
Not pretty, but it's generalized.
Let's focus on the changes from GetSigners, as this may be a performance regression.
@elias-orijtech any chance you can prioritise this?
Sure. I think I need more detail, so I'll bring it up in today's call.
Thanks @elias-orijtech! We can keep it simple for now.
Is this urgent, or can it wait for my return from vacation start of August?
It's kind of urgent to benchmark this, maybe @odeke-em can jump in? Don't want to hold you from vacation.
I took a look at this today, but got stuck on the deep dependency on pgregory.net/rapid. Details below.
At @kocubinski's suggestion, I'm aiming to convert the aminojson equivalence tests into benchmarks. For example,
https://github.com/cosmos/cosmos-sdk/blob/6afece635cef9f8e044a04ce67d06e55ca300249/tests/integration/tx/aminojson/aminojson_test.go#L94
however, the rapidproto.MessageGenerator depends on rapid for generating values, and rapid is designed for tests, not for benchmarks. rapid.Generator[T] does export an Example call that fits our needs, but unfortunately calling it results in a nil pointer panic when our rapidproto generator calls T.Helper. One such trace is:
panic: runtime error: invalid memory address or nil pointer dereference
[signal SIGSEGV: segmentation violation code=0x2 addr=0x50 pc=0x10526e4c8]
goroutine 12 [running]:
pgregory.net/rapid.(*T).Helper(0x0?)
<autogenerated>:1 +0x28
gotest.tools/v3/assert.NilError({0x1079ae0c0, 0x140002f0300}, {0x0?, 0x0}, {0x0, 0x0, 0x0})
/Users/a/go/pkg/mod/gotest.tools/[email protected]/assert/assert.go:174 +0x64
github.com/cosmos/cosmos-proto/rapidproto.GeneratorOptions.genAny({{0x140006d0d00, 0x1, 0x1}, 0x14000e10ae0, {0x107998310, 0x1400011e3f0}, 0x0, 0x1, {0x109019750, 0x1, ...}}, ...)
/Users/a/go/pkg/mod/github.com/cosmos/[email protected]/rapidproto/rapidproto.go:267 +0x1b8
github.com/cosmos/cosmos-proto/rapidproto.GeneratorOptions.setFieldValue({{0x140006d0d00, 0x1, 0x1}, 0x14000e10ae0, {0x107998310, 0x1400011e3f0}, 0x0, 0x1, {0x109019750, 0x1, ...}}, ...)
/Users/a/go/pkg/mod/github.com/cosmos/[email protected]/rapidproto/rapidproto.go:159 +0x4cc
github.com/cosmos/cosmos-proto/rapidproto.GeneratorOptions.setFields({{0x140006d0d00, 0x1, 0x1}, 0x14000e10ae0, {0x107998310, 0x1400011e3f0}, 0x0, 0x1, {0x109019750, 0x1, ...}}, ...)
/Users/a/go/pkg/mod/github.com/cosmos/[email protected]/rapidproto/rapidproto.go:105 +0x3b4
github.com/cosmos/cosmos-proto/rapidproto.GeneratorOptions.setFieldValue({{0x140006d0d00, 0x1, 0x1}, 0x14000e10ae0, {0x107998310, 0x1400011e3f0}, 0x0, 0x1, {0x109019750, 0x1, ...}}, ...)
/Users/a/go/pkg/mod/github.com/cosmos/[email protected]/rapidproto/rapidproto.go:162 +0x424
github.com/cosmos/cosmos-proto/rapidproto.GeneratorOptions.setFields({{0x140006d0d00, 0x1, 0x1}, 0x14000e10ae0, {0x107998310, 0x1400011e3f0}, 0x0, 0x1, {0x109019750, 0x1, ...}}, ...)
/Users/a/go/pkg/mod/github.com/cosmos/[email protected]/rapidproto/rapidproto.go:105 +0x3b4
github.com/cosmos/cosmos-proto/rapidproto.MessageGenerator[...].func1()
/Users/a/go/pkg/mod/github.com/cosmos/[email protected]/rapidproto/rapidproto.go:20 +0xc0
pgregory.net/rapid.(*customGen[...]).maybeValue(0x104f6928c, 0x1090af3c0?)
/Users/a/proj/orijtech/rapid/combinators.go:50 +0x64
pgregory.net/rapid.find[...](0x140010337c0?, 0x140002f0280, 0x5)
/Users/a/proj/orijtech/rapid/combinators.go:111 +0x80
pgregory.net/rapid.(*customGen[...]).value(0x104f5c1cc?, 0x140000b6888?)
/Users/a/proj/orijtech/rapid/combinators.go:36 +0x48
pgregory.net/rapid.(*Generator[...]).value(0x0, 0x140002f0280?)
/Users/a/proj/orijtech/rapid/generator.go:74 +0x64
pgregory.net/rapid.recoverValue[...](...)
/Users/a/proj/orijtech/rapid/generator.go:118
pgregory.net/rapid.example[...](0x0?, 0x1079ae120?)
/Users/a/proj/orijtech/rapid/generator.go:105 +0x2c
pgregory.net/rapid.(*Generator[...]).Example(0x1079cfc60, {0x140000b6b30, 0x1?, 0x140000b6bf8?})
/Users/a/proj/orijtech/rapid/generator.go:87 +0xf0
github.com/cosmos/cosmos-sdk/tests/integration/tx/aminojson.BenchmarkGetSignBytes.func1(0x14000447b80?)
/Users/a/proj/orijtech/cosmos-sdk/tests/integration/tx/aminojson/benchmark_test.go:66 +0x128
testing.(*B).runN(0x14000447b80, 0x1)
/nix/store/bvihxp3hnnvpjilgqw1srfqm4dyx6xa5-go-1.20.4/share/go/src/testing/benchmark.go:193 +0x12c
testing.(*B).run1.func1()
/nix/store/bvihxp3hnnvpjilgqw1srfqm4dyx6xa5-go-1.20.4/share/go/src/testing/benchmark.go:233 +0x50
created by testing.(*B).run1
/nix/store/bvihxp3hnnvpjilgqw1srfqm4dyx6xa5-go-1.20.4/share/go/src/testing/benchmark.go:226 +0x90
This doesn't seem immediately fixable, because rapid anonymously embeds the *testing.TB, presumably because testing.T.Helper is sensitive to the call stack.
If anyone has any ideas on how to proceed, I have some time tomorrow as well.
Why do we need to use rapid at all for benchmarks?
We don't, but rapidproto is what conveniently generates messages that I assume are close to the production messages. Is there another, more suitable package for generating such messages?
I would just generate a message or two by hand. What we really want to see here is the time saved by repeated calls on the same method(s) on a given tx, so I don't imagine fuzzing a message type will show us a great deal of insight.
To avoid duplication, could we not do something at compile time where we find all the messages and their signers, so that when messages come in we already know which value is the signer? This may help avoid any performance regression. cc @kocubinski
@odeke-em can you post your findings here so we can close this issue?
Thanks @tac0turtle!
Synopsis
So, some good news: from examining a bunch of profiles, we don't have regressions from repeated calls to proto.Unmarshal, and we actually see some improvements in the antehandler calls.
Deep dive
From examining x/auth/ante.TestAnteHandlerMultiSigner in v0.47.0, there were convoluted calls back and forth to ChainAnteDecorators from a bunch of other AnteHandlers, because every antehandler invokes .Next. Notice the arrows to and from types.ChainAnteDecorators showing repeated calls back and forth?
Whereas in the current main, we've got only 2 repeated calls back and forth, with ante.SigGasConsumeDecorator and ante.SigVerificationDecorator.AnteHandle as the only invoked antehandlers.
I believe that change comes from this improvement:
// increment sequence of all signers
signers, err := sigTx.GetSigners()
if err != nil {
    return sdk.Context{}, err
}
https://github.com/cosmos/cosmos-sdk/commit/82659a7477abf83e466df353b3a019ba90fe2b35#diff-b94baf116534b68c0bc48cb6f in which, if the Tx's .GetSigners method returns a non-nil error, it doesn't process any further. Indeed, the profile picture shows no other processing happens except calling the next ante handler, so that's actually a great improvement in the latest release.
The only places I could see successive calls to proto.Unmarshal were in the obvious place: types.Any.Unpack*. But those signatures have been like that since forever: we first have to invoke proto.Unmarshal(bz, anySave) to get the TypeURL, which is then passed into the interfaceRegistry to match which value should be unpacked, hence the second proto.Unmarshal invocation. In fact, as I showed previously, the new code doesn't unnecessarily call every antehandler.
Whole transaction-lifetime profiling through simapp
I profiled simapp runs of simapp.TestFullAppSimulation and examined plenty of profiles between v0.47.0 and v0.50.X. There were 3 extra proto.Unmarshal calls in v0.50.X (139 calls) vs v0.47.0 (136 calls), but I didn't notice any repeated calls to proto.Unmarshal out of the ordinary.
Verdict
I believe that we can go forward with the current release! However, @elias-orijtech and I should and shall be vigilant about collecting and examining continuous profiles, plus some continuous audits for performance improvements. The change shaves the number of AnteHandlers called from 12 down to 2 just in that call for DeliverMsgs.