cosmos-sdk icon indicating copy to clipboard operation
cosmos-sdk copied to clipboard

Optimize any repeated calls to proto unmarshal and get signers in ante handlers

Open tac0turtle opened this issue 2 years ago • 12 comments

tac0turtle avatar Jun 28 '23 14:06 tac0turtle

First, I would aggregate via a quick grep where we have unmarshal and GetSigners calls. Then I would probably use a generic cache to cache those values s.t. the cache can be passed between ante decorators.

alexanderbez avatar Jun 29 '23 14:06 alexanderbez

there are other similar cases, for example in ethermint we have duplicated GetParams calls, which are not shared because in different decorators.

yihuang avatar Jun 30 '23 01:06 yihuang

A generalized way would be to have an ante-handler "manager" that simple contains the entire chain and a cache. Then any call could look like:

v, ok := cache.Get(getParamsKey)
if ok {
  params, ok = v.(Params)
  if !ok {
    // error
  }
} else {
  params = GetParams()
  // set cache
}

Not pretty, but it's generalized.

alexanderbez avatar Jun 30 '23 16:06 alexanderbez

lets focus on the changes from getsigner as this may a performance regression.

@elias-orijtech any chance you can prioritise this?

tac0turtle avatar Jul 05 '23 07:07 tac0turtle

Sure. I think I need more detail, so I'll bring it up in today's call.

elias-orijtech avatar Jul 06 '23 10:07 elias-orijtech

Thanks @elias-orijtech! We can keep it simple for now.

alexanderbez avatar Jul 06 '23 16:07 alexanderbez

Is this urgent, or can it wait for my return from vacation start of August?

elias-orijtech avatar Jul 10 '23 11:07 elias-orijtech

its kind of urgent to benchmark this, maybe @odeke-em can jump in? dont want to hold you from vacation

tac0turtle avatar Jul 10 '23 12:07 tac0turtle

I took a look at this today, but got stuck on the deep dependency on pgregory.net/rapid. Details below.

At @kocubinski's suggestion, I'm aiming for converting aminojson tests for equivalence to benchmarks. For example,

https://github.com/cosmos/cosmos-sdk/blob/6afece635cef9f8e044a04ce67d06e55ca300249/tests/integration/tx/aminojson/aminojson_test.go#L94

however, the rapidproto.MessageGenerator depends on rapid for generating values, and rapid is designed for tests, not for benchmarks. rapid.Generator[T] does export an Example call that fits our needs, but unfortunately calling that results in a nil pointer panic when our rapidproto generator calls T.Helper. One such trace is

panic: runtime error: invalid memory address or nil pointer dereference
[signal SIGSEGV: segmentation violation code=0x2 addr=0x50 pc=0x10526e4c8]

goroutine 12 [running]:
pgregory.net/rapid.(*T).Helper(0x0?)
	<autogenerated>:1 +0x28
gotest.tools/v3/assert.NilError({0x1079ae0c0, 0x140002f0300}, {0x0?, 0x0}, {0x0, 0x0, 0x0})
	/Users/a/go/pkg/mod/gotest.tools/[email protected]/assert/assert.go:174 +0x64
github.com/cosmos/cosmos-proto/rapidproto.GeneratorOptions.genAny({{0x140006d0d00, 0x1, 0x1}, 0x14000e10ae0, {0x107998310, 0x1400011e3f0}, 0x0, 0x1, {0x109019750, 0x1, ...}}, ...)
	/Users/a/go/pkg/mod/github.com/cosmos/[email protected]/rapidproto/rapidproto.go:267 +0x1b8
github.com/cosmos/cosmos-proto/rapidproto.GeneratorOptions.setFieldValue({{0x140006d0d00, 0x1, 0x1}, 0x14000e10ae0, {0x107998310, 0x1400011e3f0}, 0x0, 0x1, {0x109019750, 0x1, ...}}, ...)
	/Users/a/go/pkg/mod/github.com/cosmos/[email protected]/rapidproto/rapidproto.go:159 +0x4cc
github.com/cosmos/cosmos-proto/rapidproto.GeneratorOptions.setFields({{0x140006d0d00, 0x1, 0x1}, 0x14000e10ae0, {0x107998310, 0x1400011e3f0}, 0x0, 0x1, {0x109019750, 0x1, ...}}, ...)
	/Users/a/go/pkg/mod/github.com/cosmos/[email protected]/rapidproto/rapidproto.go:105 +0x3b4
github.com/cosmos/cosmos-proto/rapidproto.GeneratorOptions.setFieldValue({{0x140006d0d00, 0x1, 0x1}, 0x14000e10ae0, {0x107998310, 0x1400011e3f0}, 0x0, 0x1, {0x109019750, 0x1, ...}}, ...)
	/Users/a/go/pkg/mod/github.com/cosmos/[email protected]/rapidproto/rapidproto.go:162 +0x424
github.com/cosmos/cosmos-proto/rapidproto.GeneratorOptions.setFields({{0x140006d0d00, 0x1, 0x1}, 0x14000e10ae0, {0x107998310, 0x1400011e3f0}, 0x0, 0x1, {0x109019750, 0x1, ...}}, ...)
	/Users/a/go/pkg/mod/github.com/cosmos/[email protected]/rapidproto/rapidproto.go:105 +0x3b4
github.com/cosmos/cosmos-proto/rapidproto.MessageGenerator[...].func1()
	/Users/a/go/pkg/mod/github.com/cosmos/[email protected]/rapidproto/rapidproto.go:20 +0xc0
pgregory.net/rapid.(*customGen[...]).maybeValue(0x104f6928c, 0x1090af3c0?)
	/Users/a/proj/orijtech/rapid/combinators.go:50 +0x64
pgregory.net/rapid.find[...](0x140010337c0?, 0x140002f0280, 0x5)
	/Users/a/proj/orijtech/rapid/combinators.go:111 +0x80
pgregory.net/rapid.(*customGen[...]).value(0x104f5c1cc?, 0x140000b6888?)
	/Users/a/proj/orijtech/rapid/combinators.go:36 +0x48
pgregory.net/rapid.(*Generator[...]).value(0x0, 0x140002f0280?)
	/Users/a/proj/orijtech/rapid/generator.go:74 +0x64
pgregory.net/rapid.recoverValue[...](...)
	/Users/a/proj/orijtech/rapid/generator.go:118
pgregory.net/rapid.example[...](0x0?, 0x1079ae120?)
	/Users/a/proj/orijtech/rapid/generator.go:105 +0x2c
pgregory.net/rapid.(*Generator[...]).Example(0x1079cfc60, {0x140000b6b30, 0x1?, 0x140000b6bf8?})
	/Users/a/proj/orijtech/rapid/generator.go:87 +0xf0
github.com/cosmos/cosmos-sdk/tests/integration/tx/aminojson.BenchmarkGetSignBytes.func1(0x14000447b80?)
	/Users/a/proj/orijtech/cosmos-sdk/tests/integration/tx/aminojson/benchmark_test.go:66 +0x128
testing.(*B).runN(0x14000447b80, 0x1)
	/nix/store/bvihxp3hnnvpjilgqw1srfqm4dyx6xa5-go-1.20.4/share/go/src/testing/benchmark.go:193 +0x12c
testing.(*B).run1.func1()
	/nix/store/bvihxp3hnnvpjilgqw1srfqm4dyx6xa5-go-1.20.4/share/go/src/testing/benchmark.go:233 +0x50
created by testing.(*B).run1
	/nix/store/bvihxp3hnnvpjilgqw1srfqm4dyx6xa5-go-1.20.4/share/go/src/testing/benchmark.go:226 +0x90

This doesn't seem immediately fixable, because rapid anonymously embeds the *testing.TB, presumably because testing.T.Helper is sensitive the call stack.

If anyone has any ideas on how to proceed, I have some time tomorrow as well.

elias-orijtech avatar Jul 11 '23 14:07 elias-orijtech

Why do we need to use rapid at all for benchmarks?

alexanderbez avatar Jul 11 '23 15:07 alexanderbez

We don't, but rapidproto is what conveniently generates messages that I assume is close the production messages. Is there another, more suitable, package for generating such messages?

elias-orijtech avatar Jul 12 '23 09:07 elias-orijtech

I would just generate a message or two by hand. What we really want to see here is the time saved by repeated calls on the same method(s) on a given tx, so I don't imagine fuzzing a message type will show us a great deal of insight.

alexanderbez avatar Jul 12 '23 17:07 alexanderbez

to avoid duplication could we not do something at compile time where we find all the messages and their signers so when messages come in we already know which value is the signer, this may help avoid any performance regression. cc @kocubinski

tac0turtle avatar Jul 16 '23 19:07 tac0turtle

@odeke-em can you post your findings here so we can close this issue

tac0turtle avatar Aug 11 '23 09:08 tac0turtle

Thanks @tac0turtle!

Synopsis

So some good news, from examining a bunch of profiles, we don't have regressions from repeated calls to proto.Unmarshal and we actually some improvements in the antehandler calls

Deep dive

From examining x/auth/ante.TestAnteHandlerMultiSigner in v0.47.0, there was a convolution and calls back and forth to ChainAnteDecorators from a bunch of other AnteHandlers due to every antehandler invoking .Next. Notice the arrows to and from types.ChainAnteDecorators showing repeated calls back and forth? image whereas in the current main, we've got only 2 repeated calls back and forth for ante.SigGasConsumeDecorator and ante.SigVerificationDecorator.AnteHandle as the only invoked antehandlers image

That change I believe comes from this improvement

         .          .    367:	// increment sequence of all signers
         .          .    368:	signers, err := sigTx.GetSigners()
         .          .    369:	if err != nil {
         .          .    370:		return sdk.Context{}, err
         .          .    371:	}

https://github.com/cosmos/cosmos-sdk/commit/82659a7477abf83e466df353b3a019ba90fe2b35#diff-b94baf116534b68c0bc48cb6f in which if the Tx's .GetSigners method returns a non-nil error it doesn't process any more and indeed the profile picture shows no other processing happens except calling the next ante handler so that's actually a great improvement in the latest release.

The only places I could see successive calls to proto.Unmarshal were in the obvious place: types.Any.Unpack* but those signatures have been like that since forever since we firstly have to invoke proto.Unmarshal(bz, anySave) to get the TypeURL which is then passed into the interfaceRegistry to match which value should be unpacked hence the second proto.Unmarshal invocation; in fact as I showed previously the new code doesn't unnecessarily call every antehandler

Wholesome transaction's lifetime profiling through simapp

I profiled simapp runs for simapp.TestFullAppSimulation, examined results and plenty of profiles between v0.47.0 and v0.50.X, there were 3 extra proto.Unmarshal calls in v0.50.X (139 calls) vs v0.47.0 (136 calls) but I didn't notice any repeated calls to proto.Unmarshal out of the ordinary

Verdict

I believe that we can go forth with the current release! However myself and @elias-orijtech should and shall be vigilant about getting continuous profiles and examining them then some continuous audits for performance improvements, it shaves the number of AnteHandlers called from 12 down to 2 just in that call for DeliverMsgs

odeke-em avatar Aug 13 '23 12:08 odeke-em