cosmos-sdk
[Feature]: Move away from Regexp parsing of denom validity / decimal amounts
Summary
Right now we use regular expressions for parsing coin denoms & decimal amounts. This has a couple of drawbacks:
- More expensive runtime initialization (compiling the regex at init)
- Generally slower than other direct methods
It would be nice to replace this with a functionally equivalent direct method. This blog post describes a tool, "ragel", that could even code-generate the matcher: https://dgryski.medium.com/speeding-up-regexp-matching-with-ragel-4727f1c16027 , though honestly a dead-simple for loop would probably perform better and be faster to write, as we do very state-minimal matching.
Problem Definition
This makes coin denom matching faster, and pushes more work to compile time rather than runtime initialization.
Proposal
Change the regexp usage in https://github.com/cosmos/cosmos-sdk/blob/main/types/coin.go#L833-L837 to direct matching methods.
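For context, the current check is roughly the following (the pattern is an approximation of the one compiled at init in `types/coin.go`; the function name here is illustrative, not the SDK's):

```go
package main

import (
	"fmt"
	"regexp"
)

// Approximation of the denom pattern compiled at package init in
// types/coin.go; compiling it is the startup cost discussed below.
var reDenom = regexp.MustCompile(`^[a-zA-Z][a-zA-Z0-9/:._-]{2,127}$`)

// validateDenomRegex mirrors the current regex-based check.
func validateDenomRegex(denom string) bool {
	return reDenom.MatchString(denom)
}

func main() {
	fmt.Println(validateDenomRegex("uatom")) // true: letter, then valid chars
	fmt.Println(validateDenomRegex("1bad"))  // false: must start with a letter
}
```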
Thanks @ValarDragon!
> More expensive runtime initialization (compiling the regex at init)
This is sorta moot, no? Since it's a single fixed cost.
> Generally slower than other direct methods
Yes, totally agree regex is not ideal for matching in terms of perf in hot paths of the code, which coins are! When you say dead simple loop, wdym exactly?
> This is sorta moot, no? Since it's a single fixed cost.
It increases the cost of every CLI invocation, or the load time of any library importing the SDK. It's moot for state machine performance, not for other contexts. (I think this is still a minor cost in the grand scheme of things, FWIW)
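One way to defer that fixed cost (not eliminate it) is lazy compilation, so CLI invocations that never touch a denom skip it entirely. A minimal sketch using `sync.OnceValue` (Go 1.21+; names are illustrative):

```go
package main

import (
	"fmt"
	"regexp"
	"sync"
)

// compileDenomRe moves regex compilation from package init to first use.
// Subsequent calls return the cached *regexp.Regexp.
var compileDenomRe = sync.OnceValue(func() *regexp.Regexp {
	return regexp.MustCompile(`^[a-zA-Z][a-zA-Z0-9/:._-]{2,127}$`)
})

func validateDenom(denom string) bool {
	return compileDenomRe().MatchString(denom)
}

func main() {
	fmt.Println(validateDenom("uosmo")) // true
	fmt.Println(validateDenom("x"))     // false: shorter than 3 characters
}
```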
> Yes, totally agree regex is not ideal for matching in terms of perf in hot paths of the code, which coins are! When you say dead simple loop, wdym exactly?
The simplest looks something like this:
```go
import (
	"unicode"
	"unicode/utf8"
)

// isValidRune reports whether r may appear after the first character.
// Note: unicode.IsLetter/IsDigit accept more than the ASCII classes in
// the current regex; an exact port would restrict to [a-zA-Z0-9].
func isValidRune(r rune) bool {
	return unicode.IsLetter(r) || unicode.IsDigit(r) || r == '/' || r == ':' || r == '.' || r == '_' || r == '-'
}

func isValidDenomination(s string) bool {
	length := len(s)
	if length < 3 || length > 128 {
		return false
	}
	// Decode the first rune properly: rune(s[0]) would misinterpret the
	// leading byte of a multi-byte UTF-8 sequence.
	firstRune, size := utf8.DecodeRuneInString(s)
	if firstRune == utf8.RuneError || !unicode.IsLetter(firstRune) {
		return false
	}
	for _, r := range s[size:] {
		if !isValidRune(r) {
			return false
		}
	}
	return true
}
```
But on looking at it, it probably is better to just code generate something faster with ragel
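For reference, a rough harness comparing the regex against such a loop might look like the sketch below (names and the test denom are illustrative; a real comparison would use `testing.B` benchmarks, and this ASCII-only loop mirrors the regex character class exactly):

```go
package main

import (
	"fmt"
	"regexp"
	"time"
)

var reDenom = regexp.MustCompile(`^[a-zA-Z][a-zA-Z0-9/:._-]{2,127}$`)

// loopValid is a hand-rolled, ASCII-only equivalent of the regex above.
func loopValid(s string) bool {
	if len(s) < 3 || len(s) > 128 {
		return false
	}
	c := s[0]
	if !(('a' <= c && c <= 'z') || ('A' <= c && c <= 'Z')) {
		return false
	}
	for i := 1; i < len(s); i++ {
		c := s[i]
		switch {
		case 'a' <= c && c <= 'z', 'A' <= c && c <= 'Z', '0' <= c && c <= '9',
			c == '/', c == ':', c == '.', c == '_', c == '-':
		default:
			return false
		}
	}
	return true
}

func main() {
	// An IBC-style denom, as a realistic worst case for length.
	const denom = "ibc/27394FB092D2ECCD56123C74F36E4C1F926001CEADA9CA97EA622B25F41E5EB2"
	const n = 100000

	start := time.Now()
	for i := 0; i < n; i++ {
		reDenom.MatchString(denom)
	}
	reDur := time.Since(start)

	start = time.Now()
	for i := 0; i < n; i++ {
		loopValid(denom)
	}
	loopDur := time.Since(start)

	fmt.Println("regex:", reDur, "loop:", loopDur,
		"agree:", reDenom.MatchString(denom) == loopValid(denom))
}
```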
> It increases the cost of every CLI invocation, or the load time of any library importing the SDK. It's moot for state machine performance, not for other contexts. (I think this is still a minor cost in the grand scheme of things, FWIW)
Yeah, perhaps not moot but certainly not the overwhelming culprit.
LGTM!
We're seeing in Osmosis block sync that ValidateDenom is taking about 1% of the time. (And this is after we've spent hours removing denom validation from hot loops.)
Can you provide a profile? I'm curious as to why it's so costly? Once the regex is compiled (single time fixed cost), matching should be efficient, no?
It's mentioned a bit in the linked blog post:
> How can you speed up regular expression matching in Go? As Rob Pike points out, the benefit of regular expressions is their dynamic nature. If you know what you're going to be matching, you can do better. Sometimes doing better just means a for loop with an if check. (Rob's comments on regular expressions go into more detail: https://commandcenter.blogspot.ca/2011/08/regular-expressions-in-lexing-and.html )
In general you want to avoid regexes in hot loops on really well-structured data / for simple things.
Here's a link to the profile for ValidateDenom:
This is in a 1000-block sync, with a not-that-high number of swaps. On the state machine side of block processing, we took ~700 seconds. (We've reduced I/O demands a decent amount, which is why I was measuring this out of 600 seconds when saying 1% before.)
Each of our swaps does around 10 balance updates, in addition to using coins internally in operations. (There's also protorev increasing the number of swap attempts, but not simulating the balance movement, so those are just Coin calls within swaps.)
We also have things like SubUnlockedCoins, which should never be calling regexes on the denoms but is actually calling them 7 times.
I see, this makes a ton of sense given we know what the structure (denom) of the data looks like. Let's move to a more performant for loop.