Optimize transaction generators
In their downstream uses in `ouroboros-consensus`, we've found the Shelley transaction generators are slow. From this excerpt of the `.prof` summary, it looks like they're allocating a lot and hence are slow.
```
COST CENTRE MODULE                                   SRC                                                              %time %alloc
choose      Test.QuickCheck.Gen                      Test/QuickCheck/Gen.hs:130:1-59                                   28.3   36.6
genTx       Test.Shelley.Spec.Ledger.Generator.Utxo  src/Test/Shelley/Spec/Ledger/Generator/Utxo.hs:(170,1)-(237,86)    4.5    4.2
shuffle     Test.QuickCheck.Gen                      Test/QuickCheck/Gen.hs:(217,1)-(219,55)                            3.9    6.9
```
By accident, we didn't retain enough info to identify the specific problematic calls, but `genTx` is at least incriminated. Here's the excerpt from the `.prof` body:
```
COST CENTRE MODULE                                   SRC                                                              no.     entries %time %alloc %time %alloc
genTx       Test.Shelley.Spec.Ledger.Generator.Utxo  src/Test/Shelley/Spec/Ledger/Generator/Utxo.hs:(170,1)-(237,86)  111572   19825    4.5    4.2  47.2   66.1
shuffle     Test.QuickCheck.Gen                      Test/QuickCheck/Gen.hs:(217,1)-(219,55)                          111593  133890    3.9    6.9  36.7   56.8
choose      Test.QuickCheck.Gen                      Test/QuickCheck/Gen.hs:130:1-59                                  111599       0   28.1   36.4  30.2   42.2
```
So it's the `shuffle`s in `genTx`. They're allocating "too much".
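For context, QuickCheck's `shuffle` permutes the entire input list, so a pattern along the lines of the following (a hedged guess at the general shape, not a quote from `genTx`) allocates a full-length shuffled copy even when only a few elements are kept:

```haskell
import Test.QuickCheck (Gen, shuffle)

-- Illustrative only: shuffling the whole list and keeping a prefix allocates
-- a permutation of all of `xs` just to retain k elements.
takeFewByShuffling :: Int -> [a] -> Gen [a]
takeFewByShuffling k xs = take k <$> shuffle xs
```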
At a glance, I think several of the uses of `shuffle` could be manually fused. For example:
https://github.com/input-output-hk/cardano-ledger-specs/blob/befbf160ed840d2d6ad843902c9f668507dc7838/shelley/chain-and-ledger/shelley-spec-ledger-test/src/Test/Shelley/Spec/Ledger/Generator/Utxo.hs#L559-L560
In this one, the `utxo` list is probably quite large, and then we immediately project out just a few elements (I think `maxNumGenInputs` is just `10` here). So just generating 10 indices without replacement and then projecting those would avoid massive allocation (I think).
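To illustrate the shape of that change, here's a minimal QuickCheck sketch (the helper names `pickDistinctIndices` and `takeAtIndices` are made up for this example and aren't the ledger's API): it picks a handful of distinct indices and projects them out, rather than shuffling the whole list and taking a prefix.

```haskell
import qualified Data.Set as Set
import Test.QuickCheck (Gen, choose)

-- Hypothetical helper (not the ledger's actual code): draw up to k distinct
-- indices into a collection of size n by rejection sampling. Fine when k is
-- small relative to n, as with maxNumGenInputs = 10 against a large utxo.
pickDistinctIndices :: Int -> Int -> Gen (Set.Set Int)
pickDistinctIndices k n = go Set.empty
  where
    target = min k n
    go acc
      | Set.size acc >= target = pure acc
      | otherwise = do
          i <- choose (0, n - 1)
          go (Set.insert i acc)

-- Project out the chosen positions in a single pass over the list, so we only
-- allocate the (at most k) selected elements rather than a shuffled copy of
-- the whole list.
takeAtIndices :: Set.Set Int -> [a] -> [a]
takeAtIndices ixs xs = [x | (i, x) <- zip [0 ..] xs, i `Set.member` ixs]
```

With something like this, the call site could be roughly `takeAtIndices <$> pickDistinctIndices maxNumGenInputs (length utxoList) <*> pure utxoList` instead of taking a prefix of `shuffle utxoList`; the list is still traversed once, but we no longer allocate an n-element shuffled copy just to keep 10 entries.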
Hopefully there are similarly straightforward opportunities for the other `shuffle` uses and whatever else follow-up profiling identifies.
@uroboros added a change which gave a 2x speedup here, but it seems that the re-organisation caused a 10x slowdown. So we still have more to do here.
To avoid potential confusion: the 10x number I fed to Clarke was measured as the effect on the (downstream) Shelley ThreadNet tests' runtime -- I don't have intuition for how a 2x upstream improvement will manifest downstream. It could be good, but I'm guessing it'll be deflated (similar reasoning to Amdahl's Law).