Improve base weights consistency and make sure they're never zero
Fixes https://github.com/paritytech/substrate/issues/11804
Here's my attempt at fixing the way we generate our base weights. The fix is relatively straightforward:
- Grab the minimum time it takes to run the extrinsic.
- From all of the benchmark results subtract that minimum.
- Run the linear regression with the intercept value forced to y=0.
- Set the previously calculated minimum time as the base weight.
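Roughly, the steps above amount to something like the following. This is a simplified, hypothetical sketch (not the actual Substrate analysis code, and the function name is made up); it uses a plain zero-intercept least-squares fit for illustration:

```rust
// Hypothetical sketch of the steps above: take the minimum sample as the
// base weight, subtract it from every sample, then fit a least-squares
// line through the origin (slope = Σxy / Σx²).
fn fit_base_and_slope(samples: &[(f64, f64)]) -> (f64, f64) {
    // Step 1: the minimum time across all runs becomes the base weight.
    let base = samples
        .iter()
        .map(|&(_, y)| y)
        .fold(f64::INFINITY, f64::min);

    // Steps 2 and 3: subtract the minimum from every result, then run
    // the regression with the intercept forced to y = 0.
    let (mut sum_xy, mut sum_xx) = (0.0, 0.0);
    for &(x, y) in samples {
        sum_xy += x * (y - base);
        sum_xx += x * x;
    }
    let slope = if sum_xx == 0.0 { 0.0 } else { sum_xy / sum_xx };
    (base, slope)
}

fn main() {
    // Synthetic samples: time ≈ 100 + 2·x with a little noise.
    let samples = [(0.0, 100.0), (10.0, 121.0), (20.0, 139.0), (30.0, 160.0)];
    let (base, slope) = fit_base_and_slope(&samples);
    println!("base = {base}, slope = {slope:.3}");
}
```

Because the minimum is subtracted before fitting, the base weight can never come out as zero (or negative) as long as at least one sample is non-zero.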
Here's a sample of how the benchmark results look before this PR:
pallet_contracts/seal_transfer,328713000
pallet_contracts/seal_transfer/r,3335054000
pallet_contracts/seal_transfer,230924000
pallet_contracts/seal_transfer/r,3319470000
pallet_contracts/seal_transfer,281081000
pallet_contracts/seal_transfer/r,3316840000
pallet_contracts/seal_transfer,277473000
pallet_contracts/seal_transfer/r,3386931000
pallet_contracts/seal_transfer,517279000
pallet_contracts/seal_transfer/r,3350051000
pallet_contracts/seal_transfer,193294000
pallet_contracts/seal_transfer/r,3361404000
pallet_contracts/seal_transfer,137453000
pallet_contracts/seal_transfer/r,3380590000
pallet_contracts/seal_transfer,0
pallet_contracts/seal_transfer/r,3452650000
As you can see, the base weights are all over the place, and we even have one zero in here.
And here's how they look after this PR:
pallet_contracts/seal_transfer,349462000
pallet_contracts/seal_transfer/r,2550642000
pallet_contracts/seal_transfer,349162000
pallet_contracts/seal_transfer/r,2511251000
pallet_contracts/seal_transfer,350352000
pallet_contracts/seal_transfer/r,2537639000
pallet_contracts/seal_transfer,348642000
pallet_contracts/seal_transfer/r,2497103000
pallet_contracts/seal_transfer,349471000
pallet_contracts/seal_transfer/r,2519135000
pallet_contracts/seal_transfer,348562000
pallet_contracts/seal_transfer/r,2466470000
There might be better ways of doing this. This PR is also unfinished and should not be merged as-is: it still needs to be properly tested to make sure this isn't subtly broken in some way, some tests still need fixing, and the code to print out errors needs to be added back.
I'm opening this mostly to discuss whether we actually want to fix it in this way, before I invest more time into it.
The steps make sense to me, and seem to not break any assumptions for the benchmarking pipeline overall.
Set the previously calculated minimum time as the base weight.
Doesn't this step assume that the first benchmark run is the absolute minimum weight the extrinsic could be?
I don't think that is always true, for example, I could create a benchmark which runs between [10, 100], but actually 0 is a valid starting point, and less weight than 10.
This might just be bad benchmark design, but a consideration here.
The steps make sense to me, and seem to not break any assumptions for the benchmarking pipeline overall.
Great! I was very interested in seeing whether this makes sense to you before I continue with it.
Set the previously calculated minimum time as the base weight.
Doesn't this step assume that the first benchmark run is the absolute minimum weight the extrinsic could be?
I don't think that is always true, for example, I could create a benchmark which runs between [10, 100], but actually 0 is a valid starting point, and less weight than 10.
This might just be bad benchmark design, but a consideration here.
Well, yeah, it just picks the minimum of all of the runs. Perhaps not ideal, but it's the simplest and the most straightforward way of fixing this that I could think of.
It is a good point that someone might only benchmark for, say, 10, but have the extrinsic accept a 0. That said:
- Do we know how prevalent something like that is? Do we know how many such benchmarks we have? Do we know any parachain which has such benchmarks?
- Since one of the main objectives of having the weights is to prevent attackers from exhausting a chain's resources, I imagine it's better to be conservative and, when in doubt, err on the side of caution. In other words, given the choice between a bad benchmark reducing your chain's throughput (because it won't accept as many extrinsics) and a bad benchmark making your chain vulnerable to a denial-of-service attack (because it will accept too many), the former should be preferable?
If such benchmarks are not widespread, I think it should suffice to update our benchmarking docs to make it explicit how the base weights are calculated, and in the (hopefully?) rare case it happens, to just eat the cost.
We have quite a lot of benchmarks with components that do not start at 0.
You can find them in the Polkadot repo with the regex The range of component `\w+` is `\[[^0].
It will report more spots than benchmarks, since it searches for the component ranges.
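As a hypothetical demonstration of that regex (the sample file name and its contents are made up here), you can run it over a file containing range comments and it will match only components whose range does not start at 0:

```shell
# Create a sample file with two component-range comments (made-up contents).
cat > /tmp/weights_sample.rs <<'EOF'
/// The range of component `v` is `[1000, 2000]`.
/// The range of component `n` is `[0, 100]`.
EOF
# Matches only the component whose range does not start at 0:
grep -E 'The range of component `[a-z_]+` is `\[[^0]' /tmp/weights_sample.rs
```

Note this variant uses `[a-z_]+` in place of `\w+` so it works with plain `grep -E`; a PCRE-capable tool (`grep -P`, ripgrep) can use `\w+` directly.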
Some good examples:
/// The range of component `v` is `[1000, 2000]`.
/// The range of component `t` is `[500, 1000]`.
/// The range of component `d` is `[5, 16]`.
fn phragmms(v: u32, _t: u32, d: u32, ) -> Weight {
...
}
or
/// The range of component `v` is `[1000, 2000]`.
/// The range of component `t` is `[500, 1000]`.
fn create_snapshot_internal(v: u32, t: u32, ) -> Weight {
...
}
Maybe we can get around this by only applying your intercept-shift-logic if the linear regression reports a negative intercept?
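That suggestion could be sketched like this (a hypothetical simplification with made-up names, not a real implementation): keep the ordinary regression result, and only fall back to the minimum-as-base approach when the fitted intercept comes out negative.

```rust
// Hypothetical sketch of the conditional intercept-shift logic suggested
// above: only substitute the minimum observed sample as the base weight
// when the ordinary regression reports a negative intercept (which would
// otherwise be clamped to zero).
fn choose_base(intercept: f64, min_sample: f64) -> f64 {
    if intercept < 0.0 {
        // Negative intercept: use the minimum observed time instead.
        min_sample
    } else {
        // Non-negative intercept: keep the regression's own value.
        intercept
    }
}
```

As the follow-up discussion points out, this still misbehaves when the regression returns a tiny but positive intercept, so it is only a partial fix.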
We have quite a lot of benchmarks with components that do not start at 0.
So what was the rationale to not run those benchmarks with zero values?
Maybe we can get around this by only applying your intercept-shift-logic if the linear regression reports a negative intercept?
Hmm... well, in some cases that could work, but imagine that the linear regression returns an intercept of, say, 1 (which is possible), and then essentially we'll have the same problem. So I'm not entirely sure what the best course of action is here...
There's one more problem I've found when digging through the numbers: the slopes can also be zero for some benchmarks! Behold, the benchmark for frame_system/remark when run on my machine:
[graph: raw benchmark samples (dots) vs. the cost calculated from the weights (flat line at the bottom)]
The dots are real samples from the benchmark, while the line on the bottom is the cost calculated using the weights. This happens because the slope here is calculated as 0.4102529741049291, which is rounded down to zero. When I regenerate the graph with the proper slope it looks like this:
[graph: the same samples with the un-truncated slope applied]
So, uh, this is pretty bad. Weight calculation shouldn't randomly break down like this.
So what was the rationale to not run those benchmarks with zero values?
In many cases, the 0 value makes no sense. For example, creating a membership group with 0 users. The underlying runtime code probably would reject this extrinsic.
Regarding the slope issue, you must note that when looking at the raw, unaltered data, it is basically a zero slope:
https://www.shawntabrizi.com/substrate-graph-benchmarks/old/?p=system&e=remark
Yes, the benchmarking process is truncating really small slopes to zero, but these effects are nearly zero when compared to the underlying weights here.
I'm not against ideas which can bubble these things up accurately, but we should also not say that it is "breaking". The weights we are generating, even for things like remark, are good for what they are used for.
Somehow I also see a linear timing for System::Remark :confused:, while the linear regression reports 0.
cargo r --release --features=runtime-benchmarks --bin substrate -- benchmark pallet --dev --steps=50 --repeat=20 --pallet="frame_system" '--extrinsic=remark' --execution=wasm --wasm-execution=compiled
produces on reference hardware:
Data points distribution:
b mean µs sigma µs %
0 5.419 0.1 1.8%
78643 34.82 0.072 0.2%
157286 63.95 0.1 0.1%
235929 90.53 0.083 0.0%
314572 121.4 0.137 0.1%
393215 150.1 0.176 0.1%
471858 177.3 1.756 0.9%
550501 203.3 0.149 0.0%
629144 231.6 0.161 0.0%
707787 259.8 0.128 0.0%
786430 288.2 0.445 0.1%
865073 316.2 0.145 0.0%
943716 344.4 0.156 0.0%
1022359 380.3 3.213 0.8%
1101002 401.3 0.502 0.1%
1179645 429.3 0.458 0.1%
1258288 457.6 0.191 0.0%
1336931 486.3 1.577 0.3%
1415574 513.9 0.164 0.0%
1494217 542.2 0.296 0.0%
1572860 570.5 0.26 0.0%
1651503 598.8 0.216 0.0%
1730146 634.6 7.387 1.1%
1808789 659.5 10.44 1.5%
1887432 684.4 0.928 0.1%
1966075 712.9 0.922 0.1%
2044718 741.3 0.74 0.0%
2123361 769.1 0.239 0.0%
2202004 798 0.777 0.0%
2280647 832 11.92 1.4%
2359290 865.5 9.494 1.0%
2437933 884.1 1.061 0.1%
2516576 912.2 0.983 0.1%
2595219 942.8 2.526 0.2%
2673862 993.1 14.54 1.4%
2752505 1012 13.67 1.3%
2831148 1028 4.116 0.4%
2909791 1060 7.656 0.7%
2988434 1091 13.12 1.2%
3067077 1113 1.849 0.1%
3145720 1147 11.67 1.0%
3224363 1172 3.371 0.2%
3303006 1200 2.122 0.1%
3381649 1230 4.119 0.3%
3460292 1259 4.096 0.3%
3538935 1288 3.585 0.2%
3617578 1317 4.995 0.3%
3696221 1347 4.183 0.3%
3774864 1377 9.218 0.6%
3853507 1409 8.801 0.6%
3932150 1435 4.814 0.3%
The timing values seem to increase from 5.4µs at b=0 to 1_435µs at maximum b=3932150.
That is quite the difference. Obviously a small slope that we can normally ignore can start to make a difference when it is multiplied by 4 million...
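To make that concrete, here is the arithmetic with the numbers quoted earlier in the thread (the units are whatever the analysis internally uses, roughly nanoseconds of extrinsic time per unit of `b`):

```rust
fn main() {
    // Slope reported earlier in the thread, before it is rounded down to 0:
    let slope = 0.4102529741049291_f64;
    // Largest component value this benchmark is run with:
    let max_b = 3_932_150.0_f64;

    // Cost contribution that disappears when the slope is truncated to 0.
    // This is on the order of the ~1430 µs gap between the b=0 and the
    // maximum-b rows in the table above.
    let lost = slope * max_b;
    println!("truncated-away cost at max b: {lost:.0}");
}
```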
So what was the rationale to not run those benchmarks with zero values?
In many cases, the 0 value makes no sense. For example, creating a membership group with 0 users. The underlying runtime code probably would reject this extrinsic.
So can we assume that if a parameter of zero doesn't make sense and the extrinsic would reject it, then in normal use it wouldn't be called with such a parameter anyway? (Or at least wouldn't be called often if the caller is non-malicious.) In that case, does it really matter that the base weight would be higher than necessary? I know that in general there is no such guarantee, but it'd make sense to me to basically require that extrinsics are benchmarked with the full range of expected inputs to ensure optimal weights. If you don't benchmark with the full range then you might get slightly higher weights, but you won't risk opening yourself up to a potential DoS attack.
Yes, the benchmarking process is truncating really small slopes to zero, but these effects are nearly zero when compared to the underlying weights here.
I'm not against ideas which can bubble these things up accurately, but we should also not say that it is "breaking". The weights we are generating, even for things like remark, are good for what they are used for.
Sorry, I might have used too strong a word here. My point was more that it is a subtle landmine that someone could step on without realizing it, since whether this is a problem or not depends on the exact range the input component can have.
Regarding the slope issue, you must note that when looking at the raw, unaltered data, it is basically a zero slope:
Yes, I was looking at the raw, unaltered data. (: And if you only look at the first 16k of the input component it is indeed basically zero and doesn't matter, but the benchmark goes up to 4 million (as @ggwpez confirmed, thanks!), where it starts to matter.
Again, whether this is actually a problem in practice for this particular extrinsic I haven't checked, but even if it isn't, that doesn't change the fact that it's a potential subtle correctness issue that might bite someone somewhere without them realizing.
So I see two ways we could fix this: either always round the weights up to 1, or just use higher precision weights. Since our weights are multiplied by 1000 anyway, I'm guessing the second option would be preferable?
@koute either solution sounds good to me.
I am still mostly in the camp that the current system is "good enough"; we have plenty of full integration benchmarks which show that the weights are quite accurate (and in the cases where they are not perfect, always pessimistic). But I'm happy to have you find these further improvements that can be made :)
Okay, I've increased the precision with which the weights are returned. So now the analysis code returns the extrinsic times already essentially premultiplied by 1000, instead of multiplying them in the writer.
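The effect of that precision change can be sketched like this (a hypothetical illustration with a made-up function name, not the actual analysis code): by premultiplying by 1000, a sub-nanosecond slope survives as a whole number of picoseconds instead of being truncated to zero.

```rust
// Sketch of the precision change described above (name is hypothetical):
// keep the fitted slope in picoseconds rather than truncating it to whole
// nanoseconds, so a slope like 0.41 ns/unit survives as 410 ps/unit.
fn slope_to_picoseconds(slope_ns: f64) -> u64 {
    (slope_ns * 1000.0).round() as u64
}

fn main() {
    // The frame_system/remark slope from earlier in the thread:
    let slope_ns = 0.4102529741049291;
    // Old behaviour: truncating to whole nanoseconds loses it entirely.
    println!("old: {} ns", slope_ns as u64); // prints "old: 0 ns"
    // New behaviour: premultiply by 1000 first.
    println!("new: {} ps", slope_to_picoseconds(slope_ns)); // prints "new: 410 ps"
}
```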
From my side I think this should be good to go?
As a sanity check here is an example diff comparing weights' outputs before and after this PR:
@@ -62,27 +62,27 @@
impl<T: frame_system::Config> WeightInfo for SubstrateWeight<T> {
// Storage: Recovery Proxy (r:1 w:0)
fn as_recovered() -> Weight {
- (10_050_000 as Weight)
+ (10_110_000 as Weight)
.saturating_add(T::DbWeight::get().reads(1 as Weight))
}
// Storage: Recovery Proxy (r:0 w:1)
fn set_recovered() -> Weight {
- (20_610_000 as Weight)
+ (20_140_000 as Weight)
.saturating_add(T::DbWeight::get().writes(1 as Weight))
}
// Storage: Recovery Recoverable (r:1 w:1)
/// The range of component `n` is `[1, 9]`.
fn create_recovery(n: u32, ) -> Weight {
- (41_123_000 as Weight)
- // Standard Error: 14_000
- .saturating_add((155_000 as Weight).saturating_mul(n as Weight))
+ (39_920_000 as Weight)
+ // Standard Error: 5_800
+ .saturating_add((182_163 as Weight).saturating_mul(n as Weight))
.saturating_add(T::DbWeight::get().reads(1 as Weight))
.saturating_add(T::DbWeight::get().writes(1 as Weight))
}
// Storage: Recovery Recoverable (r:1 w:0)
// Storage: Recovery ActiveRecoveries (r:1 w:1)
fn initiate_recovery() -> Weight {
- (48_000_000 as Weight)
+ (47_270_000 as Weight)
.saturating_add(T::DbWeight::get().reads(2 as Weight))
.saturating_add(T::DbWeight::get().writes(1 as Weight))
}
@@ -90,9 +90,9 @@
// Storage: Recovery ActiveRecoveries (r:1 w:1)
/// The range of component `n` is `[1, 9]`.
fn vouch_recovery(n: u32, ) -> Weight {
- (31_502_000 as Weight)
- // Standard Error: 15_000
- .saturating_add((271_000 as Weight).saturating_mul(n as Weight))
+ (30_050_000 as Weight)
+ // Standard Error: 7_087
+ .saturating_add((334_068 as Weight).saturating_mul(n as Weight))
.saturating_add(T::DbWeight::get().reads(2 as Weight))
.saturating_add(T::DbWeight::get().writes(1 as Weight))
}
@@ -101,9 +101,9 @@
// Storage: Recovery Proxy (r:1 w:1)
/// The range of component `n` is `[1, 9]`.
fn claim_recovery(n: u32, ) -> Weight {
- (41_489_000 as Weight)
- // Standard Error: 16_000
- .saturating_add((192_000 as Weight).saturating_mul(n as Weight))
+ (40_270_000 as Weight)
+ // Standard Error: 6_915
+ .saturating_add((229_951 as Weight).saturating_mul(n as Weight))
.saturating_add(T::DbWeight::get().reads(3 as Weight))
.saturating_add(T::DbWeight::get().writes(1 as Weight))
}
@@ -111,9 +111,9 @@
// Storage: System Account (r:1 w:1)
/// The range of component `n` is `[1, 9]`.
fn close_recovery(n: u32, ) -> Weight {
- (47_169_000 as Weight)
- // Standard Error: 16_000
- .saturating_add((240_000 as Weight).saturating_mul(n as Weight))
+ (46_000_000 as Weight)
+ // Standard Error: 8_005
+ .saturating_add((296_745 as Weight).saturating_mul(n as Weight))
.saturating_add(T::DbWeight::get().reads(2 as Weight))
.saturating_add(T::DbWeight::get().writes(2 as Weight))
}
@@ -121,15 +121,15 @@
// Storage: Recovery Recoverable (r:1 w:1)
/// The range of component `n` is `[1, 9]`.
fn remove_recovery(n: u32, ) -> Weight {
- (44_697_000 as Weight)
- // Standard Error: 16_000
- .saturating_add((196_000 as Weight).saturating_mul(n as Weight))
+ (43_940_000 as Weight)
+ // Standard Error: 9_009
+ .saturating_add((169_804 as Weight).saturating_mul(n as Weight))
.saturating_add(T::DbWeight::get().reads(2 as Weight))
.saturating_add(T::DbWeight::get().writes(1 as Weight))
}
// Storage: Recovery Proxy (r:1 w:1)
fn cancel_recovered() -> Weight {
- (17_090_000 as Weight)
+ (17_141_000 as Weight)
.saturating_add(T::DbWeight::get().reads(1 as Weight))
.saturating_add(T::DbWeight::get().writes(1 as Weight))
}
@@ -139,27 +139,27 @@
impl WeightInfo for () {
// Storage: Recovery Proxy (r:1 w:0)
fn as_recovered() -> Weight {
- (10_050_000 as Weight)
+ (10_110_000 as Weight)
.saturating_add(RocksDbWeight::get().reads(1 as Weight))
}
// Storage: Recovery Proxy (r:0 w:1)
fn set_recovered() -> Weight {
- (20_610_000 as Weight)
+ (20_140_000 as Weight)
.saturating_add(RocksDbWeight::get().writes(1 as Weight))
}
// Storage: Recovery Recoverable (r:1 w:1)
/// The range of component `n` is `[1, 9]`.
fn create_recovery(n: u32, ) -> Weight {
- (41_123_000 as Weight)
- // Standard Error: 14_000
- .saturating_add((155_000 as Weight).saturating_mul(n as Weight))
+ (39_920_000 as Weight)
+ // Standard Error: 5_800
+ .saturating_add((182_163 as Weight).saturating_mul(n as Weight))
.saturating_add(RocksDbWeight::get().reads(1 as Weight))
.saturating_add(RocksDbWeight::get().writes(1 as Weight))
}
// Storage: Recovery Recoverable (r:1 w:0)
// Storage: Recovery ActiveRecoveries (r:1 w:1)
fn initiate_recovery() -> Weight {
- (48_000_000 as Weight)
+ (47_270_000 as Weight)
.saturating_add(RocksDbWeight::get().reads(2 as Weight))
.saturating_add(RocksDbWeight::get().writes(1 as Weight))
}
@@ -167,9 +167,9 @@
// Storage: Recovery ActiveRecoveries (r:1 w:1)
/// The range of component `n` is `[1, 9]`.
fn vouch_recovery(n: u32, ) -> Weight {
- (31_502_000 as Weight)
- // Standard Error: 15_000
- .saturating_add((271_000 as Weight).saturating_mul(n as Weight))
+ (30_050_000 as Weight)
+ // Standard Error: 7_087
+ .saturating_add((334_068 as Weight).saturating_mul(n as Weight))
.saturating_add(RocksDbWeight::get().reads(2 as Weight))
.saturating_add(RocksDbWeight::get().writes(1 as Weight))
}
@@ -178,9 +178,9 @@
// Storage: Recovery Proxy (r:1 w:1)
/// The range of component `n` is `[1, 9]`.
fn claim_recovery(n: u32, ) -> Weight {
- (41_489_000 as Weight)
- // Standard Error: 16_000
- .saturating_add((192_000 as Weight).saturating_mul(n as Weight))
+ (40_270_000 as Weight)
+ // Standard Error: 6_915
+ .saturating_add((229_951 as Weight).saturating_mul(n as Weight))
.saturating_add(RocksDbWeight::get().reads(3 as Weight))
.saturating_add(RocksDbWeight::get().writes(1 as Weight))
}
@@ -188,9 +188,9 @@
// Storage: System Account (r:1 w:1)
/// The range of component `n` is `[1, 9]`.
fn close_recovery(n: u32, ) -> Weight {
- (47_169_000 as Weight)
- // Standard Error: 16_000
- .saturating_add((240_000 as Weight).saturating_mul(n as Weight))
+ (46_000_000 as Weight)
+ // Standard Error: 8_005
+ .saturating_add((296_745 as Weight).saturating_mul(n as Weight))
.saturating_add(RocksDbWeight::get().reads(2 as Weight))
.saturating_add(RocksDbWeight::get().writes(2 as Weight))
}
@@ -198,15 +198,15 @@
// Storage: Recovery Recoverable (r:1 w:1)
/// The range of component `n` is `[1, 9]`.
fn remove_recovery(n: u32, ) -> Weight {
- (44_697_000 as Weight)
- // Standard Error: 16_000
- .saturating_add((196_000 as Weight).saturating_mul(n as Weight))
+ (43_940_000 as Weight)
+ // Standard Error: 9_009
+ .saturating_add((169_804 as Weight).saturating_mul(n as Weight))
.saturating_add(RocksDbWeight::get().reads(2 as Weight))
.saturating_add(RocksDbWeight::get().writes(1 as Weight))
}
// Storage: Recovery Proxy (r:1 w:1)
fn cancel_recovered() -> Weight {
- (17_090_000 as Weight)
+ (17_141_000 as Weight)
.saturating_add(RocksDbWeight::get().reads(1 as Weight))
.saturating_add(RocksDbWeight::get().writes(1 as Weight))
}
And here's an example diff of what the benchmarking prints out on the stdout:
Pallet: "pallet_recovery", Extrinsic: "remove_recovery", Lowest values: [], Highest values: [], Steps: 50, Repeat: 20
Raw Storage Info
========
Storage: Recovery ActiveRecoveries (r:1 w:0)
Storage: Recovery Recoverable (r:1 w:1)
Median Slopes Analysis
========
-- Extrinsic Time --
Model:
-Time ~= 44.66
- + n 0.188
+Time ~= 44.38
+ + n 0.078
µs
Reads = 2 + (0 * n)
Writes = 1 + (0 * n)
Min Squares Analysis
========
-- Extrinsic Time --
Data points distribution:
n mean µs sigma µs %
- 1 44.97 0.183 0.4%
- 2 44.75 0.207 0.4%
- 3 45.64 0.173 0.3%
- 4 45.48 0.288 0.6%
- 5 45.94 0.536 1.1%
- 6 45.51 0.134 0.2%
- 7 45.82 0.244 0.5%
- 8 46.18 0.213 0.4%
- 9 46.74 0.408 0.8%
+ 1 44.52 0.142 0.3%
+ 2 44.68 0.135 0.3%
+ 3 44.92 0.364 0.8%
+ 4 44.39 0.24 0.5%
+ 5 44.61 0.131 0.2%
+ 6 44.75 0.27 0.6%
+ 7 44.66 0.102 0.2%
+ 8 45.8 0.661 1.4%
+ 9 45.42 0.377 0.8%
Quality and confidence:
param error
-n 0.016
+n 0.009
Model:
-Time ~= 44.69
- + n 0.196
+Time ~= 43.94
+ + n 0.169
µs
Reads = 2 + (0 * n)
Writes = 1 + (0 * n)
@koute can you bump the version of the crate?
Otherwise, looking good, will review
@koute can you bump the version of the crate?
Sure. Do you mean version of frame-benchmarking (from 4.0.0-dev to 6.0.0-dev)? Or do you mean the linregress dependency (since they apparently released 0.5 a week ago)?
Otherwise, looking good, will review
Thanks!
The precision increase to pico-seconds is good, thanks.
About the does-not-start-at-zero benchmarks, I don't think there is a straightforward solution.
Our standard approach would be to round it up to zero, since then we over-estimate and are therefore safe.
bot rebase
Rebased
/cmd queue -c bench-bot $ pallet dev pallet_contracts
@ggwpez https://gitlab.parity.io/parity/mirrors/substrate/-/jobs/1805446 was started for your command "$PIPELINE_SCRIPTS_DIR/bench-bot.sh" pallet dev pallet_contracts. Check out https://gitlab.parity.io/parity/mirrors/substrate/-/pipelines?page=1&scope=all&username=group_605_bot to know what else is being executed currently.
Comment /cmd cancel 45-bf695fe2-d5f3-4753-9107-ca5b6deb3bb9 to cancel this command or /cmd cancel to cancel all commands in this pull request.
@ggwpez Command "$PIPELINE_SCRIPTS_DIR/bench-bot.sh" pallet dev pallet_contracts has finished. Result: https://gitlab.parity.io/parity/mirrors/substrate/-/jobs/1805446 has finished. If any artifacts were generated, you can download them from https://gitlab.parity.io/parity/mirrors/substrate/-/jobs/1805446/artifacts/download.
bot merge
Waiting for commit status.