Antithesis found linearization issue
Bug report criteria
- [ ] This bug report is not security related, security issues should be disclosed privately via [email protected].
- [ ] This is not a support request or question, support requests or questions should be raised in the etcd discussion forums.
- [ ] You have read the etcd bug reporting guidelines.
- [ ] Existing open issues along with etcd frequently asked questions have been checked and this is not a duplicate.
What happened?
Looks like one member revision has not progressed for long time, but still it accepted reads and writes.
Antithesis report: https://linuxfoundation.antithesis.com/report/wo0cw7nDtn0iw5FGwqcQw74OwqTDsCHCmWnChA/OiREqJT0lJ844qOJlgc9cH7B1ZQPzFkGta7JuosTUkI.html?auth=v2.public.eyJuYmYiOiIyMDI1LTA3LTI4VDIyOjA2OjA0LjU0OTc5NjYyOVoiLCJzY29wZSI6eyJSZXBvcnRTY29wZVYxIjp7ImFzc2V0IjoiT2lSRXFKVDBsSjg0NHFPSmxnYzljSDdCMVpRUHpGa0d0YTdKdW9zVFVrSS5odG1sIiwicmVwb3J0X2lkIjoid28wY3c3bkR0bjBpdzVGR3dxY1F3NzRPd3FURHNDSENtV25DaEEifX19XfWyBtmDif1hjpNeYKknXOwpr59Clf0Rxu6I9i_DIA3LEMr0aqIyyPjTUnk5riEPMy5TJTIlmjuqJ4I7yGTuAQ#/run/00b4cc5e35594f082f0c353952397cd3-34-6/finding/5a95b2983bca202814eaa6a3fe594910a72cd2c6
var_report_dump.tar.gz /cc @fuweid @ahrtr @nwnt @henrybear327 @joshjms @siyuanfoundation
What did you expect to happen?
No linearization issues
How can we reproduce it (as minimally and precisely as possible)?
contact @marcus-hodgson-antithesis
Anything else we need to know?
No response
Etcd version (please run commands below)
main branch
Etcd configuration (command line flags or environment variables)
ETCD_SNAPSHOT_CATCHUP_ENTRIES: 100 ETCD_SNAPSHOT_COUNT: 50 ETCD_COMPACTION_BATCH_LIMIT: 10
Etcd debug information (please run commands below, feel free to obfuscate the IP address or FQDN in the output)
$ etcdctl member list -w table
# paste output here
$ etcdctl --endpoints=<member list> endpoint status -w table
# paste output here
Relevant log output
Clarification on the issue
The issue description isn't user friendly. I guess some contributors might not get what's the exact issue. The put request (by client-4 on "key26") get a response with rev 3, which doesn't match the watch response (which has revision 155) received by other clients
In client-4/watch.json
{
"Events": [
{
"Type": "put-operation",
"Key": "key26",
"Value": {
"Value": "147"
},
"Revision": 3,
"IsCreate": true
}
],
"Revision": 3,
"Time": 36609617769
}
In client-6/operations.json
{
"Key": "key26",
"Value": {
"Value": "147"
},
"ModRevision": 3,
"Version": 1
}
In other client-x/watch.json
"Events": [
{
"Type": "put-operation",
"Key": "key26",
"Value": {
"Value": "147"
},
"Revision": 155
}
],
Did not see any issue on data (DB and WAL files)
I checked the data (including db and WAL files), did not see any issues (except the missing tombstone revision explained in next section).
$ ./etcd-dump-db iterate-bucket ~/Downloads/var/report/server-etcd0/member/snap/db key --decode
rev={Revision:{Main:176 Sub:0} tombstone:false}, value=[key "key17" | val "167" | created 90 | mod 176 | ver 8]
rev={Revision:{Main:175 Sub:0} tombstone:false}, value=[key "key19" | val "166" | created 102 | mod 175 | ver 11]
rev={Revision:{Main:174 Sub:0} tombstone:true}, value=[key "key2" | val "" | created 0 | mod 0 | ver 0]
rev={Revision:{Main:173 Sub:0} tombstone:false}, value=[key "key17" | val "165" | created 90 | mod 173 | ver 7]
rev={Revision:{Main:172 Sub:0} tombstone:false}, value=[key "key27" | val "164" | created 172 | mod 172 | ver 1]
rev={Revision:{Main:171 Sub:2} tombstone:false}, value=[key "key16" | val "163" | created 113 | mod 171 | ver 9]
rev={Revision:{Main:171 Sub:1} tombstone:true}, value=[key "key25" | val "" | created 0 | mod 0 | ver 0]
rev={Revision:{Main:171 Sub:0} tombstone:false}, value=[key "key19" | val "162" | created 102 | mod 171 | ver 10]
rev={Revision:{Main:170 Sub:1} tombstone:true}, value=[key "key27" | val "" | created 0 | mod 0 | ver 0]
rev={Revision:{Main:170 Sub:0} tombstone:false}, value=[key "key23" | val "160" | created 126 | mod 170 | ver 5]
rev={Revision:{Main:169 Sub:0} tombstone:false}, value=[key "key16" | val "161" | created 113 | mod 169 | ver 8]
rev={Revision:{Main:168 Sub:0} tombstone:false}, value=[key "key24" | val "159" | created 140 | mod 168 | ver 6]
rev={Revision:{Main:166 Sub:0} tombstone:false}, value=[key "key25" | val "157" | created 153 | mod 166 | ver 2]
rev={Revision:{Main:165 Sub:0} tombstone:false}, value=[key "key21" | val "156" | created 105 | mod 165 | ver 7]
rev={Revision:{Main:164 Sub:0} tombstone:false}, value=[key "key19" | val "155" | created 102 | mod 164 | ver 9]
rev={Revision:{Main:161 Sub:0} tombstone:false}, value=[key "key17" | val "152" | created 90 | mod 161 | ver 6]
rev={Revision:{Main:158 Sub:0} tombstone:false}, value=[key "key18" | val "150" | created 114 | mod 158 | ver 8]
rev={Revision:{Main:150 Sub:0} tombstone:false}, value=[key "key2" | val "142" | created 144 | mod 150 | ver 2]
$ ./etcd-dump-db iterate-bucket ~/Downloads/var/report/server-etcd1/member/snap/db key --decode
rev={Revision:{Main:176 Sub:0} tombstone:false}, value=[key "key17" | val "167" | created 90 | mod 176 | ver 8]
rev={Revision:{Main:175 Sub:0} tombstone:false}, value=[key "key19" | val "166" | created 102 | mod 175 | ver 11]
rev={Revision:{Main:174 Sub:0} tombstone:true}, value=[key "key2" | val "" | created 0 | mod 0 | ver 0]
rev={Revision:{Main:173 Sub:0} tombstone:false}, value=[key "key17" | val "165" | created 90 | mod 173 | ver 7]
rev={Revision:{Main:172 Sub:0} tombstone:false}, value=[key "key27" | val "164" | created 172 | mod 172 | ver 1]
rev={Revision:{Main:171 Sub:2} tombstone:false}, value=[key "key16" | val "163" | created 113 | mod 171 | ver 9]
rev={Revision:{Main:171 Sub:1} tombstone:true}, value=[key "key25" | val "" | created 0 | mod 0 | ver 0]
rev={Revision:{Main:171 Sub:0} tombstone:false}, value=[key "key19" | val "162" | created 102 | mod 171 | ver 10]
rev={Revision:{Main:170 Sub:1} tombstone:true}, value=[key "key27" | val "" | created 0 | mod 0 | ver 0]
rev={Revision:{Main:170 Sub:0} tombstone:false}, value=[key "key23" | val "160" | created 126 | mod 170 | ver 5]
rev={Revision:{Main:169 Sub:0} tombstone:false}, value=[key "key16" | val "161" | created 113 | mod 169 | ver 8]
rev={Revision:{Main:168 Sub:0} tombstone:false}, value=[key "key24" | val "159" | created 140 | mod 168 | ver 6]
rev={Revision:{Main:166 Sub:0} tombstone:false}, value=[key "key25" | val "157" | created 153 | mod 166 | ver 2]
rev={Revision:{Main:165 Sub:0} tombstone:false}, value=[key "key21" | val "156" | created 105 | mod 165 | ver 7]
rev={Revision:{Main:164 Sub:0} tombstone:false}, value=[key "key19" | val "155" | created 102 | mod 164 | ver 9]
rev={Revision:{Main:161 Sub:0} tombstone:false}, value=[key "key17" | val "152" | created 90 | mod 161 | ver 6]
rev={Revision:{Main:158 Sub:0} tombstone:false}, value=[key "key18" | val "150" | created 114 | mod 158 | ver 8]
rev={Revision:{Main:150 Sub:0} tombstone:false}, value=[key "key2" | val "142" | created 144 | mod 150 | ver 2]
F5WPP69Q7H:etcd-dump-db wachao$ ./etcd-dump-db iterate-bucket ~/Downloads/var/report/server-etcd2/member/snap/db key --decode
rev={Revision:{Main:176 Sub:0} tombstone:false}, value=[key "key17" | val "167" | created 90 | mod 176 | ver 8]
rev={Revision:{Main:175 Sub:0} tombstone:false}, value=[key "key19" | val "166" | created 102 | mod 175 | ver 11]
rev={Revision:{Main:174 Sub:0} tombstone:true}, value=[key "key2" | val "" | created 0 | mod 0 | ver 0]
rev={Revision:{Main:173 Sub:0} tombstone:false}, value=[key "key17" | val "165" | created 90 | mod 173 | ver 7]
rev={Revision:{Main:172 Sub:0} tombstone:false}, value=[key "key27" | val "164" | created 172 | mod 172 | ver 1]
rev={Revision:{Main:171 Sub:2} tombstone:false}, value=[key "key16" | val "163" | created 113 | mod 171 | ver 9]
rev={Revision:{Main:171 Sub:1} tombstone:true}, value=[key "key25" | val "" | created 0 | mod 0 | ver 0]
rev={Revision:{Main:171 Sub:0} tombstone:false}, value=[key "key19" | val "162" | created 102 | mod 171 | ver 10]
rev={Revision:{Main:170 Sub:1} tombstone:true}, value=[key "key27" | val "" | created 0 | mod 0 | ver 0]
rev={Revision:{Main:170 Sub:0} tombstone:false}, value=[key "key23" | val "160" | created 126 | mod 170 | ver 5]
rev={Revision:{Main:169 Sub:0} tombstone:false}, value=[key "key16" | val "161" | created 113 | mod 169 | ver 8]
rev={Revision:{Main:168 Sub:0} tombstone:false}, value=[key "key24" | val "159" | created 140 | mod 168 | ver 6]
rev={Revision:{Main:166 Sub:0} tombstone:false}, value=[key "key25" | val "157" | created 153 | mod 166 | ver 2]
rev={Revision:{Main:165 Sub:0} tombstone:false}, value=[key "key21" | val "156" | created 105 | mod 165 | ver 7]
rev={Revision:{Main:164 Sub:0} tombstone:false}, value=[key "key19" | val "155" | created 102 | mod 164 | ver 9]
rev={Revision:{Main:161 Sub:0} tombstone:false}, value=[key "key17" | val "152" | created 90 | mod 161 | ver 6]
rev={Revision:{Main:158 Sub:0} tombstone:false}, value=[key "key18" | val "150" | created 114 | mod 158 | ver 8]
rev={Revision:{Main:150 Sub:0} tombstone:false}, value=[key "key2" | val "142" | created 144 | mod 150 | ver 2]
$ ./etcd-dump-db iterate-bucket ~/Downloads/var/report/server-etcd0/member/snap/db meta --decode
key="term", value=5
key="storageVersion", value="3.7.0"
key="scheduledCompactRev", value={170 0}
key="finishedCompactRev", value={170 0}
key="consistent_index", value=238
key="confState", value="{\"voters\":[3201069280892892151,11090874293180130209,15904053625288548798],\"auto_leave\":false}"
$ ./etcd-dump-db iterate-bucket ~/Downloads/var/report/server-etcd1/member/snap/db meta --decode
key="term", value=9
key="storageVersion", value="3.7.0"
key="scheduledCompactRev", value={170 0}
key="finishedCompactRev", value={170 0}
key="consistent_index", value=242
key="confState", value="{\"voters\":[3201069280892892151,11090874293180130209,15904053625288548798],\"auto_leave\":false}"
$ ./etcd-dump-db iterate-bucket ~/Downloads/var/report/server-etcd2/member/snap/db meta --decode
key="term", value=9
key="storageVersion", value="3.7.0"
key="scheduledCompactRev", value={170 0}
key="finishedCompactRev", value={170 0}
key="consistent_index", value=242
key="confState", value="{\"voters\":[3201069280892892151,11090874293180130209,15904053625288548798],\"auto_leave\":false}"
F5WPP69Q7H:etcd-dump-db wachao$
F5WPP69Q7H:etcd-dump-db wachao$
$ ./etcd-dump-logs ~/Downloads/var/report/server-etcd0/
Snapshot:
term=5 index=204 nodes=[2c6c7b8d0d0933f7 99eab3685d8363a1 dcb68c82481661be] confstate={"voters":[3201069280892892151,11090874293180130209,15904053625288548798],"auto_leave":false}
Start dumping log entries from snapshot.
WAL metadata:
nodeID=2c6c7b8d0d0933f7 clusterID=3f51d2f720a184e6 term=9 commitIndex=240 vote=dcb68c82481661be
WAL entries: 38
lastIndex=242
term index type data
5 205 norm header:<ID:3744629086422271679 > put:<key:"key17" value:"146" >
5 206 norm header:<ID:3744629086422271680 > put:<key:"key26" value:"147" >
5 207 norm header:<ID:7043234338494374897 > put:<key:"key24" value:"148" >
5 208 norm header:<ID:7179186752245548303 > compaction:<revision:151 >
5 209 norm header:<ID:7043234338494374899 > put:<key:"key21" value:"149" >
5 210 norm header:<ID:3744629086422271685 > lease_revoke:<ID:3744629086422271649 >
5 211 norm header:<ID:7043234338494374900 > lease_revoke:<ID:7043234338494374771 >
5 212 norm header:<ID:3744629086422271686 > put:<key:"key18" value:"150" >
5 213 norm header:<ID:7179186752245548305 > delete_range:<key:"key26" >
5 214 norm header:<ID:3744629086422271687 > put:<key:"key16" value:"151" >
5 215 norm header:<ID:7043234338494374901 > compaction:<revision:151 >
5 216 norm header:<ID:3744629086422271688 > put:<key:"key17" value:"152" >
5 217 norm header:<ID:3744629086422271689 > put:<key:"key16" value:"153" >
5 218 norm header:<ID:7043234338494374905 > put:<key:"key19" value:"154" >
5 219 norm header:<ID:3744629086422271690 > compaction:<revision:159 >
5 220 norm header:<ID:3744629086422271692 > put:<key:"key19" value:"155" >
5 221 norm header:<ID:7043234338494374907 > lease_grant:<TTL:7200 ID:7043234338494374906 >
5 222 norm header:<ID:3744629086422271696 > put:<key:"key21" value:"156" lease:7043234338494374906 >
5 223 norm header:<ID:7179186752245548309 > put:<key:"key25" value:"157" lease:7043234338494374906 >
5 224 norm header:<ID:7179186752245548310 > compaction:<revision:160 >
5 225 norm header:<ID:3744629086422271702 > lease_grant:<TTL:7200 ID:3744629086422271701 >
5 226 norm header:<ID:7043234338494374909 > compaction:<revision:160 >
5 227 norm header:<ID:7043234338494374910 > put:<key:"key27" value:"158" >
5 228 norm header:<ID:3744629086422271703 > put:<key:"key24" value:"159" lease:3744629086422271701 >
5 229 norm header:<ID:3744629086422271705 > put:<key:"key16" value:"161" >
5 230 norm header:<ID:3744629086422271704 > txn:<success:<request_put:<key:"key23" value:"160" > > success:<request_delete_range:<key:"key27" > > success:<request_range:<key:"key24" > > success:<request_range:<key:"key21" > > >
5 231 norm header:<ID:7043234338494374912 > txn:<success:<request_range:<key:"key18" > > success:<request_put:<key:"key19" value:"162" > > success:<request_delete_range:<key:"key25" > > success:<request_put:<key:"key16" value:"163" > > >
5 232 norm header:<ID:3744629086422271706 > compaction:<revision:168 >
5 233 norm header:<ID:3744629086422271707 > put:<key:"key27" value:"164" >
5 234 norm header:<ID:3744629086422271708 > put:<key:"key17" value:"165" lease:3744629086422271701 >
5 235 norm header:<ID:3744629086422271709 > delete_range:<key:"key2" >
5 236 norm header:<ID:7043234338494374919 > put:<key:"key19" value:"166" >
5 237 norm header:<ID:7179186752245548320 > compaction:<revision:170 >
5 238 norm header:<ID:7179186752245548321 > put:<key:"key17" value:"167" >
6 239 norm
7 240 norm
8 241 norm
9 242 norm
Entry types (Normal,ConfigChange) count is : 38
$ ./etcd-dump-logs ~/Downloads/var/report/server-etcd1/
Snapshot:
term=9 index=242 nodes=[2c6c7b8d0d0933f7 99eab3685d8363a1 dcb68c82481661be] confstate={"voters":[3201069280892892151,11090874293180130209,15904053625288548798],"auto_leave":false}
Start dumping log entries from snapshot.
WAL metadata:
nodeID=99eab3685d8363a1 clusterID=3f51d2f720a184e6 term=9 commitIndex=242 vote=0
WAL entries: 0
term index type data
Entry types (Normal,ConfigChange) count is : 0
$ ./etcd-dump-logs ~/Downloads/var/report/server-etcd2/
Snapshot:
term=5 index=204 nodes=[2c6c7b8d0d0933f7 99eab3685d8363a1 dcb68c82481661be] confstate={"voters":[3201069280892892151,11090874293180130209,15904053625288548798],"auto_leave":false}
Start dumping log entries from snapshot.
WAL metadata:
nodeID=dcb68c82481661be clusterID=3f51d2f720a184e6 term=9 commitIndex=240 vote=dcb68c82481661be
WAL entries: 38
lastIndex=242
term index type data
5 205 norm header:<ID:3744629086422271679 > put:<key:"key17" value:"146" >
5 206 norm header:<ID:3744629086422271680 > put:<key:"key26" value:"147" >
5 207 norm header:<ID:7043234338494374897 > put:<key:"key24" value:"148" >
5 208 norm header:<ID:7179186752245548303 > compaction:<revision:151 >
5 209 norm header:<ID:7043234338494374899 > put:<key:"key21" value:"149" >
5 210 norm header:<ID:3744629086422271685 > lease_revoke:<ID:3744629086422271649 >
5 211 norm header:<ID:7043234338494374900 > lease_revoke:<ID:7043234338494374771 >
5 212 norm header:<ID:3744629086422271686 > put:<key:"key18" value:"150" >
5 213 norm header:<ID:7179186752245548305 > delete_range:<key:"key26" >
5 214 norm header:<ID:3744629086422271687 > put:<key:"key16" value:"151" >
5 215 norm header:<ID:7043234338494374901 > compaction:<revision:151 >
5 216 norm header:<ID:3744629086422271688 > put:<key:"key17" value:"152" >
5 217 norm header:<ID:3744629086422271689 > put:<key:"key16" value:"153" >
5 218 norm header:<ID:7043234338494374905 > put:<key:"key19" value:"154" >
5 219 norm header:<ID:3744629086422271690 > compaction:<revision:159 >
5 220 norm header:<ID:3744629086422271692 > put:<key:"key19" value:"155" >
5 221 norm header:<ID:7043234338494374907 > lease_grant:<TTL:7200 ID:7043234338494374906 >
5 222 norm header:<ID:3744629086422271696 > put:<key:"key21" value:"156" lease:7043234338494374906 >
5 223 norm header:<ID:7179186752245548309 > put:<key:"key25" value:"157" lease:7043234338494374906 >
5 224 norm header:<ID:7179186752245548310 > compaction:<revision:160 >
5 225 norm header:<ID:3744629086422271702 > lease_grant:<TTL:7200 ID:3744629086422271701 >
5 226 norm header:<ID:7043234338494374909 > compaction:<revision:160 >
5 227 norm header:<ID:7043234338494374910 > put:<key:"key27" value:"158" >
5 228 norm header:<ID:3744629086422271703 > put:<key:"key24" value:"159" lease:3744629086422271701 >
5 229 norm header:<ID:3744629086422271705 > put:<key:"key16" value:"161" >
5 230 norm header:<ID:3744629086422271704 > txn:<success:<request_put:<key:"key23" value:"160" > > success:<request_delete_range:<key:"key27" > > success:<request_range:<key:"key24" > > success:<request_range:<key:"key21" > > >
5 231 norm header:<ID:7043234338494374912 > txn:<success:<request_range:<key:"key18" > > success:<request_put:<key:"key19" value:"162" > > success:<request_delete_range:<key:"key25" > > success:<request_put:<key:"key16" value:"163" > > >
5 232 norm header:<ID:3744629086422271706 > compaction:<revision:168 >
5 233 norm header:<ID:3744629086422271707 > put:<key:"key27" value:"164" >
5 234 norm header:<ID:3744629086422271708 > put:<key:"key17" value:"165" lease:3744629086422271701 >
5 235 norm header:<ID:3744629086422271709 > delete_range:<key:"key2" >
5 236 norm header:<ID:7043234338494374919 > put:<key:"key19" value:"166" >
5 237 norm header:<ID:7179186752245548320 > compaction:<revision:170 >
5 238 norm header:<ID:7179186752245548321 > put:<key:"key17" value:"167" >
6 239 norm
7 240 norm
8 241 norm
9 242 norm
Entry types (Normal,ConfigChange) count is : 38
Possible directions to investigate
Not sure it's a test issue or etcdserver issue.
I see that after key "key26" being removed, there were several compaction operations.
- The watch response of rev 3 on "key26" isn't correct.
- The test side just records whatever responded by the etcdserver, it seems unlikely the client side (test) issue. But please double check anyway.
- For the etcdserver side, not sure whether it's related to https://github.com/etcd-io/etcd/pull/18274. Will the deletion + compactions cause temporary revision issue? need investigation.
- since all the data look good, so I think after rebooting all members everything will be running well. It might worth double check.
After https://github.com/etcd-io/etcd/pull/18274 got merged, the tombstone won't be removed, but somehow I did not see the key26 in db file anymore. We need to get this sorted out.
So there are 4 items to double check/investigate. Contributions is welcome!
The issue description isn't user friendly. I guess some contributors might not get what's the exact issue.
Yea, sorry I uploaded it during my vacation. Didn't have time to analyse it, just wanted to publish it so other maintainers can take a look. Think we might want to standardize a process of reporting a robustness/antithesis failure issues, where we would run the commands you run. That would make it a more accessible other contributors, however I don't expect that non-maintainer contributor would be able to debug them.
Think we might want to standardize a process of reporting a robustness/antithesis failure issues
I think we just need to clearly clarify the issue itself from test perspective. Any other analysis is nice to have but not essential. For this case, we should focus on how it breaks linearizability, and clearly describe what the porcupine graph shows.
Adding more context here to help with the investigation: Bug report
Any section that might be confusing has some explanation in our documentation here. What I gather from the report are that the interesting events related to the failure start happening 34 seconds before the assertion failure.
The statistical debug information is another section that runs some analysis on code paths that were hit when the bug manifested. Looks to point strongly towards a line in kvstore.go (but the ln # and col # are instrumented code so it might not match up with the etcd github code).
Interesting, the statistical debug information has highlighted a piece of code that was problematic before:
https://github.com/etcd-io/etcd/blob/a8396186acc33606ae6d0b40515b1fc50c93116b/server/storage/mvcc/kvstore.go#L479-L481
This code was last changed 7 months ago to fix https://github.com/etcd-io/etcd/issues/19179
ping @fuweid
@serathius Pointing out here that the line #s and col #s might be on instrumented code, which is something that I would have to check. The statistical debug information is a section that I do not have much insight behind, so I'd have to check under the hood on how it is calculated
Looks like we have another reproduction of a linearization failure from this report: https://linuxfoundation.antithesis.com/report/w5AjMMKhwqLDksK6CgDDoyUawpFSw54Qw45j/8jICMyoSWR24qW-J6NfeaukB31JBNpQlD3MVIwmJvC4.html?auth=v2.public.eyJuYmYiOiIyMDI1LTA4LTI2VDEzOjE2OjQ4Ljc2NDIyNTU0OFoiLCJzY29wZSI6eyJSZXBvcnRTY29wZVYxIjp7ImFzc2V0IjoiOGpJQ015b1NXUjI0cVctSjZOZmVhdWtCMzFKQk5wUWxEM01WSXdtSnZDNC5odG1sIiwicmVwb3J0X2lkIjoidzVBak1NS2h3cUxEa3NLNkNnRERveVVhd3BGU3c1NFF3NDVqIn19fUsvbqFp7xtGva25OumBuZlxCyW-5SDeRKEUPrKC5H3sg7zb6rXd90k_s0DKcMETux49_53MRd-Id6mJDCNnfQs
I'm looking into why the artifact isn't on the report!
This issue has been automatically marked as stale because it has not had recent activity. It will be closed after 21 days if no further activity occurs. Thank you for your contributions.
Dumping my investigation from today. /cc @marcus-hodgson-antithesis
Did some digging into Linearization issue from todays run https://linuxfoundation.antithesis.com/report/oy4cYdwPtTZ92-1vcsNUcS14/gbqIHfXuci7MQKa4jelp1hXZbGUJXZSUszEDPpF86oc.html?auth=v2.public.eyJuYmYiOiIyMDI1LTExLTE2VDEyOjQ5OjAzLjE2MTgzNTQ1MloiLCJzY29wZSI6eyJSZXBvcnRTY29wZVYxIjp7ImFzc2V0IjoiZ2JxSUhmWHVjaTdNUUthNGplbHAxaFhaYkdVSlhaU1VzekVEUHBGODZvYy5odG1sIiwicmVwb3J0X2lkIjoib3k0Y1lkd1B0VFo5Mi0xdmNzTlVjUzE0In19fcFkIGv1yVU12NnAB6ocv8R7BKVkg9hnmp9ZGARoOjQQMSLLjNz-MdJRsRm5og7QvM3lRp9W8RpoTKjJ47jH3gs
Session:
replay_moment = Moment._exact(
Session.from({id: "e4c3252251b551d2c0219142b50a347e-42-5"}),
'-402622214258029707',
92.02767571923323
)
Created a debug environment: https://linuxfoundation.antithesis.com/report/fcEDfR72ioroT6gRNGAIr1sM/5gKupyGGCgsQTyXTVUFpse2AO5YhsHbZ3kAnxcM9ep0.html?auth=v2.public.eyJuYmYiOiIyMDI1LTExLTE2VDE1OjA4OjI0LjUxMzA4MDM3MVoiLCJzY29wZSI6eyJSZXBvcnRTY29wZVYxIjp7ImFzc2V0IjoiNWdLdXB5R0dDZ3NRVHlYVFZVRnBzZTJBTzVZaHNIYloza0FueGNNOWVwMC5odG1sIiwicmVwb3J0X2lkIjoiZmNFRGZSNzJpb3JvVDZnUk5HQUlyMXNNIn19fWkyLH-iRBH8AIjp3S0ZpdjaXEwKCdD6CrrhKtdKXM6kN3cZin3irAhBTkSPxg2yV2voNxb3mOi4inq2jozwCQM
And wrote the following code with the debugger playbook
notebook_version(2, true)
[environment, moment] = prepare_multiverse_debugging()
rewind_time = 55
branch = moment.rewind(Time.seconds(rewind_time)).branch()
print(bash`cat -n /etcd/customer/server/etcdserver/v3_server.go`.run({branch: branch, container: "etcd0" }))
print(bash`dlv attach 1 --allow-non-terminal-interactive <<< "trace fill go.etcd.io/etcd/server/v3/etcdserver/api/v3rpc.(*header).fill
on fill print rh.Revision
trace read server/etcdserver/v3_server.go:918
on read print confirmedIndex
continue"`.run_in_background({branch: branch, container: "etcd0" }))
print(bash`dlv attach 1 --allow-non-terminal-interactive <<< "trace fill go.etcd.io/etcd/server/v3/etcdserver/api/v3rpc.(*header).fill
on fill print rh.Revision
trace read server/etcdserver/v3_server.go:918
on read print confirmedIndex
continue"`.run_in_background({branch: branch, container: "etcd1" }))
print(bash`dlv attach 1 --allow-non-terminal-interactive <<< "trace fill go.etcd.io/etcd/server/v3/etcdserver/api/v3rpc.(*header).fill
on fill print rh.Revision
trace read server/etcdserver/v3_server.go:918
on read print confirmedIndex
continue"`.run_in_background({branch: branch, container:"etcd2" }))
branch.wait({duration:Time.seconds(20)})
print(environment.events.up_to(branch))
download(environment.extract_file({
moment: branch.end,
path: '/var/report/history.html',
container: 'client'
}))
Attached report
Debugger logs of confirmedIndex variable
40.040 etcd0 confirmedIndex: 12
40.061 etcd2 confirmedIndex: 12
40.265 etcd1 confirmedIndex: 8
41.954 etcd0 confirmedIndex: 15
42.509 etcd1 confirmedIndex: 8
43.538 etcd0 confirmedIndex: 15
46.042 etcd0 confirmedIndex: 15
46.170 etcd1 confirmedIndex: 15
47.476 etcd2 confirmedIndex: 16
47.490 etcd1 confirmedIndex: 16
47.908 etcd2 confirmedIndex: 17
49.084 etcd0 confirmedIndex: 17
What I observed was that etcd1 was getting lower readIndex than other nodes.
etcd1 was still on index 8 on time 40.265, when 100ms before that etcd2 was already on index 12. That would imply a bug in getting read index.
Just before that there was a network slowdown.
38.850 {fault:{affected_nodes:[ALL],details:{asymmetric:false,disruption_type:Slowed,drop_rate:0,latency:{deviation:1542,mean:1715.138096},partitions:[[etcd1,client],[etcd2,etcd0]]},max_duration:2.941884403,name:partition,type:network}}
Logs from that period:
39.826 {"level":"warn","ts":"2025-11-16T00:15:22.265101Z","caller":"etcdserver/v3_server.go:993","msg":"waiting for ReadIndex response took too long, retrying","sent-request-id":7179189198565699338,"retry-timeout":"500ms"}
39.827 {"level":"warn","ts":"2025-11-16T00:15:22.266350Z","caller":"etcdserver/raft.go:388","msg":"leader failed to send out heartbeat on time; took too long, leader is overloaded likely from slow disk","to":"dcb68c82481661be","heartbeat-interval":"100ms","expected-duration":"200ms","exceeded-duration":"2.406773972s"}
39.827 {"level":"warn","ts":"2025-11-16T00:15:22.266359Z","caller":"etcdserver/raft.go:388","msg":"leader failed to send out heartbeat on time; took too long, leader is overloaded likely from slow disk","to":"2c6c7b8d0d0933f7","heartbeat-interval":"100ms","expected-duration":"200ms","exceeded-duration":"2.406780589s"}
40.266 {"level":"info","ts":"2025-11-16T00:15:22.704955Z","caller":"traceutil/trace.go:195","msg":"trace[1449946594] linearizableReadLoop","detail":"{readStateIndex:8; appliedIndex:8; }","duration":"3.448258519s","start":"2025-11-16T00:15:19.256691Z","end":"2025-11-16T00:15:22.704950Z","steps":["trace[1449946594] 'read index received' (duration: 3.448251188s)","trace[1449946594] 'applied index is now lower than readState.Index' (duration: 6.139µs)"],"step_count":2}
40.266 {"level":"warn","ts":"2025-11-16T00:15:22.705015Z","caller":"txn/util.go:93","msg":"apply request took too long","took":"3.448324919s","expected-duration":"100ms","prefix":"read-only range ","request":"key:\"key\" range_end:\"kez\" ","response":"range_response_count:0 size:4"}
40.266 {"level":"info","ts":"2025-11-16T00:15:22.705042Z","caller":"traceutil/trace.go:195","msg":"trace[2119179892] range","detail":"{range_begin:key; range_end:kez; response_count:0; response_revision:1; }","duration":"3.448345303s","start":"2025-11-16T00:15:19.256688Z","end":"2025-11-16T00:15:22.705033Z","steps":["trace[2119179892] 'agreement among raft nodes before linearized reading' (duration: 3.448303222s)"],"step_count":1}
40.270 {"level":"warn","ts":"2025-11-16T00:15:22.709731Z","caller":"v3rpc/interceptor.go:202","msg":"request stats","start time":"2025-11-16T00:15:19.256678Z","time spent":"3.453042388s","remote":"10.89.0.6:36954","response type":"/etcdserverpb.KV/Range","request count":0,"request size":10,"response count":0,"response size":27,"request content":"key:\"key\" range_end:\"kez\" "}
40.735 {"level":"warn","ts":"2025-11-16T00:15:23.174320Z","caller":"rafthttp/probing_status.go:68","msg":"prober detected unhealthy status","round-tripper-name":"ROUND_TRIPPER_RAFT_MESSAGE","remote-peer-id":"2c6c7b8d0d0933f7","rtt":"0s","error":"dial tcp: lookup etcd0 on 10.89.0.1:53: read udp 10.89.0.3:60637->10.89.0.1:53: i/o timeout"}
40.735 {"level":"warn","ts":"2025-11-16T00:15:23.174334Z","caller":"rafthttp/probing_status.go:68","msg":"prober detected unhealthy status","round-tripper-name":"ROUND_TRIPPER_SNAPSHOT","remote-peer-id":"2c6c7b8d0d0933f7","rtt":"0s","error":"dial tcp: lookup etcd0 on 10.89.0.1:53: read udp 10.89.0.3:60637->10.89.0.1:53: i/o timeout"}
40.742 {"level":"warn","ts":"2025-11-16T00:15:23.181524Z","caller":"rafthttp/probing_status.go:68","msg":"prober detected unhealthy status","round-tripper-name":"ROUND_TRIPPER_SNAPSHOT","remote-peer-id":"dcb68c82481661be","rtt":"0s","error":"dial tcp: lookup etcd2 on 10.89.0.1:53: read udp 10.89.0.3:46205->10.89.0.1:53: i/o timeout"}