etcd icon indicating copy to clipboard operation
etcd copied to clipboard

Antithesis found linearization issue

Open serathius opened this issue 5 months ago • 11 comments

Bug report criteria

What happened?

Looks like one member revision has not progressed for long time, but still it accepted reads and writes.

Image

Antithesis report: https://linuxfoundation.antithesis.com/report/wo0cw7nDtn0iw5FGwqcQw74OwqTDsCHCmWnChA/OiREqJT0lJ844qOJlgc9cH7B1ZQPzFkGta7JuosTUkI.html?auth=v2.public.eyJuYmYiOiIyMDI1LTA3LTI4VDIyOjA2OjA0LjU0OTc5NjYyOVoiLCJzY29wZSI6eyJSZXBvcnRTY29wZVYxIjp7ImFzc2V0IjoiT2lSRXFKVDBsSjg0NHFPSmxnYzljSDdCMVpRUHpGa0d0YTdKdW9zVFVrSS5odG1sIiwicmVwb3J0X2lkIjoid28wY3c3bkR0bjBpdzVGR3dxY1F3NzRPd3FURHNDSENtV25DaEEifX19XfWyBtmDif1hjpNeYKknXOwpr59Clf0Rxu6I9i_DIA3LEMr0aqIyyPjTUnk5riEPMy5TJTIlmjuqJ4I7yGTuAQ#/run/00b4cc5e35594f082f0c353952397cd3-34-6/finding/5a95b2983bca202814eaa6a3fe594910a72cd2c6

var_report_dump.tar.gz /cc @fuweid @ahrtr @nwnt @henrybear327 @joshjms @siyuanfoundation

What did you expect to happen?

No linearization issues

How can we reproduce it (as minimally and precisely as possible)?

contact @marcus-hodgson-antithesis

Anything else we need to know?

No response

Etcd version (please run commands below)

main branch

Etcd configuration (command line flags or environment variables)

ETCD_SNAPSHOT_CATCHUP_ENTRIES: 100 ETCD_SNAPSHOT_COUNT: 50 ETCD_COMPACTION_BATCH_LIMIT: 10

Etcd debug information (please run commands below, feel free to obfuscate the IP address or FQDN in the output)

$ etcdctl member list -w table
# paste output here

$ etcdctl --endpoints=<member list> endpoint status -w table
# paste output here

Relevant log output


serathius avatar Jul 29 '25 09:07 serathius

Clarification on the issue

The issue description isn't user friendly. I guess some contributors might not get what's the exact issue. The put request (by client-4 on "key26") get a response with rev 3, which doesn't match the watch response (which has revision 155) received by other clients

In client-4/watch.json

    {
      "Events": [
        {
          "Type": "put-operation",
          "Key": "key26",
          "Value": {
            "Value": "147"
          },
          "Revision": 3,
          "IsCreate": true
        }
      ],
      "Revision": 3,
      "Time": 36609617769
    }

In client-6/operations.json

        {
          "Key": "key26",
          "Value": {
            "Value": "147"
          },
          "ModRevision": 3,
          "Version": 1
        }

In other client-x/watch.json

      "Events": [
        {
          "Type": "put-operation",
          "Key": "key26",
          "Value": {
            "Value": "147"
          },
          "Revision": 155
        }
      ],

Did not see any issue on data (DB and WAL files)

I checked the data (including db and WAL files), did not see any issues (except the missing tombstone revision explained in next section).

$ ./etcd-dump-db iterate-bucket ~/Downloads/var/report/server-etcd0/member/snap/db key --decode
rev={Revision:{Main:176 Sub:0} tombstone:false}, value=[key "key17" | val "167" | created 90 | mod 176 | ver 8]
rev={Revision:{Main:175 Sub:0} tombstone:false}, value=[key "key19" | val "166" | created 102 | mod 175 | ver 11]
rev={Revision:{Main:174 Sub:0} tombstone:true}, value=[key "key2" | val "" | created 0 | mod 0 | ver 0]
rev={Revision:{Main:173 Sub:0} tombstone:false}, value=[key "key17" | val "165" | created 90 | mod 173 | ver 7]
rev={Revision:{Main:172 Sub:0} tombstone:false}, value=[key "key27" | val "164" | created 172 | mod 172 | ver 1]
rev={Revision:{Main:171 Sub:2} tombstone:false}, value=[key "key16" | val "163" | created 113 | mod 171 | ver 9]
rev={Revision:{Main:171 Sub:1} tombstone:true}, value=[key "key25" | val "" | created 0 | mod 0 | ver 0]
rev={Revision:{Main:171 Sub:0} tombstone:false}, value=[key "key19" | val "162" | created 102 | mod 171 | ver 10]
rev={Revision:{Main:170 Sub:1} tombstone:true}, value=[key "key27" | val "" | created 0 | mod 0 | ver 0]
rev={Revision:{Main:170 Sub:0} tombstone:false}, value=[key "key23" | val "160" | created 126 | mod 170 | ver 5]
rev={Revision:{Main:169 Sub:0} tombstone:false}, value=[key "key16" | val "161" | created 113 | mod 169 | ver 8]
rev={Revision:{Main:168 Sub:0} tombstone:false}, value=[key "key24" | val "159" | created 140 | mod 168 | ver 6]
rev={Revision:{Main:166 Sub:0} tombstone:false}, value=[key "key25" | val "157" | created 153 | mod 166 | ver 2]
rev={Revision:{Main:165 Sub:0} tombstone:false}, value=[key "key21" | val "156" | created 105 | mod 165 | ver 7]
rev={Revision:{Main:164 Sub:0} tombstone:false}, value=[key "key19" | val "155" | created 102 | mod 164 | ver 9]
rev={Revision:{Main:161 Sub:0} tombstone:false}, value=[key "key17" | val "152" | created 90 | mod 161 | ver 6]
rev={Revision:{Main:158 Sub:0} tombstone:false}, value=[key "key18" | val "150" | created 114 | mod 158 | ver 8]
rev={Revision:{Main:150 Sub:0} tombstone:false}, value=[key "key2" | val "142" | created 144 | mod 150 | ver 2]

$ ./etcd-dump-db iterate-bucket ~/Downloads/var/report/server-etcd1/member/snap/db key --decode
rev={Revision:{Main:176 Sub:0} tombstone:false}, value=[key "key17" | val "167" | created 90 | mod 176 | ver 8]
rev={Revision:{Main:175 Sub:0} tombstone:false}, value=[key "key19" | val "166" | created 102 | mod 175 | ver 11]
rev={Revision:{Main:174 Sub:0} tombstone:true}, value=[key "key2" | val "" | created 0 | mod 0 | ver 0]
rev={Revision:{Main:173 Sub:0} tombstone:false}, value=[key "key17" | val "165" | created 90 | mod 173 | ver 7]
rev={Revision:{Main:172 Sub:0} tombstone:false}, value=[key "key27" | val "164" | created 172 | mod 172 | ver 1]
rev={Revision:{Main:171 Sub:2} tombstone:false}, value=[key "key16" | val "163" | created 113 | mod 171 | ver 9]
rev={Revision:{Main:171 Sub:1} tombstone:true}, value=[key "key25" | val "" | created 0 | mod 0 | ver 0]
rev={Revision:{Main:171 Sub:0} tombstone:false}, value=[key "key19" | val "162" | created 102 | mod 171 | ver 10]
rev={Revision:{Main:170 Sub:1} tombstone:true}, value=[key "key27" | val "" | created 0 | mod 0 | ver 0]
rev={Revision:{Main:170 Sub:0} tombstone:false}, value=[key "key23" | val "160" | created 126 | mod 170 | ver 5]
rev={Revision:{Main:169 Sub:0} tombstone:false}, value=[key "key16" | val "161" | created 113 | mod 169 | ver 8]
rev={Revision:{Main:168 Sub:0} tombstone:false}, value=[key "key24" | val "159" | created 140 | mod 168 | ver 6]
rev={Revision:{Main:166 Sub:0} tombstone:false}, value=[key "key25" | val "157" | created 153 | mod 166 | ver 2]
rev={Revision:{Main:165 Sub:0} tombstone:false}, value=[key "key21" | val "156" | created 105 | mod 165 | ver 7]
rev={Revision:{Main:164 Sub:0} tombstone:false}, value=[key "key19" | val "155" | created 102 | mod 164 | ver 9]
rev={Revision:{Main:161 Sub:0} tombstone:false}, value=[key "key17" | val "152" | created 90 | mod 161 | ver 6]
rev={Revision:{Main:158 Sub:0} tombstone:false}, value=[key "key18" | val "150" | created 114 | mod 158 | ver 8]
rev={Revision:{Main:150 Sub:0} tombstone:false}, value=[key "key2" | val "142" | created 144 | mod 150 | ver 2]

F5WPP69Q7H:etcd-dump-db wachao$ ./etcd-dump-db iterate-bucket ~/Downloads/var/report/server-etcd2/member/snap/db key --decode
rev={Revision:{Main:176 Sub:0} tombstone:false}, value=[key "key17" | val "167" | created 90 | mod 176 | ver 8]
rev={Revision:{Main:175 Sub:0} tombstone:false}, value=[key "key19" | val "166" | created 102 | mod 175 | ver 11]
rev={Revision:{Main:174 Sub:0} tombstone:true}, value=[key "key2" | val "" | created 0 | mod 0 | ver 0]
rev={Revision:{Main:173 Sub:0} tombstone:false}, value=[key "key17" | val "165" | created 90 | mod 173 | ver 7]
rev={Revision:{Main:172 Sub:0} tombstone:false}, value=[key "key27" | val "164" | created 172 | mod 172 | ver 1]
rev={Revision:{Main:171 Sub:2} tombstone:false}, value=[key "key16" | val "163" | created 113 | mod 171 | ver 9]
rev={Revision:{Main:171 Sub:1} tombstone:true}, value=[key "key25" | val "" | created 0 | mod 0 | ver 0]
rev={Revision:{Main:171 Sub:0} tombstone:false}, value=[key "key19" | val "162" | created 102 | mod 171 | ver 10]
rev={Revision:{Main:170 Sub:1} tombstone:true}, value=[key "key27" | val "" | created 0 | mod 0 | ver 0]
rev={Revision:{Main:170 Sub:0} tombstone:false}, value=[key "key23" | val "160" | created 126 | mod 170 | ver 5]
rev={Revision:{Main:169 Sub:0} tombstone:false}, value=[key "key16" | val "161" | created 113 | mod 169 | ver 8]
rev={Revision:{Main:168 Sub:0} tombstone:false}, value=[key "key24" | val "159" | created 140 | mod 168 | ver 6]
rev={Revision:{Main:166 Sub:0} tombstone:false}, value=[key "key25" | val "157" | created 153 | mod 166 | ver 2]
rev={Revision:{Main:165 Sub:0} tombstone:false}, value=[key "key21" | val "156" | created 105 | mod 165 | ver 7]
rev={Revision:{Main:164 Sub:0} tombstone:false}, value=[key "key19" | val "155" | created 102 | mod 164 | ver 9]
rev={Revision:{Main:161 Sub:0} tombstone:false}, value=[key "key17" | val "152" | created 90 | mod 161 | ver 6]
rev={Revision:{Main:158 Sub:0} tombstone:false}, value=[key "key18" | val "150" | created 114 | mod 158 | ver 8]
rev={Revision:{Main:150 Sub:0} tombstone:false}, value=[key "key2" | val "142" | created 144 | mod 150 | ver 2]

$ ./etcd-dump-db iterate-bucket ~/Downloads/var/report/server-etcd0/member/snap/db meta --decode
key="term", value=5
key="storageVersion", value="3.7.0"
key="scheduledCompactRev", value={170 0}
key="finishedCompactRev", value={170 0}
key="consistent_index", value=238
key="confState", value="{\"voters\":[3201069280892892151,11090874293180130209,15904053625288548798],\"auto_leave\":false}"

$ ./etcd-dump-db iterate-bucket ~/Downloads/var/report/server-etcd1/member/snap/db meta --decode
key="term", value=9
key="storageVersion", value="3.7.0"
key="scheduledCompactRev", value={170 0}
key="finishedCompactRev", value={170 0}
key="consistent_index", value=242
key="confState", value="{\"voters\":[3201069280892892151,11090874293180130209,15904053625288548798],\"auto_leave\":false}"

$ ./etcd-dump-db iterate-bucket ~/Downloads/var/report/server-etcd2/member/snap/db meta --decode
key="term", value=9
key="storageVersion", value="3.7.0"
key="scheduledCompactRev", value={170 0}
key="finishedCompactRev", value={170 0}
key="consistent_index", value=242
key="confState", value="{\"voters\":[3201069280892892151,11090874293180130209,15904053625288548798],\"auto_leave\":false}"
F5WPP69Q7H:etcd-dump-db wachao$ 
F5WPP69Q7H:etcd-dump-db wachao$ 

$ ./etcd-dump-logs ~/Downloads/var/report/server-etcd0/
Snapshot:
term=5 index=204 nodes=[2c6c7b8d0d0933f7 99eab3685d8363a1 dcb68c82481661be] confstate={"voters":[3201069280892892151,11090874293180130209,15904053625288548798],"auto_leave":false}
Start dumping log entries from snapshot.
WAL metadata:
nodeID=2c6c7b8d0d0933f7 clusterID=3f51d2f720a184e6 term=9 commitIndex=240 vote=dcb68c82481661be
WAL entries: 38
lastIndex=242
term	     index	type	data
   5	       205	norm	header:<ID:3744629086422271679 > put:<key:"key17" value:"146" > 
   5	       206	norm	header:<ID:3744629086422271680 > put:<key:"key26" value:"147" > 
   5	       207	norm	header:<ID:7043234338494374897 > put:<key:"key24" value:"148" > 
   5	       208	norm	header:<ID:7179186752245548303 > compaction:<revision:151 > 
   5	       209	norm	header:<ID:7043234338494374899 > put:<key:"key21" value:"149" > 
   5	       210	norm	header:<ID:3744629086422271685 > lease_revoke:<ID:3744629086422271649 > 
   5	       211	norm	header:<ID:7043234338494374900 > lease_revoke:<ID:7043234338494374771 > 
   5	       212	norm	header:<ID:3744629086422271686 > put:<key:"key18" value:"150" > 
   5	       213	norm	header:<ID:7179186752245548305 > delete_range:<key:"key26" > 
   5	       214	norm	header:<ID:3744629086422271687 > put:<key:"key16" value:"151" > 
   5	       215	norm	header:<ID:7043234338494374901 > compaction:<revision:151 > 
   5	       216	norm	header:<ID:3744629086422271688 > put:<key:"key17" value:"152" > 
   5	       217	norm	header:<ID:3744629086422271689 > put:<key:"key16" value:"153" > 
   5	       218	norm	header:<ID:7043234338494374905 > put:<key:"key19" value:"154" > 
   5	       219	norm	header:<ID:3744629086422271690 > compaction:<revision:159 > 
   5	       220	norm	header:<ID:3744629086422271692 > put:<key:"key19" value:"155" > 
   5	       221	norm	header:<ID:7043234338494374907 > lease_grant:<TTL:7200 ID:7043234338494374906 > 
   5	       222	norm	header:<ID:3744629086422271696 > put:<key:"key21" value:"156" lease:7043234338494374906 > 
   5	       223	norm	header:<ID:7179186752245548309 > put:<key:"key25" value:"157" lease:7043234338494374906 > 
   5	       224	norm	header:<ID:7179186752245548310 > compaction:<revision:160 > 
   5	       225	norm	header:<ID:3744629086422271702 > lease_grant:<TTL:7200 ID:3744629086422271701 > 
   5	       226	norm	header:<ID:7043234338494374909 > compaction:<revision:160 > 
   5	       227	norm	header:<ID:7043234338494374910 > put:<key:"key27" value:"158" > 
   5	       228	norm	header:<ID:3744629086422271703 > put:<key:"key24" value:"159" lease:3744629086422271701 > 
   5	       229	norm	header:<ID:3744629086422271705 > put:<key:"key16" value:"161" > 
   5	       230	norm	header:<ID:3744629086422271704 > txn:<success:<request_put:<key:"key23" value:"160" > > success:<request_delete_range:<key:"key27" > > success:<request_range:<key:"key24" > > success:<request_range:<key:"key21" > > > 
   5	       231	norm	header:<ID:7043234338494374912 > txn:<success:<request_range:<key:"key18" > > success:<request_put:<key:"key19" value:"162" > > success:<request_delete_range:<key:"key25" > > success:<request_put:<key:"key16" value:"163" > > > 
   5	       232	norm	header:<ID:3744629086422271706 > compaction:<revision:168 > 
   5	       233	norm	header:<ID:3744629086422271707 > put:<key:"key27" value:"164" > 
   5	       234	norm	header:<ID:3744629086422271708 > put:<key:"key17" value:"165" lease:3744629086422271701 > 
   5	       235	norm	header:<ID:3744629086422271709 > delete_range:<key:"key2" > 
   5	       236	norm	header:<ID:7043234338494374919 > put:<key:"key19" value:"166" > 
   5	       237	norm	header:<ID:7179186752245548320 > compaction:<revision:170 > 
   5	       238	norm	header:<ID:7179186752245548321 > put:<key:"key17" value:"167" > 
   6	       239	norm	
   7	       240	norm	
   8	       241	norm	
   9	       242	norm	

Entry types (Normal,ConfigChange) count is : 38

$ ./etcd-dump-logs ~/Downloads/var/report/server-etcd1/
Snapshot:
term=9 index=242 nodes=[2c6c7b8d0d0933f7 99eab3685d8363a1 dcb68c82481661be] confstate={"voters":[3201069280892892151,11090874293180130209,15904053625288548798],"auto_leave":false}
Start dumping log entries from snapshot.
WAL metadata:
nodeID=99eab3685d8363a1 clusterID=3f51d2f720a184e6 term=9 commitIndex=242 vote=0
WAL entries: 0
term	     index	type	data

Entry types (Normal,ConfigChange) count is : 0

$ ./etcd-dump-logs ~/Downloads/var/report/server-etcd2/
Snapshot:
term=5 index=204 nodes=[2c6c7b8d0d0933f7 99eab3685d8363a1 dcb68c82481661be] confstate={"voters":[3201069280892892151,11090874293180130209,15904053625288548798],"auto_leave":false}
Start dumping log entries from snapshot.
WAL metadata:
nodeID=dcb68c82481661be clusterID=3f51d2f720a184e6 term=9 commitIndex=240 vote=dcb68c82481661be
WAL entries: 38
lastIndex=242
term	     index	type	data
   5	       205	norm	header:<ID:3744629086422271679 > put:<key:"key17" value:"146" > 
   5	       206	norm	header:<ID:3744629086422271680 > put:<key:"key26" value:"147" > 
   5	       207	norm	header:<ID:7043234338494374897 > put:<key:"key24" value:"148" > 
   5	       208	norm	header:<ID:7179186752245548303 > compaction:<revision:151 > 
   5	       209	norm	header:<ID:7043234338494374899 > put:<key:"key21" value:"149" > 
   5	       210	norm	header:<ID:3744629086422271685 > lease_revoke:<ID:3744629086422271649 > 
   5	       211	norm	header:<ID:7043234338494374900 > lease_revoke:<ID:7043234338494374771 > 
   5	       212	norm	header:<ID:3744629086422271686 > put:<key:"key18" value:"150" > 
   5	       213	norm	header:<ID:7179186752245548305 > delete_range:<key:"key26" > 
   5	       214	norm	header:<ID:3744629086422271687 > put:<key:"key16" value:"151" > 
   5	       215	norm	header:<ID:7043234338494374901 > compaction:<revision:151 > 
   5	       216	norm	header:<ID:3744629086422271688 > put:<key:"key17" value:"152" > 
   5	       217	norm	header:<ID:3744629086422271689 > put:<key:"key16" value:"153" > 
   5	       218	norm	header:<ID:7043234338494374905 > put:<key:"key19" value:"154" > 
   5	       219	norm	header:<ID:3744629086422271690 > compaction:<revision:159 > 
   5	       220	norm	header:<ID:3744629086422271692 > put:<key:"key19" value:"155" > 
   5	       221	norm	header:<ID:7043234338494374907 > lease_grant:<TTL:7200 ID:7043234338494374906 > 
   5	       222	norm	header:<ID:3744629086422271696 > put:<key:"key21" value:"156" lease:7043234338494374906 > 
   5	       223	norm	header:<ID:7179186752245548309 > put:<key:"key25" value:"157" lease:7043234338494374906 > 
   5	       224	norm	header:<ID:7179186752245548310 > compaction:<revision:160 > 
   5	       225	norm	header:<ID:3744629086422271702 > lease_grant:<TTL:7200 ID:3744629086422271701 > 
   5	       226	norm	header:<ID:7043234338494374909 > compaction:<revision:160 > 
   5	       227	norm	header:<ID:7043234338494374910 > put:<key:"key27" value:"158" > 
   5	       228	norm	header:<ID:3744629086422271703 > put:<key:"key24" value:"159" lease:3744629086422271701 > 
   5	       229	norm	header:<ID:3744629086422271705 > put:<key:"key16" value:"161" > 
   5	       230	norm	header:<ID:3744629086422271704 > txn:<success:<request_put:<key:"key23" value:"160" > > success:<request_delete_range:<key:"key27" > > success:<request_range:<key:"key24" > > success:<request_range:<key:"key21" > > > 
   5	       231	norm	header:<ID:7043234338494374912 > txn:<success:<request_range:<key:"key18" > > success:<request_put:<key:"key19" value:"162" > > success:<request_delete_range:<key:"key25" > > success:<request_put:<key:"key16" value:"163" > > > 
   5	       232	norm	header:<ID:3744629086422271706 > compaction:<revision:168 > 
   5	       233	norm	header:<ID:3744629086422271707 > put:<key:"key27" value:"164" > 
   5	       234	norm	header:<ID:3744629086422271708 > put:<key:"key17" value:"165" lease:3744629086422271701 > 
   5	       235	norm	header:<ID:3744629086422271709 > delete_range:<key:"key2" > 
   5	       236	norm	header:<ID:7043234338494374919 > put:<key:"key19" value:"166" > 
   5	       237	norm	header:<ID:7179186752245548320 > compaction:<revision:170 > 
   5	       238	norm	header:<ID:7179186752245548321 > put:<key:"key17" value:"167" > 
   6	       239	norm	
   7	       240	norm	
   8	       241	norm	
   9	       242	norm	

Entry types (Normal,ConfigChange) count is : 38

Possible directions to investigate

Not sure it's a test issue or etcdserver issue.

I see that after key "key26" being removed, there were several compaction operations.

  • The watch response of rev 3 on "key26" isn't correct.
    • The test side just records whatever responded by the etcdserver, it seems unlikely the client side (test) issue. But please double check anyway.
    • For the etcdserver side, not sure whether it's related to https://github.com/etcd-io/etcd/pull/18274. Will the deletion + compactions cause temporary revision issue? need investigation.
    • since all the data look good, so I think after rebooting all members everything will be running well. It might worth double check.

After https://github.com/etcd-io/etcd/pull/18274 got merged, the tombstone won't be removed, but somehow I did not see the key26 in db file anymore. We need to get this sorted out.

So there are 4 items to double check/investigate. Contributions is welcome!

ahrtr avatar Jul 29 '25 13:07 ahrtr

The issue description isn't user friendly. I guess some contributors might not get what's the exact issue.

Yea, sorry I uploaded it during my vacation. Didn't have time to analyse it, just wanted to publish it so other maintainers can take a look. Think we might want to standardize a process of reporting a robustness/antithesis failure issues, where we would run the commands you run. That would make it a more accessible other contributors, however I don't expect that non-maintainer contributor would be able to debug them.

serathius avatar Jul 31 '25 12:07 serathius

Think we might want to standardize a process of reporting a robustness/antithesis failure issues

I think we just need to clearly clarify the issue itself from test perspective. Any other analysis is nice to have but not essential. For this case, we should focus on how it breaks linearizability, and clearly describe what the porcupine graph shows.

ahrtr avatar Aug 06 '25 18:08 ahrtr

Adding more context here to help with the investigation: Bug report

Any section that might be confusing has some explanation in our documentation here. What I gather from the report are that the interesting events related to the failure start happening 34 seconds before the assertion failure.

The statistical debug information is another section that runs some analysis on code paths that were hit when the bug manifested. Looks to point strongly towards a line in kvstore.go (but the ln # and col # are instrumented code so it might not match up with the etcd github code).

Interesting, the statistical debug information has highlighted a piece of code that was problematic before:

Image

https://github.com/etcd-io/etcd/blob/a8396186acc33606ae6d0b40515b1fc50c93116b/server/storage/mvcc/kvstore.go#L479-L481

This code was last changed 7 months ago to fix https://github.com/etcd-io/etcd/issues/19179

ping @fuweid

serathius avatar Aug 14 '25 08:08 serathius

@serathius Pointing out here that the line #s and col #s might be on instrumented code, which is something that I would have to check. The statistical debug information is a section that I do not have much insight behind, so I'd have to check under the hood on how it is calculated

Looks like we have another reproduction of a linearization failure from this report: https://linuxfoundation.antithesis.com/report/w5AjMMKhwqLDksK6CgDDoyUawpFSw54Qw45j/8jICMyoSWR24qW-J6NfeaukB31JBNpQlD3MVIwmJvC4.html?auth=v2.public.eyJuYmYiOiIyMDI1LTA4LTI2VDEzOjE2OjQ4Ljc2NDIyNTU0OFoiLCJzY29wZSI6eyJSZXBvcnRTY29wZVYxIjp7ImFzc2V0IjoiOGpJQ015b1NXUjI0cVctSjZOZmVhdWtCMzFKQk5wUWxEM01WSXdtSnZDNC5odG1sIiwicmVwb3J0X2lkIjoidzVBak1NS2h3cUxEa3NLNkNnRERveVVhd3BGU3c1NFF3NDVqIn19fUsvbqFp7xtGva25OumBuZlxCyW-5SDeRKEUPrKC5H3sg7zb6rXd90k_s0DKcMETux49_53MRd-Id6mJDCNnfQs

I'm looking into why the artifact isn't on the report!

var_report_dump_15.tar.gz

This issue has been automatically marked as stale because it has not had recent activity. It will be closed after 21 days if no further activity occurs. Thank you for your contributions.

github-actions[bot] avatar Oct 26 '25 00:10 github-actions[bot]

Dumping my investigation from today. /cc @marcus-hodgson-antithesis

Did some digging into Linearization issue from todays run https://linuxfoundation.antithesis.com/report/oy4cYdwPtTZ92-1vcsNUcS14/gbqIHfXuci7MQKa4jelp1hXZbGUJXZSUszEDPpF86oc.html?auth=v2.public.eyJuYmYiOiIyMDI1LTExLTE2VDEyOjQ5OjAzLjE2MTgzNTQ1MloiLCJzY29wZSI6eyJSZXBvcnRTY29wZVYxIjp7ImFzc2V0IjoiZ2JxSUhmWHVjaTdNUUthNGplbHAxaFhaYkdVSlhaU1VzekVEUHBGODZvYy5odG1sIiwicmVwb3J0X2lkIjoib3k0Y1lkd1B0VFo5Mi0xdmNzTlVjUzE0In19fcFkIGv1yVU12NnAB6ocv8R7BKVkg9hnmp9ZGARoOjQQMSLLjNz-MdJRsRm5og7QvM3lRp9W8RpoTKjJ47jH3gs

Session:

replay_moment = Moment._exact(
    Session.from({id: "e4c3252251b551d2c0219142b50a347e-42-5"}),
    '-402622214258029707',
    92.02767571923323
)

Created a debug environment: https://linuxfoundation.antithesis.com/report/fcEDfR72ioroT6gRNGAIr1sM/5gKupyGGCgsQTyXTVUFpse2AO5YhsHbZ3kAnxcM9ep0.html?auth=v2.public.eyJuYmYiOiIyMDI1LTExLTE2VDE1OjA4OjI0LjUxMzA4MDM3MVoiLCJzY29wZSI6eyJSZXBvcnRTY29wZVYxIjp7ImFzc2V0IjoiNWdLdXB5R0dDZ3NRVHlYVFZVRnBzZTJBTzVZaHNIYloza0FueGNNOWVwMC5odG1sIiwicmVwb3J0X2lkIjoiZmNFRGZSNzJpb3JvVDZnUk5HQUlyMXNNIn19fWkyLH-iRBH8AIjp3S0ZpdjaXEwKCdD6CrrhKtdKXM6kN3cZin3irAhBTkSPxg2yV2voNxb3mOi4inq2jozwCQM

And wrote the following code with the debugger playbook

notebook_version(2, true)

[environment, moment] = prepare_multiverse_debugging()
rewind_time = 55
branch = moment.rewind(Time.seconds(rewind_time)).branch()
print(bash`cat -n /etcd/customer/server/etcdserver/v3_server.go`.run({branch: branch, container: "etcd0" }))
print(bash`dlv attach 1 --allow-non-terminal-interactive <<< "trace fill go.etcd.io/etcd/server/v3/etcdserver/api/v3rpc.(*header).fill
on fill print rh.Revision
trace read server/etcdserver/v3_server.go:918
on read print confirmedIndex
continue"`.run_in_background({branch: branch, container: "etcd0" }))
print(bash`dlv attach 1 --allow-non-terminal-interactive <<< "trace fill go.etcd.io/etcd/server/v3/etcdserver/api/v3rpc.(*header).fill
on fill print rh.Revision
trace read server/etcdserver/v3_server.go:918
on read print confirmedIndex
continue"`.run_in_background({branch: branch, container: "etcd1" }))
print(bash`dlv attach 1 --allow-non-terminal-interactive <<< "trace fill go.etcd.io/etcd/server/v3/etcdserver/api/v3rpc.(*header).fill
on fill print rh.Revision
trace read server/etcdserver/v3_server.go:918
on read print confirmedIndex
continue"`.run_in_background({branch: branch, container:"etcd2" }))


branch.wait({duration:Time.seconds(20)})
print(environment.events.up_to(branch))

download(environment.extract_file({
    moment: branch.end,
    path: '/var/report/history.html',
    container: 'client'
}))

Attached report

var_report_dump (11).tar.gz

Debugger logs of confirmedIndex variable

40.040 etcd0 confirmedIndex: 12
40.061 etcd2 confirmedIndex: 12
40.265 etcd1 confirmedIndex: 8
41.954 etcd0 confirmedIndex: 15
42.509 etcd1 confirmedIndex: 8
43.538 etcd0 confirmedIndex: 15
46.042 etcd0 confirmedIndex: 15
46.170 etcd1 confirmedIndex: 15
47.476 etcd2 confirmedIndex: 16
47.490 etcd1 confirmedIndex: 16
47.908 etcd2 confirmedIndex: 17
49.084 etcd0 confirmedIndex: 17

What I observed was that etcd1 was getting lower readIndex than other nodes.

etcd1 was still on index 8 on time 40.265, when 100ms before that etcd2 was already on index 12. That would imply a bug in getting read index.

serathius avatar Nov 16 '25 20:11 serathius

Just before that there was a network slowdown.

38.850 {fault:{affected_nodes:[ALL],details:{asymmetric:false,disruption_type:Slowed,drop_rate:0,latency:{deviation:1542,mean:1715.138096},partitions:[[etcd1,client],[etcd2,etcd0]]},max_duration:2.941884403,name:partition,type:network}}

serathius avatar Nov 16 '25 21:11 serathius

Logs from that period:

39.826 {"level":"warn","ts":"2025-11-16T00:15:22.265101Z","caller":"etcdserver/v3_server.go:993","msg":"waiting for ReadIndex response took too long, retrying","sent-request-id":7179189198565699338,"retry-timeout":"500ms"}
39.827 {"level":"warn","ts":"2025-11-16T00:15:22.266350Z","caller":"etcdserver/raft.go:388","msg":"leader failed to send out heartbeat on time; took too long, leader is overloaded likely from slow disk","to":"dcb68c82481661be","heartbeat-interval":"100ms","expected-duration":"200ms","exceeded-duration":"2.406773972s"}
39.827 {"level":"warn","ts":"2025-11-16T00:15:22.266359Z","caller":"etcdserver/raft.go:388","msg":"leader failed to send out heartbeat on time; took too long, leader is overloaded likely from slow disk","to":"2c6c7b8d0d0933f7","heartbeat-interval":"100ms","expected-duration":"200ms","exceeded-duration":"2.406780589s"}
40.266 {"level":"info","ts":"2025-11-16T00:15:22.704955Z","caller":"traceutil/trace.go:195","msg":"trace[1449946594] linearizableReadLoop","detail":"{readStateIndex:8; appliedIndex:8; }","duration":"3.448258519s","start":"2025-11-16T00:15:19.256691Z","end":"2025-11-16T00:15:22.704950Z","steps":["trace[1449946594] 'read index received'  (duration: 3.448251188s)","trace[1449946594] 'applied index is now lower than readState.Index'  (duration: 6.139µs)"],"step_count":2}
40.266 {"level":"warn","ts":"2025-11-16T00:15:22.705015Z","caller":"txn/util.go:93","msg":"apply request took too long","took":"3.448324919s","expected-duration":"100ms","prefix":"read-only range ","request":"key:\"key\" range_end:\"kez\" ","response":"range_response_count:0 size:4"}
40.266 {"level":"info","ts":"2025-11-16T00:15:22.705042Z","caller":"traceutil/trace.go:195","msg":"trace[2119179892] range","detail":"{range_begin:key; range_end:kez; response_count:0; response_revision:1; }","duration":"3.448345303s","start":"2025-11-16T00:15:19.256688Z","end":"2025-11-16T00:15:22.705033Z","steps":["trace[2119179892] 'agreement among raft nodes before linearized reading'  (duration: 3.448303222s)"],"step_count":1}
40.270 {"level":"warn","ts":"2025-11-16T00:15:22.709731Z","caller":"v3rpc/interceptor.go:202","msg":"request stats","start time":"2025-11-16T00:15:19.256678Z","time spent":"3.453042388s","remote":"10.89.0.6:36954","response type":"/etcdserverpb.KV/Range","request count":0,"request size":10,"response count":0,"response size":27,"request content":"key:\"key\" range_end:\"kez\" "}
40.735 {"level":"warn","ts":"2025-11-16T00:15:23.174320Z","caller":"rafthttp/probing_status.go:68","msg":"prober detected unhealthy status","round-tripper-name":"ROUND_TRIPPER_RAFT_MESSAGE","remote-peer-id":"2c6c7b8d0d0933f7","rtt":"0s","error":"dial tcp: lookup etcd0 on 10.89.0.1:53: read udp 10.89.0.3:60637->10.89.0.1:53: i/o timeout"}
40.735 {"level":"warn","ts":"2025-11-16T00:15:23.174334Z","caller":"rafthttp/probing_status.go:68","msg":"prober detected unhealthy status","round-tripper-name":"ROUND_TRIPPER_SNAPSHOT","remote-peer-id":"2c6c7b8d0d0933f7","rtt":"0s","error":"dial tcp: lookup etcd0 on 10.89.0.1:53: read udp 10.89.0.3:60637->10.89.0.1:53: i/o timeout"}
40.742 {"level":"warn","ts":"2025-11-16T00:15:23.181524Z","caller":"rafthttp/probing_status.go:68","msg":"prober detected unhealthy status","round-tripper-name":"ROUND_TRIPPER_SNAPSHOT","remote-peer-id":"dcb68c82481661be","rtt":"0s","error":"dial tcp: lookup etcd2 on 10.89.0.1:53: read udp 10.89.0.3:46205->10.89.0.1:53: i/o timeout"}

serathius avatar Nov 16 '25 21:11 serathius