
etcd Go & Java client SDK's retry mechanism may break `Serializable`

Open ahrtr opened this issue 1 year ago • 11 comments

Background

The Jepsen team raised issue https://github.com/etcd-io/etcd/issues/14890, stating that etcd may cause lost updates and cyclic information flow. There is a long discussion there.

Firstly, there was strong evidence indicating that it isn't an etcdserver issue, and that a key was written twice by the client. Refer to https://github.com/etcd-io/etcd/issues/14890#issuecomment-1346006405. So we thought it might be a jetcd or Jepsen issue.

Eventually it turned out to be caused by the client's retry mechanism. Refer to

  • https://github.com/etcd-io/etcd/blob/9f58999f1b173d9ced04553104f65bc0d31f469d/client/v3/client.go#L273
  • etcd java client

Note that Jepsen uses jetcd (the Java client), but I believe the etcd Go client SDK also has this issue.

Breaks Serializable

When a database system processes multiple concurrent transactions, it must produce the same effect as some serial execution of those transactions. This is what Serializable means.

But the etcd client SDK's retry mechanism (in both the Go and Java clients) may break Serializable.

Let's work through an example. Assume there are two concurrent transactions:

  • transaction 1 (txn1): read k1, and write k2: 20
  • transaction 2 (txn2): read k2, and write k1: 10

Based on the definition of Serializable, the final result must be the same as executing the two transactions in some serial order. There are only two possibilities:

  • execute txn1 first, then txn2
    • txn1 read nothing for k1, and write k2 = 20
    • txn2 read 20 for k2, and write k1 = 10
  • execute txn2 first, then txn1
    • txn2 read nothing for k2, and write k1 = 10
    • txn1 read 10 for k1, and write k2 = 20

But the client's retry may lead to a third possibility; see the example workflow below:

  • execute txn1 first: read nothing for k1, and write k2 = 20.
    • But somehow the client side gets an error response for whatever reason (e.g. temporary network issue);
  • execute txn2: read 20 for k2, and write k1 = 10
  • the client retries txn1: read 10 for k1, and write k2 = 20

So finally it leads to cyclic information flow, which breaks Serializable (a minimal Go sketch of these two transactions follows the list below):

  • txn1 reads k1/10, which was written by txn2
  • txn2 reads k2/20, which was written by txn1
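
For concreteness, here is a minimal Go sketch of the two transactions above, using the clientv3 API. The endpoint is an assumption (a local test cluster) and error handling is elided; in the failure scenario each transaction is issued by a different client, txn1's response is lost, txn2 commits, and a client-side retry of txn1 then re-executes it after txn2.

    package main

    import (
        "context"
        "time"

        clientv3 "go.etcd.io/etcd/client/v3"
    )

    func main() {
        cli, err := clientv3.New(clientv3.Config{
            Endpoints:   []string{"localhost:2379"}, // assumption: local test cluster
            DialTimeout: 5 * time.Second,
        })
        if err != nil {
            panic(err)
        }
        defer cli.Close()
        ctx := context.Background()

        // txn1: read k1 and write k2 = 20 as one atomic request.
        if _, err := cli.Txn(ctx).
            Then(clientv3.OpGet("k1"), clientv3.OpPut("k2", "20")).
            Commit(); err != nil {
            // An automatic retry here may re-execute txn1 after txn2,
            // producing the cyclic information flow described above.
        }

        // txn2: read k2 and write k1 = 10 as one atomic request.
        if _, err := cli.Txn(ctx).
            Then(clientv3.OpGet("k2"), clientv3.OpPut("k1", "10")).
            Commit(); err != nil {
            // handle error
        }
    }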

Break Read Committed

Let's work through an example workflow:

  • client 1 sends a request to write k/v: 277/1;
  • client 2 sends a request to write k/v: 277/4;
  • client 2 receives a success response, meaning 277/4 was successfully persisted;
  • kill the etcdserver & restart the etcdserver;
  • the etcd client SDK retries the write of k/v: 277/1, so it's also successfully persisted.
    • But it's a problem if client 1 doesn't get a success response for whatever reason, e.g. a timeout.
  • client 3 reads k: 277, but gets 277/1 instead of 277/4.

Obviously, from the client's perspective, it should read 277/4 in this case, because that value was confirmed committed. So it breaks Read Committed. A minimal Go sketch of this workflow follows the note below.

  • Note that usually breaking Read Committed means the client sees uncommitted data (a dirty read).
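
A minimal Go sketch of this workflow, assuming three separate clientv3 clients whose construction is omitted; the key and values mirror the example above:

    package example

    import (
        "context"
        "fmt"

        clientv3 "go.etcd.io/etcd/client/v3"
    )

    // readCommittedExample mirrors the workflow above; cli1, cli2 and cli3 stand
    // for three separate clients (their construction is omitted for brevity).
    func readCommittedExample(ctx context.Context, cli1, cli2, cli3 *clientv3.Client) {
        // client 1: write 277/1; assume the response is lost (e.g. a timeout),
        // so the client SDK may retry this request later.
        cli1.Put(ctx, "277", "1")

        // client 2: write 277/4 and receive a success response, so 277/4 is
        // confirmed committed.
        cli2.Put(ctx, "277", "4")

        // ... etcdserver is killed and restarted; the retried Put of 277/1 is
        // applied after 277/4 ...

        // client 3: reads key 277, but observes 277/1 instead of the confirmed 277/4.
        if resp, err := cli3.Get(ctx, "277"); err == nil && len(resp.Kvs) > 0 {
            fmt.Printf("key 277 = %s\n", resp.Kvs[0].Value)
        }
    }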

EDIT: even without the client's retry, it's possible for users to run into this "issue", because a user may get a failure response even though the etcdserver has actually already successfully processed the request. We know it's a little confusing to users, but it isn't an issue from etcd's perspective. The Proposal (see below) can mitigate it, but can't completely resolve it.

What did you expect to happen?

etcd should never break Serializable or Read Committed.

How can we reproduce it (as minimally and precisely as possible)?

See the workflow mentioned above. We need to create two e2e test cases to reproduce this issue.

We can leverage a gofail failpoint to reproduce the Serializable issue. When the etcdserver receives two transaction requests, it intentionally returns a failure response for the first transaction, only once; when the etcdserver receives the retried transaction, it should return success.

We can also leverage a sleep failpoint to interleave the execution of the two transactions.

Proposal

  • We should guarantee that the client never retries when the previous operation may possibly have already succeeded.
    • One valid case to retry is when the client receives an auth failure.
  • We should expose an API to let users enable/disable the retry.

see also https://github.com/etcd-io/etcd/issues/14890#issuecomment-2276231588

Action

  • Create an e2e test case to reproduce the "Serializable" issue.
  • Follow the proposal above to resolve the issue.

Reference

  • https://jepsen.io/analyses/jetcd-0.8.2
  • https://github.com/etcd-io/etcd/issues/14890#issuecomment-2276231588

ahrtr avatar Aug 08 '24 19:08 ahrtr

@lavacat are you interested in working on this issue as discussed in the community meeting?

ahrtr avatar Aug 08 '24 19:08 ahrtr

Hi, I'd like to work on this issue and would appreciate some guidance. Could we discuss the details here or on Slack, if that's more convenient?

parthlaw avatar Aug 12 '24 19:08 parthlaw

Hi, I'd like to work on this issue and would appreciate some guidance. Could we discuss the details here or on Slack, if that's more convenient?

Thank you. Unfortunately, this issue isn't a good first issue, not even intermediate; I think it is hard (at least it's hard to create the e2e test).

@lavacat are you interested in working on this issue as discussed in the community meeting?

Let me know if you are working on this. I will work on it if I do not see a response by the end of next week.

ahrtr avatar Aug 14 '24 19:08 ahrtr

@ahrtr, yes, will try to find time this week. Please assign to me.

lavacat avatar Aug 14 '24 20:08 lavacat

/assign @lavacat

ahrtr avatar Aug 14 '24 20:08 ahrtr

Potential duplication of non-idempotent requests is a known problem with Raft-based systems. There's half a section dedicated to it in the extended Raft paper, in section 8:

However, as described so far Raft can execute a command multiple times: for example, if the leader crashes after committing the log entry but before responding to the client, the client will retry the command with a new leader, causing it to be executed a second time. The solution is for clients to assign unique serial numbers to every command. Then, the state machine tracks the latest serial number processed for each client, along with the associated response. If it receives a command whose serial number has already been executed, it responds immediately without re-executing the request.

The proposed solution is somewhat heavy-handed and requires deep changes, including in the server (because the de-duplication logic should be part of the replicated state machine), but it seems difficult to solve the problem without it.

If you simply tell the client "don't retry non-idempotent operations on failure", it adds a significant burden on the user for handling such failures.

eliben avatar Sep 14 '24 16:09 eliben

Your comment is not exactly the same as this issue, but the two are somewhat related.

For a distributed system, it's possible that a client may get an error response due to some temporary issue (e.g. network jitter) while the server side has actually already successfully processed the request. It's rare, but it happens.

From the client's perspective, if it gets a successful response, then it can trust the response. But if it gets an error response, that doesn't mean the request indeed failed on the server side. Note that the following comment focuses on the case where the client gets an error response.

etcd has two kinds of data: key space (key/value) data and non-key-space data (i.e. membership data).

Key space

For key space data, etcd supports MVCC (multi-version concurrency control); refer to here. When the client gets an error response for a request against the key space, it has two choices.

  • The first choice is to simply retry.
    • If the previous request indeed failed on the server side, then there is no issue with retrying.
    • If the previous request was actually successful on the server side, then there will be (at least) two revisions for the same key after the retry (assuming the retry is successful). It's a problem in that case, but not a big one, because etcd guarantees linearizability, which means clients always read the latest data (revision). The other (minor) problem is that watch clients may get two events for the same key.
  • The second choice is to avoid duplication using a TXN (refer to here and here). (Note that a TXN guarantees atomicity.)
    • Whether it's the first execution or a retry, the client always executes the same TXN: check the revision of the key and perform different operations based on the check result. For example,
      • If the key doesn't exist beforehand,
        If createRevision == 0 {
            write the k/v
        } else {
            read the k/v
        }
        
      • If the key already exists beforehand
        read the initial modRevision of the key  # This isn't part of the TXN

        If modRevision == initialModRevision {
            write the k/v
        } else {
            read the k/v
        }
      

Obviously the first choice is much simpler, but has the minor problems mentioned above. The second choice doesn't have those problems, but is more complex; a minimal Go sketch of the second choice follows.
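
A minimal sketch, assuming the clientv3 transaction API, of the "key doesn't exist beforehand" case; the same TXN can be sent on the first attempt and on any retry, and key/value names are illustrative:

    package example

    import (
        "context"

        clientv3 "go.etcd.io/etcd/client/v3"
    )

    // putOnceIfAbsent writes key=val only if the key has never been created;
    // otherwise it reads the existing value. If an earlier attempt actually
    // succeeded on the server, a retried TXN takes the Else branch and does
    // not write a second revision.
    func putOnceIfAbsent(ctx context.Context, cli *clientv3.Client, key, val string) (*clientv3.TxnResponse, error) {
        // For a key that already exists beforehand, the analogous pattern compares
        // the key's mod revision against the revision read before the first attempt:
        //   clientv3.Compare(clientv3.ModRevision(key), "=", initialRev)
        return cli.Txn(ctx).
            If(clientv3.Compare(clientv3.CreateRevision(key), "=", 0)).
            Then(clientv3.OpPut(key, val)).
            Else(clientv3.OpGet(key)).
            Commit()
    }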

Non-key space

For non-key-space data (i.e. membership data), etcd doesn't support MVCC. The client can still follow a similar pattern (check before operation), but it can't use a TXN. Please refer to an example in https://github.com/kubernetes/kubeadm/issues/3111.

ahrtr avatar Sep 14 '24 20:09 ahrtr

any progress? @lavacat

ahrtr avatar Sep 23 '24 15:09 ahrtr

I'm focusing on reproducing interleaving transactions with the default Go client. We have 3 possibilities for retry:

  1. isContextError check
  2. shouldRefreshToken check
  3. isSafeRetry check

We'll ignore 2 for now (but auth is checked before a proposal enters raft). For 3, in the default case we have the nonRepeatable retryPolicy for Txn, and it will only retry in the case

return desc == "there is no address available" || desc == "there is no connection available"

I think in this case there is no chance the first attempt got to raft.

That leaves only 1, and it becomes interesting. I'd expect that if the server side generates a Cancel or Deadline error, the client should retry. But that's not the case for Txn. parseProposeCtxErr converts context.Canceled to errors.ErrCanceled, and context.DeadlineExceeded to errors.ErrTimeoutDueToConnectionLost or errors.ErrTimeout. The last 2 map to code Unavailable. Only ErrGRPCCanceled maps to code Canceled, which would be retried. But toGRPCErrorMap is missing an errors.ErrCanceled: rpctypes.ErrGRPCCanceled entry (maybe a bug).

So, it's not possible for Txn to retry.

For the java client the logic is different, and I think that's the reason the Jepsen test failed.

lavacat avatar Sep 24 '24 09:09 lavacat

But toGRPCErrorMap is missing errors.ErrCanceled: rpctypes.ErrGRPCCanceled entry (maybe a bug).

Indeed it seems like a bug. But let's not change it until there is a clear summary of the existing errors (see also https://github.com/etcd-io/etcd/issues/18493#issuecomment-2345842185).

You can intentionally inject an error in processInternalRaftRequestOnce in the case x := <-ch to return a context.Canceled directly.

  • You can add a filter to only return a context.Canceled on a certain request type (i.e. Put) and on a certain key (i.e. "foo").
  • Ensure the failpoint only fails once, namely only the first request fails and all the following requests succeed. Refer to https://github.com/etcd-io/gofail/blob/32cd1c822104138c918eeb9c0fb4cc707064b821/runtime/terms.go#L65

ahrtr avatar Sep 24 '24 20:09 ahrtr

It might not be safe to automatically retry when seeing a context error, because the server side may have successfully processed the request.

https://github.com/etcd-io/etcd/blob/c79c7d5440f53600836f9d1d69441ef5541f719a/client/v3/retry_interceptor.go#L72

https://github.com/etcd-io/etcd/blob/c79c7d5440f53600836f9d1d69441ef5541f719a/client/v3/retry_interceptor.go#L348-L350

Based on the investigation that we have done so far:

  • The original design wanted to automatically retry on context errors (which may not be safe, as mentioned above);
  • but the implementation doesn't match the design, and the client side will never see a context error (so the final behaviour is correct), as mentioned above in https://github.com/etcd-io/etcd/issues/18424#issuecomment-2370700961.

So this issue can't be reproduced with the etcd Go SDK. I will therefore unpin the issue and de-escalate its priority.

Proposed followup actions:

  • Generate a summary of all the errors as mentioned in https://github.com/etcd-io/etcd/issues/18493#issuecomment-2345842185
  • Remove lines 72-79 below to avoid any confusion. Note that removing them doesn't change the behaviour. https://github.com/etcd-io/etcd/blob/c79c7d5440f53600836f9d1d69441ef5541f719a/client/v3/retry_interceptor.go#L72-L79

ahrtr avatar Oct 02 '24 09:10 ahrtr

  • If the previous request was actually successful on the server side, then there will be (at least) two revisions for the same key after the retry (assuming the retry is successful). It's a problem in that case, but not a big one, because etcd guarantees linearizability, which means clients always read the latest data (revision). The other (minor) problem is that watch clients may get two events for the same key.

Why is that not a big problem? If a Put RPC executes twice because it is retried in this way by the client, it can be interleaved with other writes. It doesn't need to be in a transaction to be a problem.

  • t0: client 1 starts RPC [Put A, 1].
  • t1: etcd server services RPC [Put A, 1] on behalf of client 1.
  • t2: client 2 starts RPC [Put A, 2].
  • t3: etcd server services RPC [Put A, 2] on behalf of client 2.
  • t6: client 2 receives success on its Put that was started at t2; the Put call returns.
  • t4: client 1 receives a failure indication for its Put started at t0 (despite the fact the Put actually succeeded in the etcd server); client 1 does a retry and sends [Put A, 1] again.
  • t5: etcd server services [Put A, 1] on behalf of client 1's Put request (retry) that started at t4.
  • t6: client 1 receives success on its retry for [Put A, 1] at t4; the original call for [Put A, 1] at t0 returns.

You can argue that the Put from client 1 that started at t0 never returned before t6; therefore, from a happens-before perspective, whatever overall system or algorithm the combination of client 1 and client 2 is trying to implement was not violated. However, this system never intended to make 3 writes, and could be assuming from other context (e.g., external conditions providing guarantees for when writes would be triggered) that only 2 writes happened. This is what the jepsen report describes as "lost update", no? (see section 4, Discussion)

Retrying non-idempotent operations is a problem. Put RPCs are not idempotent, even outside of a transaction. Calling that problem big or small is to make assumptions about application-level use cases. A sketch of an unsafe naive retry is shown below.
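
As a concrete illustration, a naive application-level retry wrapper like the sketch below (a hypothetical helper, not part of any etcd API) is unsafe for a non-idempotent Put: if the first attempt actually succeeded but the response was lost, the retry writes the value again and can interleave with other clients' writes, exactly as in the timeline above.

    package example

    import (
        "context"

        clientv3 "go.etcd.io/etcd/client/v3"
    )

    // unsafePutWithRetry is a deliberately naive sketch: it cannot distinguish
    // "the request never executed" from "the request executed but the response
    // was lost", so a retry may apply the Put a second time.
    func unsafePutWithRetry(ctx context.Context, cli *clientv3.Client, key, val string) error {
        var err error
        for attempt := 0; attempt < 3; attempt++ {
            if _, err = cli.Put(ctx, key, val); err == nil {
                return nil
            }
            // Blindly retrying a non-idempotent RPC here is the problem being
            // discussed in this thread.
        }
        return err
    }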

jcferretti avatar Feb 16 '25 03:02 jcferretti

Either this comment is wrong or the Go etcd client suffers from the same problem under transactions that jetcd has, as reported by Jepsen:

https://github.com/etcd-io/etcd/blob/eb7607bd8b3665d14aa40d50435ae8c9002d620c/client/v3/options.go#L42

Namely: gRPC error code UNAVAILABLE is what jetcd was receiving from the server in the Jepsen scenario and what made it retry the transactions (this follows from the jetcd implementation code here and here). If the Go etcd client is doing the same (which the quoted comment above seems to indicate), it should trigger the same problem under the same server behavior.

jcferretti avatar Feb 16 '25 17:02 jcferretti

Either this comment is wrong or the go etcd client suffers from the same problem [...]

It would seem the comment is wrong, given:

https://github.com/etcd-io/etcd/blob/eb7607bd8b3665d14aa40d50435ae8c9002d620c/client/v3/retry.go#L93

meaning, the condition to retry on gRPC Status code UNAVAILABLE is a conjunction with a more complicated condition than just UNAVAILABLE.

I don't see how that code in isSafeRetryMutableRPC makes sense, however, given my understanding of how gRPC client side logic is supposed to work.

All gRPC load balancing strategies keep trying to reconnect to a lost endpoint (or to establish a new connection to a never-before-reached endpoint, in the default case of lazy initialization of a Channel on the first actual RPC attempt), or to an alternative endpoint (depending on "pick-first" or "round-robin").

gRPC stubs have a wait-for-ready option that can be set so the initial connection attempt does not fail immediately on the first attempt to connect a channel (which is lazy by default; it happens as part of the first actual RPC attempt, not at channel creation). The default in gRPC for wait-for-ready is false, to avoid confusing vanilla command line clients and to privilege their interactivity (it is confusing to see a command line tool hang forever on a typo in an endpoint parameter). Not setting wait-for-ready explicitly to true (avoiding the default of false) in a client for a high availability service like etcd wouldn't make sense to me, however, so I am assuming wait-for-ready is set to true by the etcd Go client.

But for a channel that was successfully connected and then drops (presumably the Jepsen case, since AFAIK the failure scenario involved etcd server leader failover), even independent of wait-for-ready the reconnection attempts will by default continue forever, unless a deadline is actually defined for the RPC. If a deadline is defined, then the status code is DEADLINE_EXCEEDED, not UNAVAILABLE. So I am not sure how the Go etcd client could ever get UNAVAILABLE unless the server is actually sending it; I don't see how gRPC internal state transitions on the client side (only) would generate UNAVAILABLE. A small sketch of this behavior follows.
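
A small sketch of the behavior described above, assuming a raw gRPC connection and the KV stub from go.etcd.io/etcd/api/v3/etcdserverpb (this is not how the clientv3 SDK wires its retries; it only illustrates WaitForReady plus a per-RPC deadline):

    package example

    import (
        "context"
        "time"

        pb "go.etcd.io/etcd/api/v3/etcdserverpb"
        "google.golang.org/grpc"
        "google.golang.org/grpc/codes"
        "google.golang.org/grpc/status"
    )

    func putWithDeadline(conn *grpc.ClientConn, key, val string) error {
        kv := pb.NewKVClient(conn)
        ctx, cancel := context.WithTimeout(context.Background(), 5*time.Second)
        defer cancel()

        // WaitForReady(true): don't fail fast while the channel is reconnecting;
        // keep trying within the deadline budget instead.
        _, err := kv.Put(ctx, &pb.PutRequest{Key: []byte(key), Value: []byte(val)}, grpc.WaitForReady(true))
        if status.Code(err) == codes.DeadlineExceeded {
            // The RPC did not complete within the deadline; it may or may not
            // have executed on the server, so it is not safe to blindly retry
            // this non-idempotent request.
        }
        return err
    }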

From https://github.com/grpc/grpc-go/issues/319, which seems to be a related question from the etcd team to the gRPC team:

As far as I understand, the answer to xiang90's question

https://github.com/grpc/grpc-go/issues/319#issuecomment-176384113

is a qualified "yes". Qualified because "regardless of connection error" is a bit misleading as a way to pose the question. A connection error in this context is a failure to connect a socket to a subchannel endpoint inside the gRPC implementation. At that point there is no application-level visible distinction of failure, so the "regardless" phrasing makes an assumption that doesn't hold. The failure may be for different reasons, e.g.,

  • the host part of host:port endpoint may not be available in DNS in this attempt
  • the kernel in the host may refuse the connection because there is no processes listening on that port
  • the machine may be down and the connection attempt timed out at the TCP level

All the cases above are different failures to connect, but in all of them gRPC will keep trying as long as there is deadline budget, AFAIK.

So, summarizing:

  1. I think it is worth exploring whether the etcd server itself is sending gRPC status code UNAVAILABLE. I don't think we can make sense of what Jepsen saw for the jetcd client otherwise: jetcd does have waitForReady=true by default (https://github.com/etcd-io/jetcd/blob/a240dc6a74c82b96a4aab898a181bf35283dd386/jetcd-core/src/main/java/io/etcd/jetcd/ClientBuilder.java#L70), and my reading of the jetcd implementation code here and here implies that the behavior Jepsen reports can only be explained if jetcd was retrying, triggered by gRPC status code UNAVAILABLE received from the server. If UNAVAILABLE is being sent by the server, which I believe should be the case since it follows from the analysis above, then it would make sense to consider sending different, more specific errors, potentially for different cases, with a clear indication of whether the client can retry or not.
  2. I don't think the retry logic in the Go etcd client in retry.go isSafeRetryMutableRPC is sensible. I don't think UNAVAILABLE can be triggered by the client side unless all of these conditions hold true: (a) wait-for-ready is false, (b) there is no deadline for the RPC, (c) this is the first RPC attempted by the client on a (as per the gRPC default) lazily initialized channel. I can't imagine how either (a) or (b) would hold, even less so both of them. So I don't think the condition on line 93 of retry.go can ever be true.


jcferretti avatar Feb 16 '25 19:02 jcferretti

The "Break Read Committed" section in this issue description is confused about a definition of a dirty read in the context of etcd. I don't believe the example is right, and I also believe it contributes to make matters more confused so it would be best to remove it. For the example to make sense we would need to understand what it means to have a dirty read in the context of an etcd transaction (not single Puts), and it doesn't seem to me the example can be made to work /for this bug/ without involving a lot more (maybe it is assumed in which case it would benefit from being made explicit). For a strict definition of read committed consider https://jepsen.io/consistency/models/read-committed. There is a clear example of lost update involving etcd transactions (not single Puts) in section 3.1 Lost Update of https://jepsen.io/analyses/jetcd-0.8.2 (which is what motivated this issue), which mentions "lost update is prohibited by read commited".

So if we want an example we can use that. But I am not actually sure we need an example. As the later discussion on this issue describes (ahrtr on Sep 14, 2024), the important aspect to emphasize here for the clarity of readers is that client-side code (be it a client library or application level) cannot guarantee that an RPC that returned an error means the operation did not actually execute in the etcd server. For the same reason, a client library cannot blindly retry RPCs that may have executed non-idempotent state transitions. In the context of etcd client code this means retrying a Get RPC is fine; retrying a Put or Txn is not. Retries causing a transaction to be executed twice explain behavior that, to client-side code, looks the same as a failure of read committed would look. But what actually happened is the transaction running twice without the client noticing it. Focusing on the symptoms of a failure "looking like" read committed (for a transaction... read committed in the context of lone Puts doesn't have a meaning) loses track of the actual root cause, IMHO.

The language "even without the client's retry, it's also possible for users to run into this 'issue'", with issue in quotes, doesn't help clarify matters either, because it is not explicitly clear what "issue" in quotes means. It is not using the word issue to refer to this GitHub issue. The best interpretation here is that by "issue" it is trying to mean "client-side code cannot guarantee that an RPC that returned an error means the operation did not actually execute in the server". I believe saying that explicitly would be better if that's what we mean. We can say out loud that this is a property of distributed systems and RPCs in general and not something etcd-specific, and quote the gRPC page on status codes, which states for UNAVAILABLE: "Note that it is not always safe to retry non-idempotent operations".

jcferretti avatar Feb 16 '25 22:02 jcferretti

This issue has been automatically marked as stale because it has not had recent activity. It will be closed after 21 days if no further activity occurs. Thank you for your contributions.

github-actions[bot] avatar May 25 '25 00:05 github-actions[bot]