ratis icon indicating copy to clipboard operation
ratis copied to clipboard

RATIS-1968. Remove unreached else block of trySendDelayed

Open symious opened this issue 2 years ago • 8 comments

What changes were proposed in this pull request?

In "https://github.com/apache/ratis/blob/9bd82aa2aa125a749b9357f208dd66533f43374c/ratis-common/src/main/java/org/apache/ratis/util/SlidingWindow.java#L363", the else block should not be reached, since only the first request is send out, all other request are still in the "delayedRequests", we can not receive out-of-order replies if "firstReplied" is not set.

What is the link to the Apache JIRA

https://issues.apache.org/jira/browse/RATIS-1968

Please replace this section with the link to the Apache JIRA)

How was this patch tested?

No tests.

symious avatar Dec 13 '23 01:12 symious

Since testWithLoadAsync fails, it probably needs the else. Sorry that the code was long time ago. I might not remember the details. It may be related to failover to another server (i.e. leader change).

szetszwo avatar Dec 13 '23 04:12 szetszwo

Yes, seems the variables are reset on exceptions. I will try to dive deeper, if needed, I'll close the PR then.

symious avatar Dec 13 '23 05:12 symious

@symious , I recall now -- when there is a failover, there could possibly be multiple calls needed to be retried. Since the retires are async, some other calls can be retried before the first call.

In short, client calls are always in order but retries may be out of order.

szetszwo avatar Dec 13 '23 06:12 szetszwo

@szetszwo The error seems caused by repeating resets on the slidingWindow, could you help to check?

symious avatar Dec 13 '23 08:12 symious

The error seems caused by repeating resets on the slidingWindow, ...

Where is the code?

szetszwo avatar Dec 13 '23 20:12 szetszwo

repeating reset.

https://github.com/apache/ratis/blob/cedcd2ad4cd2da13230aaa0d15678e5aee9b0729/ratis-client/src/main/java/org/apache/ratis/client/impl/OrderedAsync.java#L257C61-L257C61

Steps:

  1. Client send 100 async requests to server, here firstReplied is true
  2. Leader changed, 100 requests replied with exceptions, trying to reset the slidingWindow.
  3. Say the requests id are [100, 199], client receives "id100" and and reset the slidingWindow, firstReplied changed to false.
  4. Client retry on "id100" and receives the reply, so set firstReplied to true.
  5. Client then receives other requests' exception reply from Step2, then reset the slidingWindow again, but the connection is ok now, no need to reset again.

@szetszwo

symious avatar Dec 14 '23 02:12 symious

The failAllAsyncRequests may fail with a different exception (AlreadyClosedException may work, or we should add a new AlreadyFailedException). For that exception, do not resetSlidingWindow.

szetszwo avatar Dec 14 '23 02:12 szetszwo

@szetszwo I think failAllAsyncRequests is not invoked here, the slidingWindow is reset, then CompletionException(e) is thrown and failAllAsyncRequests is skipped. (Correct me if I'm wrong.)

symious avatar Dec 15 '23 10:12 symious