
SOLR-16505: Switch UpdateShardHandler.getRecoveryOnlyHttpClient to Jetty HTTP2

Open iamsanjay opened this issue 1 year ago • 6 comments

https://issues.apache.org/jira/browse/SOLR-16505

Switch UpdateShardHandler.getRecoveryOnlyHttpClient to Jetty HTTP2

In UpdateShardHandler.java, removed the old reference to the Apache HTTP client for recovery updates and initialized it with the Jetty HTTP client instead. The recovery client is only used by RecoveryStrategy.java, which recreates the builder from the existing HTTP client. In RecoveryStrategy.java, sendPrepRecoveryCmd was sending an async request using HttpUriRequest; after replacing it with the newer client there is no option to create an HttpUriRequest. However, we can send an async request using an executor. Calling withExecutor() to set the executor did not work. Further inspection revealed that Http2SolrClient only sets the executor when it creates a new HttpClient; if the builder is initialized with an existing client, there is no way to set the executor. I added new code to tackle that issue, but now we have two different executors: one for Http2SolrClient and another for the HttpClient.
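Roughly, the shape of the change in UpdateShardHandler is to swap the Apache-client field for a Jetty-backed Http2SolrClient built from the existing update client. This is only a sketch; the field names and builder calls are illustrative, not the exact patch:

  // before: private final CloseableHttpClient recoveryOnlyClient;  (Apache HttpClient)
  private final Http2SolrClient recoveryOnlyClient;                 // after: Jetty HTTP/2 based

  // built once, sharing the Jetty HttpClient that backs the update client
  recoveryOnlyClient =
      new Http2SolrClient.Builder()
          .withHttpClient(updateOnlyClient)
          .build();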

Tests

Added stressRecoveryTest.

Checklist

Please review the following and check all that apply:

  • [x] I have reviewed the guidelines for How to Contribute and my code conforms to the standards described there to the best of my ability.
  • [x] I have created a Jira issue and added the issue ID to my pull request title.
  • [x] I have given Solr maintainers access to contribute to my PR branch. (optional but recommended)
  • [x] I have developed this patch against the main branch.
  • [ ] I have run ./gradlew check.
  • [x] I have added tests for my changes.
  • [ ] I have added documentation for the Reference Guide

iamsanjay avatar Feb 18 '24 05:02 iamsanjay

Thank you @dsmiley for reviewing this PR... I'm excited for the work @iamsanjay has been doing, and I do NOT feel comfortable reviewing/merging code in this area ;-)

epugh avatar Feb 20 '24 18:02 epugh

I was thinking about the executor in the Http2SolrClient class.

What is the issue?

  1. Http2SolrClient calls createHttpClient to create a Jetty HTTP client. By default, if no executor is provided, one is created when initializing the Jetty HTTP client:
   private HttpClient createHttpClient(Builder builder) {
     HttpClient httpClient;

     executor = builder.executor;
     if (executor == null) {
       BlockingArrayQueue<Runnable> queue = new BlockingArrayQueue<>(256, 256);
       this.executor =
           new ExecutorUtil.MDCAwareThreadPoolExecutor(
               32, 256, 60, TimeUnit.SECONDS, queue, new SolrNamedThreadFactory("h2sc"));
       shutdownExecutor = true;
     } else {
       shutdownExecutor = false;
     }
     // ...
  2. However, if someone creates an Http2SolrClient.Builder with an existing Jetty HTTP client by calling withHttpClient, no executor is available to them, even if withExecutor is called. Therefore, every asyncRequest results in a NullPointerException: createHttpClient is the only method that initializes the executor, and when a builder is created from an existing Jetty client, createHttpClient is never called, hence no executor (illustrated below).
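A minimal sketch of that failure mode, assuming the builder methods discussed in this thread (the base URL, request, executor, and listener are placeholders):

  Http2SolrClient base =
      new Http2SolrClient.Builder("http://localhost:8983/solr").build(); // creates its own executor

  Http2SolrClient derived =
      new Http2SolrClient.Builder("http://localhost:8983/solr")
          .withHttpClient(base)          // reuses the Jetty HttpClient, but not the executor
          .withExecutor(myExecutor)      // did not work: only createHttpClient() wires up the executor
          .build();

  derived.asyncRequest(request, null, asyncListener); // NPE, because the executor is still null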

Solution: There are two ways I am thinking of, as of now, to handle this in case a new builder is created using an existing Jetty HTTP client.

  1. In the Http2SolrClient constructor, the snippet below is added to check whether the executor is null and, if so, re-initialize the executor at the Http2SolrClient level. At this point we have two executors: one for Http2SolrClient and another for the Jetty HTTP client. Http2SolrClient is only responsible for closing the executors it creates internally; if you provide an executor from outside, you are responsible for closing it.
  if (this.executor == null) {
    this.executor = http2SolrClient.executor;
  }
  2. Second, since we recreate the builder from an existing Jetty client by calling withHttpClient, we can also have that method set the executor (the one created by createHttpClient) on the new Http2SolrClient along with the other properties (see the sketch after the snippet below). But that means every builder recreated from an existing Jetty client gets access to that executor without knowing about it. Below is that executor. I am not entirely comfortable with this; I believe that if a user creates an Http2SolrClient to send async requests, they should provide the executor implementation themselves.

Also, in this option there will be code to check whether the executor is null.

  BlockingArrayQueue<Runnable> queue = new BlockingArrayQueue<>(256, 256);
  this.executor =
      new ExecutorUtil.MDCAwareThreadPoolExecutor(
          32, 256, 60, TimeUnit.SECONDS, queue, new SolrNamedThreadFactory("h2sc"));
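Roughly, option 2 amounts to something like the following inside the builder (a sketch only; field names such as httpClient and executor are assumptions about Http2SolrClient internals):

  public Builder withHttpClient(Http2SolrClient http2SolrClient) {
    this.httpClient = http2SolrClient.httpClient; // existing behavior: reuse the Jetty HttpClient
    if (this.executor == null) {
      this.executor = http2SolrClient.executor;   // new: carry over the executor as well
    }
    return this;
  }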

@dsmiley wdyt? I have changed the code to implement the second option for now.

@stillalex I see that we transfer some properties from the older Http2SolrClient when we call withHttpClient to create a new Http2SolrClient, but the existing executor was never used for the new builder. Was there any reason for that?

iamsanjay avatar Feb 21 '24 04:02 iamsanjay

@dsmiley I was working on IndexFetcher to change its client to Http2SolrClient. Suddenly all the requests in IndexFetcher started failing due to an authentication issue. Further inspection revealed that the listener required for auth is not registered for the recoveryOnlyClient. Interestingly, no calls in RecoveryStrategy require auth, and therefore no request failed there.

Below is the code that is called on the updateOnlyClient to register the auth listener.

https://github.com/apache/solr/blob/1c6798b0a012f47ab88ed091b9015462016dc8e8/solr/core/src/java/org/apache/solr/update/UpdateShardHandler.java#L317-L319

However, only calling this method for the recoveryOnlyClient won't work, because recreating an Http2SolrClient using withHttpClient won't pass on the listeners. It is basically the same issue we faced with the executors.

iamsanjay avatar Mar 03 '24 17:03 iamsanjay

Thanks for digging into the authentication issue further; this is shining a light on an issue I was vaguely aware of but now I can see it directly. And I'm so glad there are tests; I thought it wasn't tested. Can you name one please?

One thing will help: stop excessive creation of new SolrClient instances when we can re-use the same one that is configured with this auth listener. Thus, don't have fields on classes that hold an HttpClient (we'll deem this bad/questionable practice); instead hold an Http2SolrClient. If, due to customized timeouts, we need a new Http2SolrClient, we could ensure that the builder's withHttpClient(http2SolrClient) copies listeners (and anything else, like an executor; why not). That could be its own JIRA.

dsmiley avatar Mar 05 '24 00:03 dsmiley

TestPullReplicaWithAuth.testPKIAuthWorksForPullReplication is the one.

On a side note, how about we rethink the design of decorating the request? Right now we recreate the Http2SolrClient to override socketTimeout and connTimeout. Instead of recreating the Http2SolrClient, we could introduce another method which would override the timeouts:

request(SolrRequest<?> solrRequest, String collection, int socketTimeout, int connTimeout)

iamsanjay avatar Mar 05 '24 05:03 iamsanjay

On a side note, how about we rethink the design of decorating the request? Right now we recreate the Http2SolrClient to override socketTimeout and connTimeout.

Which is a fine approach, IMO. They are light.

Instead of recreating the Http2SolrClient, we could introduce another method which would override the timeouts.

I was going to mention something like this, but I'm opposed to another request() method like you propose. Instead, SolrRequest could have the timeouts. It already has other low-ish level methods like headers, URL, and preferred nodes. Just keep in mind we have 2 clients and another on the way, so 2-3 places would need updating to consider this timeout.
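A hypothetical shape of that idea (these setters do not exist in SolrJ today; the names are purely illustrative):

  QueryRequest req = new QueryRequest(params);
  req.setSocketTimeout(120_000);      // hypothetical per-request override, in milliseconds
  req.setConnectionTimeout(60_000);   // hypothetical per-request override
  client.request(req, collection);    // the client would apply these instead of its defaults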

dsmiley avatar Mar 05 '24 05:03 dsmiley

Allow Compression

HttpSolrClient provides a feature to enable external compression, which adds an "Accept-Encoding: gzip" header to the request. We use this feature in IndexFetcher when creating the HttpSolrClient. https://github.com/apache/solr/blob/a1104bd04a799ba99cba652922bfded51d0c2d72/solr/solrj/src/java/org/apache/solr/client/solrj/impl/HttpClientUtil.java#L428-L433

Now, the Jetty HTTP client adds an "Accept-Encoding: gzip" header to all requests by default.

However, in Http2SolrClient.java we remove that header while decorating the request:

https://github.com/apache/solr/blob/a1104bd04a799ba99cba652922bfded51d0c2d72/solr/solrj/src/java/org/apache/solr/client/solrj/impl/Http2SolrClient.java#L632-L634

Added a new commit that allows users to enable external compression at the request level. For now I only enabled it for one request in IndexFetcher; I still have to do it for all requests when useExternalCompression is true. Alternatively, we could add this functionality to Http2SolrClient so that it is applied to all requests automatically.

A few cases are failing, which I am still debugging!

iamsanjay avatar Mar 07 '24 12:03 iamsanjay

There will be no need to add extra code at the SolrRequest level. I can just call addHeader on the SolrRequest to add "Accept-Encoding: gzip".
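For example (a sketch; it assumes the SolrRequest addHeader method referenced above, with a placeholder request and client):

  QueryRequest req = new QueryRequest(params);
  req.addHeader("Accept-Encoding", "gzip"); // opt this one request into external compression
  client.request(req, collection);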

iamsanjay avatar Mar 07 '24 12:03 iamsanjay

The test case which is failing:

./gradlew :solr:core:test --tests "org.apache.solr.handler.TestHealthCheckHandlerLegacyMode.doTestHealthCheckWithReplication" -Ptests.jvms=2 "-Ptests.jvmargs=-XX:TieredStopAtLevel=1 -XX:+UseParallelGC -XX:ActiveProcessorCount=1 -XX:ReservedCodeCacheSize=120m" -Ptests.seed=6C9BE6E7222DDA0B -Ptests.file.encoding=US-ASCII -Ptests.verbose=true

And if I remove the code in Http2SolrClient that strips the ACCEPT_ENCODING header, then it works fine: req.headers(headers -> headers.remove(HttpHeader.ACCEPT_ENCODING));

iamsanjay avatar Mar 07 '24 13:03 iamsanjay

So far we were only testing internal compression, not external compression. In one of the classes, ReplicationTestHelper, I found a way to enable external compression, and now whenever it is on, the Accept-Encoding header is sent.

I still have not found a viable way to test external compression. I can confirm that the header is being sent when external compression is enabled, but how it affects the stream is not tested yet. Through Wireshark I tried to trace these calls, but I was only able to trace HTTP/1 calls, not HTTP/2.

iamsanjay avatar Mar 12 '24 15:03 iamsanjay

Glad that you did the compression testing manually! I think at this point the PR just needs a few method/field renames so that a SolrClient is not confused with an HttpClient. And there's the IndexFetcher basic authentication that's obsolete.

dsmiley avatar Mar 14 '24 19:03 dsmiley

IndexFetcher getStream fails; I've seen it multiple times now. Blocker!

ERROR (recoveryExecutor-452-thread-1-processing-unloadcollection_shard1_replica4 null-484 core_node5 127.0.0.1:42871_solr unloadcollection shard1) [n:127.0.0.1:42871_solr c:unloadcollection s:shard1 r:core_node5 x:unloadcollection_shard1_replica4 t:null-484] o.a.s.h.IndexFetcher Error fetching file, doing one retry...
  => org.apache.solr.common.SolrException: Unable to download _7_Lucene99_0.tip completely. Downloaded 0!=76
     at org.apache.solr.handler.IndexFetcher$FileFetcher.cleanup(IndexFetcher.java:1938)
org.apache.solr.common.SolrException: Unable to download _7_Lucene99_0.tip completely. Downloaded 0!=76
  at org.apache.solr.handler.IndexFetcher$FileFetcher.cleanup(IndexFetcher.java:1938) ~[main/:?]
  at org.apache.solr.handler.IndexFetcher$FileFetcher.fetch(IndexFetcher.java:1801) ~[main/:?]
  at org.apache.solr.handler.IndexFetcher$FileFetcher.fetchFile(IndexFetcher.java:1775) [main/:?]
  at org.apache.solr.handler.IndexFetcher.downloadIndexFiles(IndexFetcher.java:1195) [main/:?]
  at org.apache.solr.handler.IndexFetcher.fetchLatestIndex(IndexFetcher.java:682) [main/:?]
  at org.apache.solr.handler.IndexFetcher.fetchLatestIndex(IndexFetcher.java:419) [main/:?]
  at org.apache.solr.handler.ReplicationHandler.doFetch(ReplicationHandler.java:467) [main/:?]
  at org.apache.solr.cloud.RecoveryStrategy.replicate(RecoveryStrategy.java:243) [main/:10.0.0-SNAPSHOT 9775e47757b251257633d12a58da38228932fbd4 [snapshot build, details omitted]]
  at org.apache.solr.cloud.RecoveryStrategy.doSyncOrReplicateRecovery(RecoveryStrategy.java:701) [main/:10.0.0-SNAPSHOT 9775e47757b251257633d12a58da38228932fbd4 [snapshot build, details omitted]]
  at org.apache.solr.cloud.RecoveryStrategy.doRecovery(RecoveryStrategy.java:333) [main/:10.0.0-SNAPSHOT 9775e47757b251257633d12a58da38228932fbd4 [snapshot build, details omitted]]
  at org.apache.solr.cloud.RecoveryStrategy.run(RecoveryStrategy.java:309) [main/:10.0.0-SNAPSHOT 9775e47757b251257633d12a58da38228932fbd4 [snapshot build, details omitted]]
  at com.codahale.metrics.InstrumentedExecutorService$InstrumentedRunnable.run(InstrumentedExecutorService.java:212) [metrics-core-4.2.25.jar:4.2.25]

iamsanjay avatar Mar 19 '24 13:03 iamsanjay

The GO_AWAY frame is used to terminate the connection. In our case, the Jetty server is sending GO_AWAY to terminate the connection because we are sending multiple "GetStream" requests to fetch all the files.

Caused by: java.nio.channels.AsynchronousCloseException
  at org.eclipse.jetty.http2.client.http.HttpConnectionOverHTTP2.close(HttpConnectionOverHTTP2.java:217) ~[http2-http-client-transport-10.0.20.jar:10.0.20]
  at org.eclipse.jetty.http2.client.http.HttpClientTransportOverHTTP2.onClose(HttpClientTransportOverHTTP2.java:175) ~[http2-http-client-transport-10.0.20.jar:10.0.20]
  at org.eclipse.jetty.http2.client.http.HttpClientTransportOverHTTP2$SessionListenerPromise.onClose(HttpClientTransportOverHTTP2.java:194) ~[http2-http-client-transport-10.0.20.jar:10.0.20]
  at org.eclipse.jetty.http2.client.http.HTTPSessionListenerPromise.onClose(HTTPSessionListenerPromise.java:101) ~[http2-http-client-transport-10.0.20.jar:10.0.20]
  at org.eclipse.jetty.http2.api.Session$Listener.onClose(Session.java:260) ~[http2-common-10.0.20.jar:10.0.20]
  at org.eclipse.jetty.http2.HTTP2Session.notifyClose(HTTP2Session.java:1197) ~[http2-common-10.0.20.jar:10.0.20]
  at org.eclipse.jetty.http2.HTTP2Session$StreamsState.terminate(HTTP2Session.java:2153) ~[http2-common-10.0.20.jar:10.0.20]
  at org.eclipse.jetty.http2.HTTP2Session$StreamsState.lambda$sendGoAwayAndTerminate$16(HTTP2Session.java:2059) ~[http2-common-10.0.20.jar:10.0.20]

Now that Solr is not the one terminating the connection, I am interested in the Jetty logs. @dsmiley Does Jetty use separate files for logging? If yes, can you please tell me how I can access that file, especially in the test environment?

iamsanjay avatar Mar 27 '24 07:03 iamsanjay

For now, I fixed another issue related to closing the output stream in IndexFetcher. IndexFetcher was closing the output stream in an unwanted manner, which made the retry operation fail. Now, if the server terminates the connection, IndexFetcher will not close the output stream.

iamsanjay avatar Mar 27 '24 07:03 iamsanjay

Used the LogLevel annotation to generate DEBUG logs from Jetty. However, the excessive logging reduced the likelihood of reproducing the failure, so I restricted the logging to one class.

@SuppressSSL // Currently, unknown why SSL does not work with this test
@LogLevel("org.eclipse.jetty.http2.HTTP2Connection=DEBUG")
public class TestHealthCheckHandlerLegacyMode extends SolrTestCaseJ4 {

Below is the new exception observed in the logs related to terminating the connection.

DEBUG (qtp803109855-19) [n: c: s: r: x: t:] o.e.j.h.HTTP2Connection Processing session failure on HTTP2ServerSession@1674feca{local:/127.0.0.1:50713,remote:/127.0.0.1:50719,sendWindow=938358,recvWindow=1048576,state=[streams=0,CLOSING,goAwayRecv=null,goAwaySent=GoAwayFrame@ca473bc{847/enhance_your_calm_error/invalid_rst_stream_frame_rate},failure=java.io.IOException: enhance_your_calm_error/invalid_rst_stream_frame_rate]}
  => java.io.IOException: enhance_your_calm_error/invalid_rst_stream_frame_rate
     at org.eclipse.jetty.http2.HTTP2Session.toFailure(HTTP2Session.java:633)
java.io.IOException: enhance_your_calm_error/invalid_rst_stream_frame_rate
  at org.eclipse.jetty.http2.HTTP2Session.toFailure(HTTP2Session.java:633) [http2-common-10.0.20.jar:10.0.20]
  at org.eclipse.jetty.http2.HTTP2Session$StreamsState.onSessionFailure(HTTP2Session.java:2006) [http2-common-10.0.20.jar:10.0.20]
  at org.eclipse.jetty.http2.HTTP2Session.onSessionFailure(HTTP2Session.java:578) [http2-common-10.0.20.jar:10.0.20]
  at org.eclipse.jetty.http2.HTTP2Session.onConnectionFailure(HTTP2Session.java:573) [http2-common-10.0.20.jar:10.0.20]
  at org.eclipse.jetty.http2.HTTP2Connection.onConnectionFailure(HTTP2Connection.java:303) [http2-common-10.0.20.jar:10.0.20]
  at org.eclipse.jetty.http2.parser.BodyParser.notifyConnectionFailure(BodyParser.java:218) [http2-common-10.0.20.jar:10.0.20]
  at org.eclipse.jetty.http2.parser.BodyParser.connectionFailure(BodyParser.java:210) [http2-common-10.0.20.jar:10.0.20]
  at org.eclipse.jetty.http2.parser.ResetBodyParser.onReset(ResetBodyParser.java:92) [http2-common-10.0.20.jar:10.0.20]
  at org.eclipse.jetty.http2.parser.ResetBodyParser.parse(ResetBodyParser.java:61) [http2-common-10.0.20.jar:10.0.20]
  at org.eclipse.jetty.http2.parser.Parser.parseBody(Parser.java:240) [http2-common-10.0.20.jar:10.0.20]
  at org.eclipse.jetty.http2.parser.Parser.parse(Parser.java:167) [http2-common-10.0.20.jar:10.0.20]
  at org.eclipse.jetty.http2.parser.ServerParser.parse(ServerParser.java:126) [http2-common-10.0.20.jar:10.0.20]
  at org.eclipse.jetty.http2.HTTP2Connection$HTTP2Producer.produce(HTTP2Connection.java:350) [http2-common-10.0.20.jar:10.0.20]
  at org.eclipse.jetty.util.thread.strategy.AdaptiveExecutionStrategy.produceTask(AdaptiveExecutionStrategy.java:455) [jetty-util-10.0.20.jar:10.0.20]
  at org.eclipse.jetty.util.thread.strategy.AdaptiveExecutionStrategy.tryProduce(AdaptiveExecutionStrategy.java:248) [jetty-util-10.0.20.jar:10.0.20]
  at org.eclipse.jetty.util.thread.strategy.AdaptiveExecutionStrategy.produce(AdaptiveExecutionStrategy.java:193) [jetty-util-10.0.20.jar:10.0.20]

Error

org.eclipse.jetty.io.EofException: Close enhance_your_calm_error/ (invalid_rst_stream_frame_rate)

As per RFC https://datatracker.ietf.org/doc/html/rfc9113#name-error-codes

ENHANCE_YOUR_CALM (0x0b): The endpoint detected that its peer is exhibiting a behavior that might be generating excessive load.

RST_STREAM

The client is sending RST_STREAM frames to terminate streams. On the server side there is rate-control code to mitigate the HTTP/2 Rapid Reset attack:

https://github.com/jetty/jetty.project/blob/89c41b2550ed367a25d1664da8843f5a4e1019da/jetty-core/jetty-http2/jetty-http2-common/src/main/java/org/eclipse/jetty/http2/parser/ResetBodyParser.java#L88-L92

private boolean onReset(ByteBuffer buffer, int error)
    {
        ResetFrame frame = new ResetFrame(getStreamId(), error);
        if (!rateControlOnEvent(frame))
            return connectionFailure(buffer, ErrorCode.ENHANCE_YOUR_CALM_ERROR.code, "invalid_rst_stream_frame_rate");
        reset();
        notifyReset(frame);
        return true;
    }

The HTTP/2 Rapid Reset attack

This attack is called Rapid Reset because it relies on an endpoint's ability to send a RST_STREAM frame immediately after sending a request frame, which makes the other endpoint start working and then rapidly resets the request. The request is canceled but leaves the HTTP/2 connection open. For more details: https://cloud.google.com/blog/products/identity-security/how-it-works-the-novel-http2-rapid-reset-ddos-attack

Jetty resolved it here https://github.com/jetty/jetty.project/issues/10679

What is the rate value?

In Solr, we haven't configured any value, so IMO we are using the default value of 128.

https://github.com/jetty/jetty.project/blob/89c41b2550ed367a25d1664da8843f5a4e1019da/jetty-core/jetty-http2/jetty-http2-server/src/main/java/org/eclipse/jetty/http2/server/AbstractHTTP2ServerConnectionFactory.java#L76-L78
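If that server-side limit ever needed adjusting, Jetty 10 exposes it on the HTTP/2 server connection factory. A sketch only, under the assumption that one wires up the connection factory directly (Solr does not currently configure this):

  // assumes org.eclipse.jetty.http2.server.HTTP2CServerConnectionFactory and
  // org.eclipse.jetty.http2.parser.WindowRateControl from Jetty 10
  HTTP2CServerConnectionFactory h2c = new HTTP2CServerConnectionFactory(new HttpConfiguration());
  // the default rate control allows roughly 128 "bad" frame events per second before
  // ENHANCE_YOUR_CALM; widening the window tolerates bursts of RST_STREAM frames
  h2c.setRateControlFactory(new WindowRateControl.Factory(256));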

Next?

IndexFetcher makes a lot of requests to download the whole index via GetStream. What next?

  1. Find out why GetStream is sending too many RST_STREAM frames.
  2. Check out the HTTP2Client parameters to reduce the excessive load on the server.

iamsanjay avatar Mar 28 '24 12:03 iamsanjay

Fix for RST_STREAM!

Upon enabling logs at the "org.eclipse.jetty" level, it was found that IndexFetcher was closing the InputStream prematurely, causing the client to send RST_STREAM frames, flooding the server with reset frames and eventually hitting the ceiling.

The solution involved modifying the IndexFetcher to continue reading from the InputStream until it receives a zero-length packet before closing the stream.
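A minimal sketch of the idea (not the exact IndexFetcher code; the file-transfer framing is simplified and the stream variable is a placeholder):

  // Keep reading until the server signals the end of the file (a zero-length packet / end of
  // stream) instead of closing as soon as the expected bytes have arrived; closing early
  // makes the HTTP/2 client reset the stream.
  byte[] buf = new byte[8192];
  while (in.read(buf) != -1) {
    // drain or process any remaining bytes until end of stream
  }
  in.close();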

PS: Our friends at Jetty helped. Thank you @sbordet!

iamsanjay avatar Mar 30 '24 06:03 iamsanjay

If there's anything I can do to help here, let me know.

dsmiley avatar Apr 13 '24 20:04 dsmiley

Can we remove the Legacy Auth mechanism?

No

In a user-managed cluster, the only communication happening, IMO, is the downloading of the index from one node to another. The PKIAuthenticationPlugin only works in a cloud environment: the condition that encloses the PKI initialization logic checks isZookeeperAware(). Of course, that condition evaluates to false in a user-managed cluster, and therefore the method responsible for setting the auth headers never gets called.

https://github.com/apache/solr/blob/c3c83ffb8dba17dd79f78429df65869d1b7d87bb/solr/core/src/java/org/apache/solr/core/CoreContainer.java#L836-L855

https://github.com/apache/solr/blob/c3c83ffb8dba17dd79f78429df65869d1b7d87bb/solr/core/src/java/org/apache/solr/core/CoreContainer.java#L626-L631

I will be adding a test case to test legacy replication when basic auth is enabled.

iamsanjay avatar Apr 16 '24 10:04 iamsanjay

The test below failed!

./gradlew :solr:core:test --tests "org.apache.solr.cloud.OverseerStatusTest.test" -Ptests.jvms=96 "-Ptests.jvmargs=-XX:TieredStopAtLevel=1 -XX:+UseParallelGC -XX:ActiveProcessorCount=1 -XX:ReservedCodeCacheSize=120m" -Ptests.seed=5C4EADB21CB7FBF1 -Ptests.timeoutSuite=600000! -Ptests.file.encoding=US-ASCII

Exception

 java.lang.NullPointerException
   >         at __randomizedtesting.SeedInfo.seed([5C4EADB21CB7FBF1:D41A9268B24B9609]:0)
   >         at org.apache.solr.cloud.OverseerStatusTest.test(OverseerStatusTest.java:78)
   >         at java.base/jdk.internal.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
   >         at java.base/jdk.internal.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
   >         at java.base/jdk.internal.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
   >         at java.base/java.lang.reflect.Method.invoke(Method.java:566)
   >         at com.carrotsearch.randomizedtesting.RandomizedRunner.invoke(RandomizedRunner.java:1758)
   >         at com.carrotsearch.randomizedtesting.RandomizedRunner$8.evaluate(RandomizedRunner.java:946)
   >         at com.carrotsearch.randomizedtesting.RandomizedRunner$9.evaluate(RandomizedRunner.java:982)
   >         at com.carrotsearch.randomizedtesting.RandomizedRunner$10.evaluate(RandomizedRunner.java:996)
   >         at com.carrotsearch.randomizedtesting.rules.SystemPropertiesRestoreRule$1.evaluate(SystemPropertiesRestoreRule.java:80)
   >         at org.junit.rules.RunRules.evaluate(RunRules.java:20)
   >         at org.apache.lucene.tests.util.TestRuleSetupTeardownChained$1.evaluate(TestRuleSetupTeardownChained.java:48)
   >         at org.apache.lucene.tests.util.AbstractBeforeAfterRule$1.evaluate(AbstractBeforeAfterRule.java:43)
   >         at org.apache.lucene.tests.util.TestRuleThreadAndTestName$1.evaluate(TestRuleThreadAndTestName.java:45)
   >         at org.apache.lucene.tests.util.TestRuleIgnoreAfterMaxFailures$1.evaluate(TestRuleIgnoreAfterMaxFailures.java:60)
   >         at org.apache.lucene.tests.util.TestRuleMarkFailure$1.evaluate(TestRuleMarkFailure.java:44)
   >         at org.junit.rules.RunRules.evaluate(RunRules.java:20)

git bisect 61a67c0

CC @noblepaul

iamsanjay avatar Apr 22 '24 16:04 iamsanjay

The build failed due to one of the flaky test cases, as the test passes when run again.

gradlew :solr:core:test --tests "org.apache.solr.search.TestCoordinatorRole.testNRTRestart" -Ptests.jvms=96 "-Ptests.jvmargs=-XX:TieredStopAtLevel=1 -XX:+UseParallelGC -XX:ActiveProcessorCount=1 -XX:ReservedCodeCacheSize=120m" -Ptests.seed=383FA9310A6D5B37 -Ptests.timeoutSuite=600000! -Ptests.file.encoding=US-ASCII

The error message that we are getting

initial error time threshold exceeded

https://github.com/apache/solr/blob/033be0fc1317b48992fb756aac7f8bd55731fc3b/solr/core/src/test/org/apache/solr/search/TestCoordinatorRole.java#L462-L483

I will try to see if I can get more details on this. Should I open a Jira ticket for it?

One observation: the above test case sets up the cluster outside the try block. Therefore, if any issue occurs during initialization, it will never shut down the cluster.

https://github.com/apache/solr/blob/033be0fc1317b48992fb756aac7f8bd55731fc3b/solr/core/src/test/org/apache/solr/search/TestCoordinatorRole.java#L187-L200

iamsanjay avatar May 01 '24 07:05 iamsanjay

I looked at the test history -- seems flaky. I found a JIRA issue where you can relay these comments.

Looking forward to the lingering small matters being addressed so we can get this merged!

dsmiley avatar May 01 '24 19:05 dsmiley

Thinking out loud! The new logic is a bit ugly because it gives freedom to manipulate the Http2SolrClient directly instead of creating it through the builder pattern (an anti-pattern). The copying of listeners from the older Http2SolrClient to the new one was added only because we need a new Http2SolrClient for the custom timeout configuration of the replication logic.

Is there any way to avoid recreating the Http2SolrClient?

The default socketTimeout and connectionTimeout for the UpdateShardHandler HTTP client are both 60 seconds, and for the IndexFetcher recovery client they are 120 and 60 seconds respectively. Can we deprecate, or basically remove, the older way of setting these timeouts at the ReplicationHandler level and use the UpdateShardHandler config timeouts, i.e. 60 seconds for both socket and connection? @dsmiley @markrmiller

iamsanjay avatar May 08 '24 04:05 iamsanjay

The most important part of Http2SolrClient to be re-used (instead of re-created) is Jetty's HttpClient. Creating another Http2SolrClient that uses an existing HttpClient isn't too bad but yeah it'd be nice if we didn't have to re-create one. This PR is an improvement! IndexFetcher.getLatestVersion, fetchFileList, getStream, and getDetails not only re-use the underlying HttpClient, as before, but now use the SolrClient wrapper, which is more succinct & simple in so doing.

Some thoughts on the commit message summarizing this long one:

  • UpdateShardHandler
    • switch "recoveryOnlyHttpClient" to Http2SolrClient
  • RecoveryStrategy:
    • Use Http2SolrClient
    • Simplify cancelation of prep recovery command
  • IndexFetcher:
    • Use Http2SolrClient
    • Ensure the entire stream is consumed to avoid a connection reset
  • Http2SolrClient:
    • make HttpListenerFactory configurable (used for internal purposes)

It's unclear if there was any change with respect to gzip / "useExternalCompression" -- this part of the diff is confusing.

dsmiley avatar May 08 '24 14:05 dsmiley

Unless I hear to the contrary, I'll commit this tonight like midnight EST. Just "main" for a bit.

The CHANGES.txt is currently:

SOLR-16505: Switch UpdateShardHandler.getRecoveryOnlyHttpClient to Jetty HTTP2

But I want to reword it to: SOLR-16505: Use Jetty HTTP2 for index replication and other "recovery" operations.

That will be more meaningful to users. CHANGES.txt needs to speak to users.

dsmiley avatar May 08 '24 18:05 dsmiley