Benchmark: HTTP2 fix cancel_stream_error
Fixes Jetty HTTP2 cancel_stream_error by removing the rate control limit.
I was encountering this issue with CloudIndexing, especially when using single-node, single-replica, single-shard -- like 30K ops/sec. It'd routinely fail. I bumped the limit to 1000, still hit it sometimes, then bumped it to 2000. I haven't run it much since then, to be honest. Then I saw how to disable it entirely, which looks a little weird TBH (the default method returns a NO_RATE_CONTROL impl) -- see the sketch below.
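For illustration, here's a minimal sketch of what raising or disabling Jetty's HTTP/2 rate control on the server connector could look like, assuming Jetty 10's AbstractHTTP2ServerConnectionFactory / RateControl API; it is not necessarily the exact change in this PR. The empty RateControl.Factory is the "weird" idiom: the interface's default newRateControl(...) returns NO_RATE_CONTROL.

```java
import org.eclipse.jetty.http2.parser.RateControl;
import org.eclipse.jetty.http2.parser.WindowRateControl;
import org.eclipse.jetty.http2.server.HTTP2CServerConnectionFactory;
import org.eclipse.jetty.server.HttpConfiguration;
import org.eclipse.jetty.server.Server;
import org.eclipse.jetty.server.ServerConnector;

public class DisableHttp2RateControl {
  public static void main(String[] args) throws Exception {
    Server server = new Server();
    HTTP2CServerConnectionFactory h2c =
        new HTTP2CServerConnectionFactory(new HttpConfiguration());

    // Option 1: raise the limit (events per second) -- the "bump to 1000/2000" approach.
    h2c.setRateControlFactory(new WindowRateControl.Factory(2000));

    // Option 2: disable rate control entirely. The Factory interface's default
    // newRateControl(...) returns RateControl.NO_RATE_CONTROL, hence this
    // slightly odd-looking empty anonymous factory.
    h2c.setRateControlFactory(new RateControl.Factory() {});

    server.addConnector(new ServerConnector(server, h2c));
    server.start();
  }
}
```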
Could you please walk me through the process of running this specific benchmark that is giving the error -- a command, or anything that helps me reproduce it on my system?
crave run -- ./jmh.sh CloudIndexing -jvmArgsAppend -Dsolr.bench.seed=823716671780822 -p useSmallDocs=true -p nodeCount=1 -p numShards=1 -p numReplicas=1 -I 8
The benchmark module uses JettySolrRunner via MiniClusterState.
Googling for this is inconclusive as to the cause. It seems the server is choosing to cancel the client's stream, but we don't know why. I dug into this before and thought it was a rate limiting mechanism, so this PR disables that. Perhaps the next step in debugging is to ascertain what Jetty config/logging would explain it. We'd want to avoid debug level on everything -- too much output.
I recall encountering a somewhat similar error in the IndexFetcher class when we transitioned to HTTP2. The IndexFetcher class was prematurely closing the client connection and sending a GOAWAY packet to the server. When the flow control limit was reached, it triggered an exception.
I suspect the same issue might be occurring in this case. Unfortunately, I can't run the crave script anymore since upgrading to the new Apple Silicon chip. I have to look into that.
Was there any way to enable more logs for the benchmark? I wonder if it's too much overhead with the current Jetty configuration.
org.apache.solr.client.solrj.impl.BaseHttpSolrClient$RemoteSolrException: Error from server at http://127.0.0.1:57165/solr/testCollection/update?wt=javabin&version=2: Async exception during distributed update: Error from server at http://127.0.0.1:57164/solr: null
request: http://127.0.0.1:57164/solr
Remote error message: Server error writing document id 5234 to the index.
at org.apache.solr.client.solrj.impl.HttpSolrClientBase.processErrorsAndResponse(HttpSolrClientBase.java:280)
at org.apache.solr.client.solrj.impl.Http2SolrClient.processErrorsAndResponse(Http2SolrClient.java:549)
at org.apache.solr.client.solrj.impl.Http2SolrClient.request(Http2SolrClient.java:505)
at org.apache.solr.bench.index.CloudIndexing.indexDoc(CloudIndexing.java:138)
at org.apache.solr.bench.index.jmh_generated.CloudIndexing_indexDoc_jmhTest.indexDoc_thrpt_jmhStub(CloudIndexing_indexDoc_jmhTest.java:242)
at org.apache.solr.bench.index.jmh_generated.CloudIndexing_indexDoc_jmhTest.indexDoc_Throughput(CloudIndexing_indexDoc_jmhTest.java:111)
at java.base/jdk.internal.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at java.base/jdk.internal.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
at java.base/jdk.internal.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.base/java.lang.reflect.Method.invoke(Method.java:566)
at org.openjdk.jmh.runner.BenchmarkHandler$BenchmarkTask.call(BenchmarkHandler.java:527)
Maybe you are unfamiliar with Log4j but look at log4j2-bench.xml in this module.
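To keep the output manageable, one option is to raise the level only for Jetty's HTTP/2 packages rather than enabling debug everywhere. A sketch, assuming the bench JVM uses Log4j2 core as log4j2-bench.xml suggests -- either add a <Logger name="org.eclipse.jetty.http2" level="debug"/> entry to that file, or do it programmatically:

```java
import org.apache.logging.log4j.Level;
import org.apache.logging.log4j.core.config.Configurator;

public class BenchLogTweaks {
  public static void enableHttp2Debug() {
    // DEBUG only for Jetty's HTTP/2 frame/stream handling; everything else keeps
    // the levels configured in log4j2-bench.xml, so the output stays manageable.
    Configurator.setLevel("org.eclipse.jetty.http2", Level.DEBUG);
  }
}
```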
MaxConnectionsPerDestination 8
Benchmark (nodeCount) (numReplicas) (numShards) (preGenerate) (useSmallDocs) Mode Cnt Score Error Units
CloudIndexing.indexDoc 4 1 4 50000 true thrpt 4 16140.035 ± 1902.453 ops/s
CloudIndexing.indexDoc 4 1 4 50000 false thrpt 4 1818.103 ± 42.214 ops/s
CloudIndexing.indexDoc 4 3 4 50000 true thrpt 4 7280.289 ± 1774.366 ops/s
CloudIndexing.indexDoc 4 3 4 50000 false thrpt 4 1546.771 ± 2012.373 ops/s
MaxConnectionsPerDestination 16
Benchmark (nodeCount) (numReplicas) (numShards) (preGenerate) (useSmallDocs) Mode Cnt Score Error Units
CloudIndexing.indexDoc 4 1 4 50000 true thrpt 4 16374.754 ± 1354.421 ops/s
CloudIndexing.indexDoc 4 1 4 50000 false thrpt 4 2053.636 ± 625.117 ops/s
CloudIndexing.indexDoc 4 3 4 50000 true thrpt 4 7919.480 ± 4504.359 ops/s
CloudIndexing.indexDoc 4 3 4 50000 false thrpt 4 1760.070 ± 2226.050 ops/s
MaxConnectionsPerDestination 32
Benchmark (nodeCount) (numReplicas) (numShards) (preGenerate) (useSmallDocs) Mode Cnt Score Error Units
CloudIndexing.indexDoc 4 1 4 50000 true thrpt 4 15566.625 ± 1666.612 ops/s
CloudIndexing.indexDoc 4 1 4 50000 false thrpt 4 2084.775 ± 239.749 ops/s
CloudIndexing.indexDoc 4 3 4 50000 true thrpt 4 7505.612 ± 4698.342 ops/s
CloudIndexing.indexDoc 4 3 4 50000 false thrpt 4 1684.325 ± 2826.760 ops/s
MaxConnectionsPerDestination 64
Benchmark (nodeCount) (numReplicas) (numShards) (preGenerate) (useSmallDocs) Mode Cnt Score Error Units
CloudIndexing.indexDoc 4 1 4 50000 true thrpt 4 15576.925 ± 1727.535 ops/s
CloudIndexing.indexDoc 4 1 4 50000 false thrpt 4 2111.832 ± 189.408 ops/s
CloudIndexing.indexDoc 4 3 4 50000 true thrpt 4 8069.933 ± 4096.516 ops/s
CloudIndexing.indexDoc 4 3 4 50000 false thrpt 4 1706.464 ± 2694.685 ops/s
One thing I noticed: once I increased maxConnectionsPerDestination to 8 and kept increasing it up to 64, I didn't see any exception and was able to run the full benchmark without any issue.
Below I ran it with 2, and the first benchmark immediately failed!
MaxConnectionsPerDestination 2
Benchmark (nodeCount) (numReplicas) (numShards) (preGenerate) (useSmallDocs) Mode Cnt Score Error Units
CloudIndexing.indexDoc 4 1 4 50000 false thrpt 4 1730.811 ± 99.685 ops/s
CloudIndexing.indexDoc 4 3 4 50000 true thrpt 4 8144.242 ± 7693.263 ops/s
MaxConnectionsPerDestination 128
Benchmark (nodeCount) (numReplicas) (numShards) (preGenerate) (useSmallDocs) Mode Cnt Score Error Units
CloudIndexing.indexDoc 4 1 4 50000 true thrpt 4 15310.374 ± 2095.507 ops/s
CloudIndexing.indexDoc 4 1 4 50000 false thrpt 4 2066.279 ± 64.590 ops/s
CloudIndexing.indexDoc 4 3 4 50000 true thrpt 4 7189.295 ± 5208.593 ops/s
CloudIndexing.indexDoc 4 3 4 50000 false thrpt 4 1774.028 ± 2590.350 ops/s
Wow; good find! If setMaxConnectionsPerDestination(1) then what? I wonder what Jetty is supposed to do when this limit is approached -- like is cancel_stream_error expected in some sense or should it be doing something else? I suspect a Jetty issue.
I should mention that fairly soon, in #2410, we'll have maxConnectionsPerDestination set more appropriately (not 4!), and then this issue may no longer show up.
Actually, I am still not convinced that the issue is related to the maxConnectionsPerDestination setting, because when I ran it again and again today I hit the issue with higher values as well, such as 8, 16, and 32. Since a few runs cannot tell which value is better, we could set up a dedicated benchmark that runs with different values of maxConnectionsPerDestination and maxConcurrentStreams (though the latter is a server-level setting, and I am not yet sure how we would configure it in the benchmark); a sketch follows below.
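To make that sweep systematic, the limit could be exposed as a JMH parameter. A rough sketch only -- the maxConnectionPerHost field name mirrors the parameter in the results below, but the system property it writes to and the way the benchmark's client would pick it up are assumptions, not existing CloudIndexing code:

```java
import org.openjdk.jmh.annotations.Param;
import org.openjdk.jmh.annotations.Scope;
import org.openjdk.jmh.annotations.Setup;
import org.openjdk.jmh.annotations.State;

@State(Scope.Benchmark)
public class ConnectionLimitParams {
  // JMH runs the benchmark once per value, so a single run sweeps the setting.
  @Param({"2", "4", "8", "16", "32", "64"})
  int maxConnectionPerHost;

  @Setup
  public void setUp() {
    // Hypothetical property name; the benchmark's client setup would need to read it
    // when building its Jetty-backed HTTP/2 client for the mini cluster.
    System.setProperty("solr.bench.maxConnectionsPerHost",
        Integer.toString(maxConnectionPerHost));
  }
}
```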
There is a possibility that Jetty is cancelling the stream because one of the shards went into recovery!
Server exception => org.apache.solr.common.SolrException: ClusterState says we are the leader (http://127.0.0.1:54734/solr/testCollection_shard4_replica_n4), but locally we don't think so. Request came from http://127.0.0.1:54736/solr/testCollection_shard2_replica_n16/
at org.apache.solr.update.processor.DistributedZkUpdateProcessor.doDefensiveChecks(DistributedZkUpdateProcessor.java:1144)
org.apache.solr.common.SolrException: ClusterState says we are the leader (http://127.0.0.1:54734/solr/testCollection_shard4_replica_n4), but locally we don't think so. Request came from http://127.0.0.1:54736/solr/testCollection_shard2_replica_n16/
request: http://127.0.0.1:54734/solr
Remote error message: ClusterState says we are the leader (http://127.0.0.1:54734/solr/testCollection_shard4_replica_n4), but locally we don't think so. Request came from http://127.0.0.1:54736/solr/testCollection_shard2_replica_n16/
at org.apache.solr.client.solrj.impl.ConcurrentUpdateHttp2SolrClient$Runner.sendUpdateStream(ConcurrentUpdateHttp2SolrClient.java:283) ~[solr-solrj-10.0.0-SNAPSHOT.jar:10.0.0-SNAPSHOT 549b46da0c7030249c81f827b801bb24bd6b57cd [snapshot build, details omitted]]
at org.apache.solr.client.solrj.impl.ConcurrentUpdateHttp2SolrClient$Runner.run(ConcurrentUpdateHttp2SolrClient.java:184) ~[solr-solrj-10.0.0-SNAPSHOT.jar:10.0.0-SNAPSHOT 549b46da0c7030249c81f827b801bb24bd6b57cd [snapshot build, details omitted]]
at com.codahale.metrics.InstrumentedExecutorService$InstrumentedRunnable.run(InstrumentedExecutorService.java:212) ~[metrics-core-4.2.26.jar:4.2.26]
at org.apache.solr.common.util.ExecutorUtil$MDCAwareThreadPoolExecutor.lambda$execute$1(ExecutorUtil.java:449) ~[solr-solrj-10.0.0-SNAPSHOT.jar:10.0.0-SNAPSHOT 549b46da0c7030249c81f827b801bb24bd6b57cd [snapshot build, details omitted]]
at java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1128) [?:?]
at java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:628) [?:?]
at java.base/java.lang.Thread.run(Thread.java:829) [?:?]
2024-09-20 21:58:32.200 ERROR (updateExecutor-38-thread-11-processing-testCollection_shard3_replica_n2 null-30262 core_node5 null 127.0.0.1:54735_solr testCollection shard3) [c:testCollection s:shard3 r:core_node5 x:testCollection_shard3_replica_n2] o.a.s.u.ErrorReportingConcurrentUpdateSolrClient
Error when calling SolrCmdDistributor$Req: cmd=add{,id=(null)}; node=ForwardNode: http://127.0.0.1:54734/solr/testCollection_shard4_replica_n4/ to http://127.0.0.1:54734/solr/testCollection_shard4_replica_n4/ => org.apache.solr.client.solrj.impl.BaseHttpSolrClient$RemoteSolrException: Error from server at http://127.0.0.1:54734/solr: null
This one looks different!
Client exception => org.apache.solr.common.SolrException: can not use FieldCache on a field w/o docValues unless it is indexed uninvertible=true and the type supports Uninversion: _version_
at org.apache.solr.schema.SchemaField.checkFieldCacheSource(SchemaField.java:295)
org.apache.solr.common.SolrException: can not use FieldCache on a field w/o docValues unless it is indexed uninvertible=true and the type supports Uninversion: _version_
We can test a setup where we are trying to index a document while a shard is down (rewriting the ZooKeeper cluster state) and see if we get the same cancel_stream error.
Also, in some cases I encountered an Out of Memory exception, suggesting we might need to scale it down a bit, as the current setup seems somewhat unstable.
This one looks different!
Client exception => org.apache.solr.common.SolrException: can not use FieldCache on a field w/o docValues unless it is indexed uninvertible=true and the type supports Uninversion: _version_
at org.apache.solr.schema.SchemaField.checkFieldCacheSource(SchemaField.java:295)
org.apache.solr.common.SolrException: can not use FieldCache on a field w/o docValues unless it is indexed uninvertible=true and the type supports Uninversion: _version_
This error is not related; just make sure your _version_ field has docValues enabled.
There is a possibility that Jetty is cancelling the stream because one of the shards went into recovery!
I'm surprised/suspicious that app-level concerns/errors (Solr returning an HTTP error to the client) would result in a lower-level HTTP/2 stream cancellation of some kind.
We are using Jetty 10, right? https://github.com/jetty/jetty.project/issues/10485
Yes; our versions are in versions.props
Latest benchmark with different values for maxConnectionsPerDestination
Benchmark (maxConnectionPerHost) (nodeCount) (numReplicas) (numShards) (preGenerate) (useSmallDocs) Mode Cnt Score Error Units
CloudIndexing.indexDoc 1 4 3 4 50000 false thrpt 4 1118.424 ± 847.096 ops/s
CloudIndexing.indexDoc 2 4 3 4 50000 false thrpt 4 1701.146 ± 1943.553 ops/s
CloudIndexing.indexDoc 4 4 3 4 50000 false thrpt 4 1726.712 ± 2448.684 ops/s
CloudIndexing.indexDoc 8 4 3 4 50000 false thrpt 4 1650.018 ± 2238.615 ops/s
CloudIndexing.indexDoc 16 4 3 4 50000 false thrpt 4 1651.497 ± 2117.253 ops/s
CloudIndexing.indexDoc 32 4 3 4 50000 false thrpt 4 1655.557 ± 2347.372 ops/s
CloudIndexing.indexDoc 64 4 3 4 50000 false thrpt 4 1695.647 ± 2139.217 ops/s
CloudIndexing.indexDoc 128 4 3 4 50000 false thrpt 4 1737.919 ± 2381.445 ops/s
It's a good improvement as we increase maxConnectionsPerDestination.
The pattern I see so far is that if we increase maxConnectionsPerDestination then yes, there are failures, but all of them are related to an Out of Memory exception rather than the cancel stream error.
Clearly, increasing maxConnectionsPerDestination also requires more memory, which is understandable.
EDIT
Wait, I was looking at the wrong value; actually there is no improvement as we increase maxConnectionsPerDestination. Nothing!
There should be a way to provide some sort of setting by which the user can control this and then play with it. We do have maxConnectionsPerDestination, but that is used by the HTTP/1 connections, and its current value is 100000, which in my opinion would be too much for HTTP/2 while we are still experimenting with it.
We could also introduce another setting for HTTP/2 in UpdateShardHandler; a rough sketch follows below.
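Purely as a hypothetical sketch of that idea -- the property name and the wiring into UpdateShardHandler are assumptions, not existing Solr configuration -- the knob would ultimately land on Jetty's standard HttpClient.setMaxConnectionsPerDestination:

```java
import org.eclipse.jetty.client.HttpClient;
import org.eclipse.jetty.http2.client.HTTP2Client;
import org.eclipse.jetty.http2.client.http.HttpClientTransportOverHTTP2;

public class Http2ClientTuning {
  // Hypothetical property; defaults to a modest per-destination connection count.
  private static final int MAX_CONNECTIONS =
      Integer.getInteger("solr.http2.maxConnectionsPerDestination", 8);

  public static HttpClient newTunedHttp2Client() throws Exception {
    HttpClient client = new HttpClient(new HttpClientTransportOverHTTP2(new HTTP2Client()));
    // setMaxConnectionsPerDestination is a standard Jetty HttpClient setter; for HTTP/2 it
    // caps physical connections per host, with streams multiplexed on each connection.
    client.setMaxConnectionsPerDestination(MAX_CONNECTIONS);
    client.start();
    return client;
  }
}
```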
Yes; our versions are in versions.props
I inquired about the Jetty version because I posted a question regarding the cancel_stream error on the Jetty forum, and they informed me that Jetty 10 is at end of support: "Please try Jetty 12 and report back if the problem still happens."
Note that Jetty 12 requires JDK 17!
Benchmark (maxConnectionPerHost) (nodeCount) (numReplicas) (numShards) (preGenerate) (useSmallDocs) Mode Cnt Score Error Units
CloudIndexing.indexDoc 1 4 3 4 50000 false thrpt 4 1843.723 ± 2943.271 ops/s
CloudIndexing.indexDoc 2 4 3 4 50000 false thrpt 4 1756.116 ± 2323.291 ops/s
CloudIndexing.indexDoc 4 4 3 4 50000 false thrpt 4 1693.267 ± 2690.380 ops/s
CloudIndexing.indexDoc 8 4 3 4 50000 false thrpt 4 1598.918 ± 2505.014 ops/s
CloudIndexing.indexDoc 16 4 3 4 50000 false thrpt 4 1597.113 ± 2100.961 ops/s
CloudIndexing.indexDoc 32 4 3 4 50000 false thrpt 4 1694.114 ± 2307.802 ops/s
CloudIndexing.indexDoc 64 4 3 4 50000 false thrpt 4 1622.504 ± 2663.477 ops/s
CloudIndexing.indexDoc 128 4 3 4 50000 false thrpt 4 1633.025 ± 2277.204 ops/s
CloudIndexing.indexDoc 256 4 3 4 50000 false thrpt 4 1514.132 ± 2218.589 ops/s
CloudIndexing.indexDoc 512 4 3 4 50000 false thrpt 4 1645.532 ± 2375.068 ops/s
CloudIndexing.indexDoc 1024 4 3 4 50000 false thrpt 4 1590.969 ± 2128.108 ops/s
Thanks for getting those numbers; very interesting that it had no effect.
This is a low-priority thing; feel free to abandon this until we get to Jetty 12 and JDK 17. In the meantime, we know maxConnectionsPerDestination is misconfigured to be too low; it will be increased by default quite a bit -- Houston, I'm looking at you :-)