Benchmark: HTTP2 fix cancel_stream_error
Fixes Jetty HTTP2 cancel_stream_error by removing the rate control limit.
I was encountering this issue with CloudIndexing, especially when using single-node, single-replica, single-shard -- like 30K ops/sec. It'd routinely fail. I bumped the limit to 1000, still hit it sometimes, then bumped it to 2000. I haven't run it much since then, to be honest. Then I saw how to disable it entirely, which looks a little weird TBH (the default method returns a NO_RATE_CONTROL impl) -- see the sketch below.
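For illustration, here's a minimal sketch of what raising or disabling Jetty's HTTP/2 rate control on the server connector could look like, assuming Jetty 10's AbstractHTTP2ServerConnectionFactory / RateControl API; it is not necessarily the exact change in this PR. The empty RateControl.Factory is the "weird" idiom: the interface's default newRateControl(...) returns NO_RATE_CONTROL.

```java
import org.eclipse.jetty.http2.parser.RateControl;
import org.eclipse.jetty.http2.parser.WindowRateControl;
import org.eclipse.jetty.http2.server.HTTP2CServerConnectionFactory;
import org.eclipse.jetty.server.HttpConfiguration;
import org.eclipse.jetty.server.Server;
import org.eclipse.jetty.server.ServerConnector;

public class DisableHttp2RateControl {
  public static void main(String[] args) throws Exception {
    Server server = new Server();
    HTTP2CServerConnectionFactory h2c =
        new HTTP2CServerConnectionFactory(new HttpConfiguration());

    // Option 1: raise the limit (events per second) -- the "bump to 1000/2000" approach.
    h2c.setRateControlFactory(new WindowRateControl.Factory(2000));

    // Option 2: disable rate control entirely. The Factory interface's default
    // newRateControl(...) returns RateControl.NO_RATE_CONTROL, hence this
    // slightly odd-looking empty anonymous factory.
    h2c.setRateControlFactory(new RateControl.Factory() {});

    server.addConnector(new ServerConnector(server, h2c));
    server.start();
  }
}
```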
Could you please walk me through the process of running this specific benchmark that is giving the error -- a command, or anything that helps me reproduce it on my system?
crave run -- ./jmh.sh CloudIndexing -jvmArgsAppend -Dsolr.bench.seed=823716671780822 -p useSmallDocs=true -p nodeCount=1 -p numShards=1 -p numReplicas=1 -I 8
The benchmark module uses JettySolrRunner via MiniClusterState.
Googling for this is inconclusive as to the cause. It seems the server is choosing to cancel the client's stream, but we don't know why. I dug into this before and thought it was a rate limiting mechanism, so this PR disables that. Perhaps the next step in debugging is to ascertain what Jetty config/logging would explain it. We'd want to avoid debug level on everything -- too much output.
I recall encountering a somewhat similar error in the IndexFetcher class when we transitioned to HTTP2. The IndexFetcher class was prematurely closing the client connection and sending a GOAWAY packet to the server. When the flow control limit was reached, it triggered an exception.
I suspect the same issue might be occurring in this case. Unfortunately, I can't run the crave script anymore since upgrading to the new Apple Silicon chip. I have to look into that.
Was there any way to enable more logs for the benchmark? I wonder if it's too much overhead with the current Jetty configuration.
org.apache.solr.client.solrj.impl.BaseHttpSolrClient$RemoteSolrException: Error from server at http://127.0.0.1:57165/solr/testCollection/update?wt=javabin&version=2: Async exception during distributed update: Error from server at http://127.0.0.1:57164/solr: null
request: http://127.0.0.1:57164/solr
Remote error message: Server error writing document id 5234 to the index.
at org.apache.solr.client.solrj.impl.HttpSolrClientBase.processErrorsAndResponse(HttpSolrClientBase.java:280)
at org.apache.solr.client.solrj.impl.Http2SolrClient.processErrorsAndResponse(Http2SolrClient.java:549)
at org.apache.solr.client.solrj.impl.Http2SolrClient.request(Http2SolrClient.java:505)
at org.apache.solr.bench.index.CloudIndexing.indexDoc(CloudIndexing.java:138)
at org.apache.solr.bench.index.jmh_generated.CloudIndexing_indexDoc_jmhTest.indexDoc_thrpt_jmhStub(CloudIndexing_indexDoc_jmhTest.java:242)
at org.apache.solr.bench.index.jmh_generated.CloudIndexing_indexDoc_jmhTest.indexDoc_Throughput(CloudIndexing_indexDoc_jmhTest.java:111)
at java.base/jdk.internal.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at java.base/jdk.internal.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
at java.base/jdk.internal.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.base/java.lang.reflect.Method.invoke(Method.java:566)
at org.openjdk.jmh.runner.BenchmarkHandler$BenchmarkTask.call(BenchmarkHandler.java:527)
Maybe you are unfamiliar with Log4j but look at log4j2-bench.xml in this module.
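To keep the output manageable, one option is to raise the level only for Jetty's HTTP/2 packages rather than enabling debug everywhere. A sketch, assuming the bench JVM uses Log4j2 core as log4j2-bench.xml suggests -- either add a <Logger name="org.eclipse.jetty.http2" level="debug"/> entry to that file, or do it programmatically:

```java
import org.apache.logging.log4j.Level;
import org.apache.logging.log4j.core.config.Configurator;

public class BenchLogTweaks {
  public static void enableHttp2Debug() {
    // DEBUG only for Jetty's HTTP/2 frame/stream handling; everything else keeps
    // the levels configured in log4j2-bench.xml, so the output stays manageable.
    Configurator.setLevel("org.eclipse.jetty.http2", Level.DEBUG);
  }
}
```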
MaxConnectionsPerDestination 8
Benchmark (nodeCount) (numReplicas) (numShards) (preGenerate) (useSmallDocs) Mode Cnt Score Error Units
CloudIndexing.indexDoc 4 1 4 50000 true thrpt 4 16140.035 ± 1902.453 ops/s
CloudIndexing.indexDoc 4 1 4 50000 false thrpt 4 1818.103 ± 42.214 ops/s
CloudIndexing.indexDoc 4 3 4 50000 true thrpt 4 7280.289 ± 1774.366 ops/s
CloudIndexing.indexDoc 4 3 4 50000 false thrpt 4 1546.771 ± 2012.373 ops/s
MaxConnectionsPerDestination 16
Benchmark (nodeCount) (numReplicas) (numShards) (preGenerate) (useSmallDocs) Mode Cnt Score Error Units
CloudIndexing.indexDoc 4 1 4 50000 true thrpt 4 16374.754 ± 1354.421 ops/s
CloudIndexing.indexDoc 4 1 4 50000 false thrpt 4 2053.636 ± 625.117 ops/s
CloudIndexing.indexDoc 4 3 4 50000 true thrpt 4 7919.480 ± 4504.359 ops/s
CloudIndexing.indexDoc 4 3 4 50000 false thrpt 4 1760.070 ± 2226.050 ops/s
MaxConnectionsPerDestination 32
Benchmark (nodeCount) (numReplicas) (numShards) (preGenerate) (useSmallDocs) Mode Cnt Score Error Units
CloudIndexing.indexDoc 4 1 4 50000 true thrpt 4 15566.625 ± 1666.612 ops/s
CloudIndexing.indexDoc 4 1 4 50000 false thrpt 4 2084.775 ± 239.749 ops/s
CloudIndexing.indexDoc 4 3 4 50000 true thrpt 4 7505.612 ± 4698.342 ops/s
CloudIndexing.indexDoc 4 3 4 50000 false thrpt 4 1684.325 ± 2826.760 ops/s
MaxConnectionsPerDestination 64
Benchmark (nodeCount) (numReplicas) (numShards) (preGenerate) (useSmallDocs) Mode Cnt Score Error Units
CloudIndexing.indexDoc 4 1 4 50000 true thrpt 4 15576.925 ± 1727.535 ops/s
CloudIndexing.indexDoc 4 1 4 50000 false thrpt 4 2111.832 ± 189.408 ops/s
CloudIndexing.indexDoc 4 3 4 50000 true thrpt 4 8069.933 ± 4096.516 ops/s
CloudIndexing.indexDoc 4 3 4 50000 false thrpt 4 1706.464 ± 2694.685 ops/s
One thing I noticed: once I increased maxConnectionsPerDestination to 8 and kept increasing it up to 64, I didn't see any exception and was able to run the full benchmark without any issue.
Below I ran it with 2, and the first benchmark immediately failed!
MaxConnectionsPerDestination 2
Benchmark (nodeCount) (numReplicas) (numShards) (preGenerate) (useSmallDocs) Mode Cnt Score Error Units
CloudIndexing.indexDoc 4 1 4 50000 false thrpt 4 1730.811 ± 99.685 ops/s
CloudIndexing.indexDoc 4 3 4 50000 true thrpt 4 8144.242 ± 7693.263 ops/s
MaxConnectionsPerDestination 128
Benchmark (nodeCount) (numReplicas) (numShards) (preGenerate) (useSmallDocs) Mode Cnt Score Error Units
CloudIndexing.indexDoc 4 1 4 50000 true thrpt 4 15310.374 ± 2095.507 ops/s
CloudIndexing.indexDoc 4 1 4 50000 false thrpt 4 2066.279 ± 64.590 ops/s
CloudIndexing.indexDoc 4 3 4 50000 true thrpt 4 7189.295 ± 5208.593 ops/s
CloudIndexing.indexDoc 4 3 4 50000 false thrpt 4 1774.028 ± 2590.350 ops/s
Wow; good find! If setMaxConnectionsPerDestination(1) then what? I wonder what Jetty is supposed to do when this limit is approached -- like is cancel_stream_error expected in some sense or should it be doing something else? I suspect a Jetty issue.
I should mention that fairly soon, in #2410, we'll have maxConnectionsPerDestination set more appropriately (not 4!), and then this issue may no longer show up.
Actually, I am still not convinced that the issue is related to the maxConnectionsPerDestination setting, because when I ran it again and again today I hit the issue with higher values as well, such as 8, 16, and 32. Since a few runs cannot tell which value is better, we could set up a dedicated benchmark that runs with different values of maxConnectionsPerDestination and maxConcurrentStreams (though the latter is a server-level setting, and I am not yet sure how we would configure it in the benchmark); a sketch follows below.
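To make that sweep systematic, the limit could be exposed as a JMH parameter. A rough sketch only -- the maxConnectionPerHost field name mirrors the parameter in the results below, but the system property it writes to and the way the benchmark's client would pick it up are assumptions, not existing CloudIndexing code:

```java
import org.openjdk.jmh.annotations.Param;
import org.openjdk.jmh.annotations.Scope;
import org.openjdk.jmh.annotations.Setup;
import org.openjdk.jmh.annotations.State;

@State(Scope.Benchmark)
public class ConnectionLimitParams {
  // JMH runs the benchmark once per value, so a single run sweeps the setting.
  @Param({"2", "4", "8", "16", "32", "64"})
  int maxConnectionPerHost;

  @Setup
  public void setUp() {
    // Hypothetical property name; the benchmark's client setup would need to read it
    // when building its Jetty-backed HTTP/2 client for the mini cluster.
    System.setProperty("solr.bench.maxConnectionsPerHost",
        Integer.toString(maxConnectionPerHost));
  }
}
```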
There is a possibility that Jetty is cancelling the stream because one of the shards went into recovery!
Server exception => org.apache.solr.common.SolrException: ClusterState says we are the leader (http://127.0.0.1:54734/solr/testCollection_shard4_replica_n4), but locally we don't think so. Request came from http://127.0.0.1:54736/solr/testCollection_shard2_replica_n16/
at org.apache.solr.update.processor.DistributedZkUpdateProcessor.doDefensiveChecks(DistributedZkUpdateProcessor.java:1144)
org.apache.solr.common.SolrException: ClusterState says we are the leader (http://127.0.0.1:54734/solr/testCollection_shard4_replica_n4), but locally we don't think so. Request came from http://127.0.0.1:54736/solr/testCollection_shard2_replica_n16/
request: http://127.0.0.1:54734/solr
Remote error message: ClusterState says we are the leader (http://127.0.0.1:54734/solr/testCollection_shard4_replica_n4), but locally we don't think so. Request came from http://127.0.0.1:54736/solr/testCollection_shard2_replica_n16/
at org.apache.solr.client.solrj.impl.ConcurrentUpdateHttp2SolrClient$Runner.sendUpdateStream(ConcurrentUpdateHttp2SolrClient.java:283) ~[solr-solrj-10.0.0-SNAPSHOT.jar:10.0.0-SNAPSHOT 549b46da0c7030249c81f827b801bb24bd6b57cd [snapshot build, details omitted]]
at org.apache.solr.client.solrj.impl.ConcurrentUpdateHttp2SolrClient$Runner.run(ConcurrentUpdateHttp2SolrClient.java:184) ~[solr-solrj-10.0.0-SNAPSHOT.jar:10.0.0-SNAPSHOT 549b46da0c7030249c81f827b801bb24bd6b57cd [snapshot build, details omitted]]
at com.codahale.metrics.InstrumentedExecutorService$InstrumentedRunnable.run(InstrumentedExecutorService.java:212) ~[metrics-core-4.2.26.jar:4.2.26]
at org.apache.solr.common.util.ExecutorUtil$MDCAwareThreadPoolExecutor.lambda$execute$1(ExecutorUtil.java:449) ~[solr-solrj-10.0.0-SNAPSHOT.jar:10.0.0-SNAPSHOT 549b46da0c7030249c81f827b801bb24bd6b57cd [snapshot build, details omitted]]
at java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1128) [?:?]
at java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:628) [?:?]
at java.base/java.lang.Thread.run(Thread.java:829) [?:?]
2024-09-20 21:58:32.200 ERROR (updateExecutor-38-thread-11-processing-testCollection_shard3_replica_n2 null-30262 core_node5 null 127.0.0.1:54735_solr testCollection shard3) [c:testCollection s:shard3 r:core_node5 x:testCollection_shard3_replica_n2] o.a.s.u.ErrorReportingConcurrentUpdateSolrClient
Error when calling SolrCmdDistributor$Req: cmd=add{,id=(null)}; node=ForwardNode: http://127.0.0.1:54734/solr/testCollection_shard4_replica_n4/ to http://127.0.0.1:54734/solr/testCollection_shard4_replica_n4/ => org.apache.solr.client.solrj.impl.BaseHttpSolrClient$RemoteSolrException: Error from server at http://127.0.0.1:54734/solr: null
This one looks different!
Client exception => org.apache.solr.common.SolrException: can not use FieldCache on a field w/o docValues unless it is indexed uninvertible=true and the type supports Uninversion: _version_
at org.apache.solr.schema.SchemaField.checkFieldCacheSource(SchemaField.java:295)
org.apache.solr.common.SolrException: can not use FieldCache on a field w/o docValues unless it is indexed uninvertible=true and the type supports Uninversion: _version_
We can test a setup where we are trying to index a document while a shard is down (rewriting the ZooKeeper cluster state) and see if we get the same cancel_stream error.
Also, in some cases I encountered an Out of Memory exception, suggesting we might need to scale it down a bit, as the current setup seems somewhat unstable.
This one looks different!
Client exception => org.apache.solr.common.SolrException: can not use FieldCache on a field w/o docValues unless it is indexed uninvertible=true and the type supports Uninversion: _version_
at org.apache.solr.schema.SchemaField.checkFieldCacheSource(SchemaField.java:295)
org.apache.solr.common.SolrException: can not use FieldCache on a field w/o docValues unless it is indexed uninvertible=true and the type supports Uninversion: _version_
This error is not related; just make sure your _version_ field has docValues enabled.
There is a possibility that Jetty is cancelling the stream because one of the shards went into recovery!
I'm surprised/suspicious that app-level concerns/errors (Solr returning an HTTP error to the client) would result in a lower-level HTTP/2 stream cancellation of some kind.
We are using Jetty 10, right? https://github.com/jetty/jetty.project/issues/10485
Yes; our versions are in versions.props
Latest benchmark with different values for maxConnectionsPerDestination
Benchmark (maxConnectionPerHost) (nodeCount) (numReplicas) (numShards) (preGenerate) (useSmallDocs) Mode Cnt Score Error Units
CloudIndexing.indexDoc 1 4 3 4 50000 false thrpt 4 1118.424 ± 847.096 ops/s
CloudIndexing.indexDoc 2 4 3 4 50000 false thrpt 4 1701.146 ± 1943.553 ops/s
CloudIndexing.indexDoc 4 4 3 4 50000 false thrpt 4 1726.712 ± 2448.684 ops/s
CloudIndexing.indexDoc 8 4 3 4 50000 false thrpt 4 1650.018 ± 2238.615 ops/s
CloudIndexing.indexDoc 16 4 3 4 50000 false thrpt 4 1651.497 ± 2117.253 ops/s
CloudIndexing.indexDoc 32 4 3 4 50000 false thrpt 4 1655.557 ± 2347.372 ops/s
CloudIndexing.indexDoc 64 4 3 4 50000 false thrpt 4 1695.647 ± 2139.217 ops/s
CloudIndexing.indexDoc 128 4 3 4 50000 false thrpt 4 1737.919 ± 2381.445 ops/s
It's a good improvement as we increase maxConnectionsPerDestination.
The pattern I see so far is that if we increase maxConnectionsPerDestination then yes, there are failures, but all of them are related to an Out of Memory exception rather than the cancel stream error.
Clearly, increasing maxConnectionsPerDestination also requires more memory, which is understandable.
EDIT
Wait, I was looking at the wrong value; actually there is no improvement as we increase maxConnectionsPerDestination. Nothing!
There should be a way to provide some sort of setting by which the user can control this and then play with it. We do have maxConnectionsPerDestination, but that is used by the HTTP/1 connections, and its current value is 100000, which in my opinion would be too much for HTTP/2 while we are still experimenting with it.
We could also introduce another setting for HTTP/2 in UpdateShardHandler; a rough sketch follows below.
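Purely as a hypothetical sketch of that idea -- the property name and the wiring into UpdateShardHandler are assumptions, not existing Solr configuration -- the knob would ultimately land on Jetty's standard HttpClient.setMaxConnectionsPerDestination:

```java
import org.eclipse.jetty.client.HttpClient;
import org.eclipse.jetty.http2.client.HTTP2Client;
import org.eclipse.jetty.http2.client.http.HttpClientTransportOverHTTP2;

public class Http2ClientTuning {
  // Hypothetical property; defaults to a modest per-destination connection count.
  private static final int MAX_CONNECTIONS =
      Integer.getInteger("solr.http2.maxConnectionsPerDestination", 8);

  public static HttpClient newTunedHttp2Client() throws Exception {
    HttpClient client = new HttpClient(new HttpClientTransportOverHTTP2(new HTTP2Client()));
    // setMaxConnectionsPerDestination is a standard Jetty HttpClient setter; for HTTP/2 it
    // caps physical connections per host, with streams multiplexed on each connection.
    client.setMaxConnectionsPerDestination(MAX_CONNECTIONS);
    client.start();
    return client;
  }
}
```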
Yes; our versions are in versions.props
I inquired about the Jetty version because I posted a question regarding the cancel_stream error on the Jetty forum, and they informed me that Jetty 10 is at end of support: "Please try Jetty 12 and report back if the problem still happens."
Note that Jetty 12 requires JDK 17!
Benchmark (maxConnectionPerHost) (nodeCount) (numReplicas) (numShards) (preGenerate) (useSmallDocs) Mode Cnt Score Error Units
CloudIndexing.indexDoc 1 4 3 4 50000 false thrpt 4 1843.723 ± 2943.271 ops/s
CloudIndexing.indexDoc 2 4 3 4 50000 false thrpt 4 1756.116 ± 2323.291 ops/s
CloudIndexing.indexDoc 4 4 3 4 50000 false thrpt 4 1693.267 ± 2690.380 ops/s
CloudIndexing.indexDoc 8 4 3 4 50000 false thrpt 4 1598.918 ± 2505.014 ops/s
CloudIndexing.indexDoc 16 4 3 4 50000 false thrpt 4 1597.113 ± 2100.961 ops/s
CloudIndexing.indexDoc 32 4 3 4 50000 false thrpt 4 1694.114 ± 2307.802 ops/s
CloudIndexing.indexDoc 64 4 3 4 50000 false thrpt 4 1622.504 ± 2663.477 ops/s
CloudIndexing.indexDoc 128 4 3 4 50000 false thrpt 4 1633.025 ± 2277.204 ops/s
CloudIndexing.indexDoc 256 4 3 4 50000 false thrpt 4 1514.132 ± 2218.589 ops/s
CloudIndexing.indexDoc 512 4 3 4 50000 false thrpt 4 1645.532 ± 2375.068 ops/s
CloudIndexing.indexDoc 1024 4 3 4 50000 false thrpt 4 1590.969 ± 2128.108 ops/s
Thanks for getting those numbers; very interesting that it had no effect.
This is a low-priority thing; feel free to abandon this until we get to Jetty 12 and JDK 17. In the meantime, we know maxConnectionsPerDestination is misconfigured to be too low; it will be increased by default quite a bit -- Houston, I'm looking at you :-)