solr-operator

Scaling Down Causes "Down" Replicas

Open Kamalsaiperla opened this issue 11 months ago • 3 comments

Environment
- Solr Operator Version: 0.8.1 → 0.9.0 (same issue)
- Solr Image Version: 9.6.1
- Platform: GKE
- Custom Plugins: Yes
- HPA Configuration: Configured for CPU-based scaling
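
For context, here is a minimal sketch of the kind of CPU-based HPA described above. It is not the reporter's actual manifest: the SolrCloud name `search` and namespace `csr` are taken from the operator log further down, while the HPA name and minReplicas are placeholders. The Solr Operator exposes the scale subresource on SolrCloud, so the HPA can target it directly; if I recall correctly, the operator's `spec.scaling.vacatePodsOnScaleDown` option (default true) is what drives the `move-replicas` work that shows up later in the operator log.

```bash
# Hypothetical HPA; only "search"/"csr" come from this report, the rest is assumed.
kubectl apply -f - <<'EOF'
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: search-solr-hpa        # placeholder name
  namespace: csr
spec:
  scaleTargetRef:
    apiVersion: solr.apache.org/v1beta1
    kind: SolrCloud
    name: search
  minReplicas: 3               # assumed; not stated in the report
  maxReplicas: 10
  metrics:
    - type: Resource
      resource:
        name: cpu
        target:
          type: Utilization
          averageUtilization: 80   # the report toggles this between 10 and 80
EOF
```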

Issue Description
When scaling up (averageUtilization=10%), Solr pods successfully scale to the maxReplicas (10) without issues. However, when scaling down (averageUtilization=80%), Solr does not reduce the number of pods, and several shards show "Down" replicas.

Steps to Reproduce
1. Deploy Solr Operator (0.8.1, later tested with 0.9.0) with Solr 9.6.1.
2. Configure an HPA with CPU-based scaling.
3. Create collections and insert documents.
4. Test 1: Decrease averageUtilization to 10% → pods scale up to 10 (expected behavior).
5. Test 2: Increase averageUtilization to 80% → pods do not scale down, and some shards show "Down" replicas (an example patch for switching the target is sketched below).
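
For the two tests above, a hedged example of switching the CPU target (the HPA name matches the placeholder in the sketch under "Environment" and is not from the report):

```bash
# Hypothetical: flip the HPA CPU target from 10% to 80%.
kubectl -n csr patch hpa search-solr-hpa --type=json \
  -p='[{"op":"replace","path":"/spec/metrics/0/resource/target/averageUtilization","value":80}]'
```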

Expected Behavior
When increasing averageUtilization, pods should scale down as per HPA settings. Shards should not end up in "Down" state.

Observed Behavior
Pods remain at max (10). Some shards have "Down" replicas.
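
One way to enumerate the replicas behind the "Down" state is the Collections API CLUSTERSTATUS action. A sketch, assuming a Solr pod is reachable locally (the default pod port is 8983; the node URLs in the logs show port 80, so adjust if podPort was overridden) and that the collection named in the logs is the affected one:

```bash
# List non-active replicas of the affected collection (hedged diagnostic sketch).
kubectl -n csr port-forward search-solrcloud-0 8983:8983 &
curl -s 'http://localhost:8983/solr/admin/collections?action=CLUSTERSTATUS&collection=l5RecommendationCollection' \
  | jq '.cluster.collections[].shards[].replicas[] | select(.state != "active") | {core, node_name, state}'
```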

Additional Information
Upgrading the Solr Operator from 0.8.1 to 0.9.0 did not resolve the issue.

Screenshots: five screenshots were attached to the original issue.

Logs: 2025-01-30 16:58:24.643 ERROR (qtp1155769010-5575-search-solrcloud-4.csr-58880) [c:l5RecommendationCollection s:shard2 r:core_node502 x:l5RecommendationCollection_shard2_replica_n501 t:search-solrcloud-4.csr-58880] o.a.s.u.UpdateLog Exception reading versions from log => java.io.EOFException at org.apache.solr.common.util.FastInputStream.readUnsignedByte(FastInputStream.java:79) java.io.EOFException: null at org.apache.solr.common.util.FastInputStream.readUnsignedByte(FastInputStream.java:79) ~[?:?] at org.apache.solr.common.util.FastInputStream.readInt(FastInputStream.java:239) ~[?:?] at org.apache.solr.update.TransactionLog$FSReverseReader.<init>(TransactionLog.java:889) ~[?:?] at org.apache.solr.update.TransactionLog.getReverseReader(TransactionLog.java:705) ~[?:?] at org.apache.solr.update.UpdateLog$RecentUpdates.update(UpdateLog.java:1613) ~[?:?] at org.apache.solr.update.UpdateLog$RecentUpdates.<init>(UpdateLog.java:1528) ~[?:?] at org.apache.solr.update.UpdateLog.getRecentUpdates(UpdateLog.java:1727) ~[?:?] at org.apache.solr.handler.component.RealTimeGetComponent.processGetVersions(RealTimeGetComponent.java:1262) ~[?:?] at org.apache.solr.handler.component.RealTimeGetComponent.process(RealTimeGetComponent.java:161) ~[?:?] at org.apache.solr.handler.component.SearchHandler.handleRequestBody(SearchHandler.java:465) ~[?:?] at org.apache.solr.handler.RealTimeGetHandler.handleRequestBody(RealTimeGetHandler.java:43) ~[?:?] at org.apache.solr.handler.RequestHandlerBase.handleRequest(RequestHandlerBase.java:226) ~[?:?] at org.apache.solr.core.SolrCore.execute(SolrCore.java:2886) ~[?:?] at org.apache.solr.servlet.HttpSolrCall.executeCoreRequest(HttpSolrCall.java:910) ~[?:?] at org.apache.solr.servlet.HttpSolrCall.call(HttpSolrCall.java:596) ~[?:?] at org.apache.solr.servlet.SolrDispatchFilter.dispatch(SolrDispatchFilter.java:262) ~[?:?] at org.apache.solr.servlet.SolrDispatchFilter.lambda$doFilter$0(SolrDispatchFilter.java:219) ~[?:?] at org.apache.solr.servlet.ServletUtils.traceHttpRequestExecution2(ServletUtils.java:249) ~[?:?] at org.apache.solr.servlet.ServletUtils.rateLimitRequest(ServletUtils.java:215) ~[?:?] at org.apache.solr.servlet.SolrDispatchFilter.doFilter(SolrDispatchFilter.java:213) ~[?:?] at org.apache.solr.servlet.SolrDispatchFilter.doFilter(SolrDispatchFilter.java:195) ~[?:?] 
at org.eclipse.jetty.servlet.FilterHolder.doFilter(FilterHolder.java:210) ~[jetty-servlet-10.0.20.jar:10.0.20] at org.eclipse.jetty.servlet.ServletHandler$Chain.doFilter(ServletHandler.java:1635) ~[jetty-servlet-10.0.20.jar:10.0.20] at org.eclipse.jetty.servlet.ServletHandler.doHandle(ServletHandler.java:527) ~[jetty-servlet-10.0.20.jar:10.0.20] at org.eclipse.jetty.server.handler.ScopedHandler.handle(ScopedHandler.java:131) ~[jetty-server-10.0.20.jar:10.0.20] at org.eclipse.jetty.security.SecurityHandler.handle(SecurityHandler.java:598) ~[jetty-security-10.0.20.jar:10.0.20] at org.eclipse.jetty.server.handler.HandlerWrapper.handle(HandlerWrapper.java:122) ~[jetty-server-10.0.20.jar:10.0.20] at org.eclipse.jetty.server.handler.ScopedHandler.nextHandle(ScopedHandler.java:223) ~[jetty-server-10.0.20.jar:10.0.20] at org.eclipse.jetty.server.session.SessionHandler.doHandle(SessionHandler.java:1580) ~[jetty-server-10.0.20.jar:10.0.20] at org.eclipse.jetty.server.handler.ScopedHandler.nextHandle(ScopedHandler.java:221) ~[jetty-server-10.0.20.jar:10.0.20] at org.eclipse.jetty.server.handler.ContextHandler.doHandle(ContextHandler.java:1384) ~[jetty-server-10.0.20.jar:10.0.20] at org.eclipse.jetty.server.handler.ScopedHandler.nextScope(ScopedHandler.java:176) ~[jetty-server-10.0.20.jar:10.0.20] at org.eclipse.jetty.servlet.ServletHandler.doScope(ServletHandler.java:484) ~[jetty-servlet-10.0.20.jar:10.0.20] at org.eclipse.jetty.server.session.SessionHandler.doScope(SessionHandler.java:1553) ~[jetty-server-10.0.20.jar:10.0.20] at org.eclipse.jetty.server.handler.ScopedHandler.nextScope(ScopedHandler.java:174) ~[jetty-server-10.0.20.jar:10.0.20] at org.eclipse.jetty.server.handler.ContextHandler.doScope(ContextHandler.java:1306) ~[jetty-server-10.0.20.jar:10.0.20] at org.eclipse.jetty.server.handler.ScopedHandler.handle(ScopedHandler.java:129) ~[jetty-server-10.0.20.jar:10.0.20] at org.eclipse.jetty.server.handler.ContextHandlerCollection.handle(ContextHandlerCollection.java:149) ~[jetty-server-10.0.20.jar:10.0.20] at org.eclipse.jetty.server.handler.InetAccessHandler.handle(InetAccessHandler.java:228) ~[jetty-server-10.0.20.jar:10.0.20] at org.eclipse.jetty.server.handler.HandlerCollection.handle(HandlerCollection.java:141) ~[jetty-server-10.0.20.jar:10.0.20] at org.eclipse.jetty.server.handler.HandlerWrapper.handle(HandlerWrapper.java:122) ~[jetty-server-10.0.20.jar:10.0.20] at org.eclipse.jetty.rewrite.handler.RewriteHandler.handle(RewriteHandler.java:301) ~[jetty-rewrite-10.0.20.jar:10.0.20] at org.eclipse.jetty.server.handler.HandlerWrapper.handle(HandlerWrapper.java:122) ~[jetty-server-10.0.20.jar:10.0.20] at org.eclipse.jetty.server.handler.gzip.GzipHandler.handle(GzipHandler.java:822) ~[jetty-server-10.0.20.jar:10.0.20] at org.eclipse.jetty.server.handler.HandlerWrapper.handle(HandlerWrapper.java:122) ~[jetty-server-10.0.20.jar:10.0.20] at org.eclipse.jetty.server.Server.handle(Server.java:563) ~[jetty-server-10.0.20.jar:10.0.20] at org.eclipse.jetty.server.HttpChannel$RequestDispatchable.dispatch(HttpChannel.java:1598) ~[jetty-server-10.0.20.jar:10.0.20] at org.eclipse.jetty.server.HttpChannel.dispatch(HttpChannel.java:753) ~[jetty-server-10.0.20.jar:10.0.20] at org.eclipse.jetty.server.HttpChannel.handle(HttpChannel.java:501) ~[jetty-server-10.0.20.jar:10.0.20] at org.eclipse.jetty.server.HttpChannel.run(HttpChannel.java:461) ~[jetty-server-10.0.20.jar:10.0.20] at org.eclipse.jetty.util.thread.strategy.AdaptiveExecutionStrategy.runTask(AdaptiveExecutionStrategy.java:421) 
~[jetty-util-10.0.20.jar:10.0.20] at org.eclipse.jetty.util.thread.strategy.AdaptiveExecutionStrategy.consumeTask(AdaptiveExecutionStrategy.java:390) ~[jetty-util-10.0.20.jar:10.0.20] at org.eclipse.jetty.util.thread.strategy.AdaptiveExecutionStrategy.tryProduce(AdaptiveExecutionStrategy.java:277) ~[jetty-util-10.0.20.jar:10.0.20] at org.eclipse.jetty.util.thread.strategy.AdaptiveExecutionStrategy.produce(AdaptiveExecutionStrategy.java:193) ~[jetty-util-10.0.20.jar:10.0.20] at org.eclipse.jetty.http2.HTTP2Connection.produce(HTTP2Connection.java:208) ~[http2-common-10.0.20.jar:10.0.20] at org.eclipse.jetty.http2.HTTP2Connection.onFillable(HTTP2Connection.java:155) ~[http2-common-10.0.20.jar:10.0.20] at org.eclipse.jetty.http2.HTTP2Connection$FillableCallback.succeeded(HTTP2Connection.java:450) ~[http2-common-10.0.20.jar:10.0.20] at org.eclipse.jetty.io.FillInterest.fillable(FillInterest.java:100) ~[jetty-io-10.0.20.jar:10.0.20] at org.eclipse.jetty.io.SelectableChannelEndPoint$1.run(SelectableChannelEndPoint.java:53) ~[jetty-io-10.0.20.jar:10.0.20] at org.eclipse.jetty.util.thread.strategy.AdaptiveExecutionStrategy.runTask(AdaptiveExecutionStrategy.java:421) ~[jetty-util-10.0.20.jar:10.0.20] at org.eclipse.jetty.util.thread.strategy.AdaptiveExecutionStrategy.consumeTask(AdaptiveExecutionStrategy.java:390) ~[jetty-util-10.0.20.jar:10.0.20] at org.eclipse.jetty.util.thread.strategy.AdaptiveExecutionStrategy.tryProduce(AdaptiveExecutionStrategy.java:277) ~[jetty-util-10.0.20.jar:10.0.20] at org.eclipse.jetty.util.thread.strategy.AdaptiveExecutionStrategy.run(AdaptiveExecutionStrategy.java:199) ~[jetty-util-10.0.20.jar:10.0.20] at org.eclipse.jetty.util.thread.ReservedThreadExecutor$ReservedThread.run(ReservedThreadExecutor.java:411) ~[jetty-util-10.0.20.jar:10.0.20] at org.eclipse.jetty.util.thread.QueuedThreadPool.runJob(QueuedThreadPool.java:969) ~[jetty-util-10.0.20.jar:10.0.20] at org.eclipse.jetty.util.thread.QueuedThreadPool$Runner.doRunJob(QueuedThreadPool.java:1194) ~[jetty-util-10.0.20.jar:10.0.20] at org.eclipse.jetty.util.thread.QueuedThreadPool$Runner.run(QueuedThreadPool.java:1149) ~[jetty-util-10.0.20.jar:10.0.20] at java.base/java.lang.Thread.run(Unknown Source) [?:?] 
2025-01-30 16:58:24.643 INFO (qtp1155769010-5575-search-solrcloud-4.csr-58880) [c:l5RecommendationCollection s:shard2 r:core_node502 x:l5RecommendationCollection_shard2_replica_n501 t:search-solrcloud-4.csr-58880] o.a.s.c.S.Request webapp=/solr path=/get params={distrib=false&qt=/get&fingerprint=false&getVersions=100&wt=javabin&version=2} status=0 QTime=0
2025-01-30 16:58:24.644 INFO (qtp1155769010-6634-search-solrcloud-4.csr-58881) [c:l5RecommendationCollection s:shard2 r:core_node558 x:l5RecommendationCollection_shard2_replica_n557 t:search-solrcloud-4.csr-58881] o.a.s.c.S.Request webapp=/solr path=/get params={distrib=false&qt=/get&fingerprint=false&getVersions=100&wt=javabin&version=2} status=0 QTime=0
2025-01-30 16:41:20.744 INFO (zkCallback-13-thread-61) [c:l5RecommendationCollection s:shard2 r:core_node490 x:l5RecommendationCollection_shard2_replica_n489 t:] o.a.s.u.PeerSync PeerSync: core=l5RecommendationCollection_shard2_replica_n489 url=http://search-solrcloud-9.csr:80/solr Received 29 versions from http://search-solrcloud-5.csr:80/solr/l5RecommendationCollection_shard2_replica_n97/ fingerprint:null
ERROR (recoveryExecutor-10-thread-212-processing-l5RecommendationCollection_shard3_replica_n589 search-solrcloud-0.csr-62278 move-replicas-search-solrcloud-941610687021459 core_node590 create search-solrcloud-4.csr:80_solr l5RecommendationCollection shard3) [c:l5RecommendationCollection s:shard3 r:core_node590 x:l5RecommendationCollection_shard3_replica_n589 t:search-solrcloud-0.csr-62278] o.a.s.h.ReplicationHandler Index fetch failed => org.apache.solr.common.SolrException: Unable to download _7s2.fdt completely. Downloaded 193986560!=400378449
ERROR (recoveryExecutor-10-thread-212-processing-l5RecommendationCollection_shard3_replica_n589 search-solrcloud-0.csr-62278 move-replicas-search-solrcloud-941610687021459 core_node590 create search-solrcloud-4.csr:80_solr l5RecommendationCollection shard3) [c:l5RecommendationCollection s:shard3 r:core_node590 x:l5RecommendationCollection_shard3_replica_n589 t:search-solrcloud-0.csr-62278] o.a.s.c.RecoveryStrategy Error while trying to recover => org.apache.solr.common.SolrException: Replication for recovery failed.

Operator log: 2025-01-30T17:19:35Z INFO Found async status {"controller": "solrcloud", "controllerGroup": "solr.apache.org", "controllerKind": "SolrCloud", "SolrCloud": {"name":"search","namespace":"csr"}, "namespace": "csr", "name": "search", "reconcileID": "0b0fae61-ad23-44c0-8286-c4fe88f3aecb", "evictionReason": "scaleDown", "requestId": "move-replicas-search-solrcloud-9", "state": "running"}
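
The operator log above shows the scale-down eviction waiting on an async Collections API request (requestId move-replicas-search-solrcloud-9). If that request never completes, its status can be inspected directly; a sketch assuming the same port-forwarded access as above:

```bash
# Check the async operation the operator is waiting on (request id from the log above).
curl -s 'http://localhost:8983/solr/admin/collections?action=REQUESTSTATUS&requestid=move-replicas-search-solrcloud-9&wt=json'
```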

Kamalsaiperla · Jan 30 '25

That is a lot of replicas per shard... Are you possibly running out of disk space?

HoustonPutman · Mar 13 '25
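
A quick way to answer the disk-space question above, assuming the pods carry the operator's solr-cloud=search label, the Solr container is named solrcloud-node, and the data volume is mounted at the stock image's /var/solr/data (all defaults, none confirmed in the report):

```bash
# Check data-volume usage on every Solr pod (label, container name and mount path are assumptions).
for p in $(kubectl -n csr get pods -l solr-cloud=search -o name); do
  echo "== ${p#pod/}"
  kubectl -n csr exec "${p#pod/}" -c solrcloud-node -- df -h /var/solr/data
done
```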

I saw something similar happen when running out of disk space. Basically, the solr-operator (or something) started continuously adding replicas, which broke a few collections badly enough that they had to be restored from backups.

jstaf · Mar 17 '25
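
For reference, a collection damaged this way can be rebuilt from an existing backup through the Collections API RESTORE action. A hedged sketch: the backup name and repository are placeholders, and restoring under a new collection name (then switching an alias over once verified) avoids clashing with the broken collection:

```bash
# Hypothetical restore from a previously taken backup; "nightly" and the repository name are placeholders.
curl -s 'http://localhost:8983/solr/admin/collections?action=RESTORE&name=nightly&repository=gcs_backup_repo&collection=l5RecommendationCollection_restored&async=restore-l5-1'
```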

Oof, yeah, that's really not great. I think it's ultimately a Solr bug: when an error happens while moving or creating replicas, Solr doesn't always delete the replicas it was trying to create, so it can accumulate more and more broken replicas.

HoustonPutman · Mar 17 '25
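
In the meantime, replicas left behind by a failed move can usually be removed by hand with DELETEREPLICA. A sketch using the shard3 replica from the recovery error in the logs; verify it is genuinely a surplus, broken copy before deleting:

```bash
# Manually drop a down replica left over from a failed move (double-check the core_node first).
curl -s 'http://localhost:8983/solr/admin/collections?action=DELETEREPLICA&collection=l5RecommendationCollection&shard=shard3&replica=core_node590'
```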