
One node down search fails

Open JimKerwood opened this issue 14 years ago • 12 comments

It seems that if one node of the Solandra cluster is down or being bounced, queries fail. I'm not sure whether a put will fail or not.

JimKerwood avatar Apr 10 '11 02:04 JimKerwood

You mean reads? If you want it to work, you would need to increase the replication factor in Cassandra for the L keyspace.

tjake avatar Apr 12 '11 02:04 tjake

We don't even get that far. Either it times out after the 1024 tries or, if I bring the node back up while it is still trying, it throws a connection refused exception (I'm guessing because the node isn't initialized yet, although the port is there).

JimKerwood avatar Apr 12 '11 12:04 JimKerwood

You are saying you can't even start solandra?

To change the replication factor use the supplied cassandra-cli tool:

cassandra-tool/cassandra-cli --host localhost

update keyspace L with replication_factor=2;

tjake avatar Apr 12 '11 15:04 tjake

No, here is the use case:

  1. All 6 boxes running. Queries all work.
  2. Bring 1 box down for maintenance. Queries now start timing out. I assume queries should continue on the running boxes.
  3. Bring the box back up. Any query attempted gets a socket timeout.
  4. When all boxes are back up, all queries work.

I think the HTTP request is trying to hit all 6 boxes. It is failing there not even down at the Cassandra level.

Some of the stacktrace:

HTTP ERROR 500
Problem accessing /solandra/checks/select. Reason:
    org.apache.solr.client.solrj.SolrServerException: java.net.ConnectException: Connection refused

org.apache.solr.common.SolrException: org.apache.solr.client.solrj.SolrServerException: java.net.ConnectException: Connection refused
    at org.apache.solr.handler.component.SearchHandler.handleRequestBody(SearchHandler.java:282)
    at org.apache.solr.handler.RequestHandlerBase.handleRequest(RequestHandlerBase.java:131)
    at org.apache.solr.core.SolrCore.execute(SolrCore.java:1316)
    ...

Caused by: org.apache.solr.client.solrj.SolrServerException: java.net.ConnectException: Connection refused
    at org.apache.solr.client.solrj.impl.CommonsHttpSolrServer.request(CommonsHttpSolrServer.java:483)
    at org.apache.solr.client.solrj.impl.CommonsHttpSolrServer.request(CommonsHttpSolrServer.java:244)

    at org.apache.solr.handler.component.HttpCommComponent$1.call(SearchHandler.java:422)

JimKerwood avatar Apr 12 '11 16:04 JimKerwood

Why should queries continue when there is missing data? If you have a replication factor of 1 and you take down a box then it should error IMO.

I think once you turn it back on, the cluster should start working again. If that's not the case, then that's a bug and I need to fix it.

tjake avatar Apr 12 '11 16:04 tjake

Agree with the replication factor of 1.
So you are saying if I have a replication factor of 2 and I have one machine down this will not error anymore? If so I am satisfied.
Though if I set the replication factor to 2 and it still errors with one machine down I would say this should be fixed.

JimKerwood avatar Apr 12 '11 17:04 JimKerwood

Correct. If you change the replication factor and repair the nodes using cassandra-tools/nodetool -h localhost repair L on each node, then it will work.

tjake avatar Apr 12 '11 17:04 tjake

Even after changing the replication factor and repairing, the problem persists. If a node is down, all other nodes wait (and time out if left long enough).

JimKerwood avatar Apr 20 '11 16:04 JimKerwood

@JimKerwood Can you reproduce this issue with a fresh cluster set to RF=2 before you set any schemas or index any data?

davidstrauss avatar May 12 '11 21:05 davidstrauss

I misspoke. RF=2 is tricky because a quorum is 2. Quorum is used internally for document ID and shard tracking.

RF=3 should work.
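
To make the quorum arithmetic concrete, here is a minimal sketch. It assumes the standard Cassandra definition of a quorum, floor(RF / 2) + 1 replicas; the function names `quorum` and `tolerated_failures` are only illustrative and are not part of Solandra or Cassandra.

```python
def quorum(replication_factor: int) -> int:
    """Replicas that must respond for a QUORUM read or write.

    Assumes the usual Cassandra rule: floor(RF / 2) + 1.
    """
    return replication_factor // 2 + 1


def tolerated_failures(replication_factor: int) -> int:
    """Replicas of a row that can be down while QUORUM still succeeds."""
    return replication_factor - quorum(replication_factor)


if __name__ == "__main__":
    for rf in (1, 2, 3):
        print(
            f"RF={rf}: quorum={quorum(rf)}, "
            f"replicas that may be down={tolerated_failures(rf)}"
        )
```

With RF=2 the quorum is 2, so no replica can be missing; with RF=3 the quorum is still 2, which leaves room for one replica to be down. This only covers the quorum rule and says nothing about other reasons a search might fail.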

tjake avatar May 13 '11 00:05 tjake

Hi

I have two nodes running, set replication_factor:3, and ran the repair tool on the L keyspace. When one of the nodes goes down, search fails on the remaining node.

I get this exception

read command failed after 1024attempts

java.io.IOException: Read command failed after 1024attempts
    at lucandra.CassandraUtils.robustRead(CassandraUtils.java:625)
    at lucandra.CassandraUtils.robustRead(CassandraUtils.java:634)
    at solandra.SolandraComponent.flushCache(SolandraComponent.java:67)
    at solandra.SolandraComponent.prepare(SolandraComponent.java:115)
    at solandra.SolandraQueryComponent.prepare(SolandraQueryComponent.java:45)
    at org.apache.solr.handler.component.SearchHandler.handleRequestBody(SearchHandler.java:173)
    at org.apache.solr.handler.RequestHandlerBase.handleRequest(RequestHandlerBase.java:129)
    at org.apache.solr.core.SolrCore.execute(SolrCore.java:1368)
    at org.apache.solr.servlet.SolrDispatchFilter.execute(SolrDispatchFilter.java:356)
    at solandra.SolandraDispatchFilter.execute(SolandraDispatchFilter.java:171)
    at org.apache.solr.servlet.SolrDispatchFilter.doFilter(SolrDispatchFilter.java:252)
    at solandra.SolandraDispatchFilter.doFilter(SolandraDispatchFilter.java:137)
    at org.mortbay.jetty.servlet.ServletHandler$CachedChain.doFilter(ServletHandler.java:1212)
    at org.mortbay.jetty.servlet.ServletHandler.handle(ServletHandler.java:399)
    at org.mortbay.jetty.servlet.SessionHandler.handle(SessionHandler.java:182)
    at org.mortbay.jetty.handler.ContextHandler.handle(ContextHandler.java:766)
    at org.mortbay.jetty.handler.HandlerWrapper.handle(HandlerWrapper.java:152)
    at org.mortbay.jetty.Server.handle(Server.java:326)
    at org.mortbay.jetty.HttpConnection.handleRequest(HttpConnection.java:542)
    at org.mortbay.jetty.HttpConnection$RequestHandler.content(HttpConnection.java:945)
    at org.mortbay.jetty.HttpParser.parseNext(HttpParser.java:756)
    at org.mortbay.jetty.HttpParser.parseAvailable(HttpParser.java:218)
    at org.mortbay.jetty.HttpConnection.handle(HttpConnection.java:404)
    at org.mortbay.jetty.bio.SocketConnector$Connection.run(SocketConnector.java:228)
    at org.mortbay.thread.QueuedThreadPool$PoolThread.run(QueuedThreadPool.java:582)

request: http://192.168.1.99:8983/solandra/reuters~0/select
    at org.apache.solr.client.solrj.impl.CommonsHttpSolrServer.request(CommonsHttpSolrServer.java:430)
    at org.apache.solr.client.solrj.impl.CommonsHttpSolrServer.request(CommonsHttpSolrServer.java:244)

I also tried changing solandra.consistency from QUORUM to ONE in solandra.properties, but this didn't help.

Any ideas how to fix this, or whether I'm doing something wrong?

topoqdm avatar May 29 '12 16:05 topoqdm

Hi Jake, I tried replication factors 2 and 3 and the problem persists: once a node is down, you cannot make any request to any of the other live nodes.

Thanks

leonardlabuneti avatar Sep 16 '13 22:09 leonardlabuneti