
Does not auto-discover nodes at all

Open jacek99 opened this issue 10 years ago • 13 comments

We have a customer who discovered this in production: none of the cluster node auto-discovery logic seems to work at all.

We have our cluster set up like this

        AstyanaxContext<Cluster> context = new AstyanaxContext.Builder()
                .forCluster("MyCluster")
                .forKeyspace(cassandraConfig.getKeyspace())
                .withAstyanaxConfiguration(new AstyanaxConfigurationImpl()
                        .setDiscoveryType(NodeDiscoveryType.RING_DESCRIBE)
                        .setConnectionPoolType(ConnectionPoolType.TOKEN_AWARE)
                        .setDefaultWriteConsistencyLevel(QUORUM)
                        .setDefaultReadConsistencyLevel(QUORUM)
                )
                .withConnectionPoolConfiguration(new ConnectionPoolConfigurationImpl("MyCluster")
                        .setPort(cassandraConfig.getPort())
                        .setSeeds(cassandraConfig.getSeeds())
                         // limit node discovery to local datacenter only
                        .setLocalDatacenter(cassandraConfig.getDatacenter())
                )
                .withConnectionPoolMonitor(new CountingConnectionPoolMonitor())
                .buildCluster(ThriftFamilyFactory.getInstance());

We have tried different combinations:
a) removing the setDiscoveryType() call and leaving only setConnectionPoolType()
b) keeping both of them together, as in the example above
c) removing the setLocalDatacenter() call (just in case)

In our case we have a cluster of 3 nodes, and we can see them talking to each other via gossip.

If we have the "seeds" set to just one of the nodes, e.g. "node:9160", we can see the connection pool discovering the original seed node:

INFO  [2014-04-10 15:32:42,855] com.netflix.astyanax.connectionpool.impl.CountingConnectionPoolMonitor: AddHost: <IP address of first host>

We never see the other 2 hosts being discovered.

If we bring the first host down, the cluster is still running (we can connect to the other 2 nodes via cassandra-cli and it works).

But our app starts throwing NoHostAvailableException immediately.

We tried this on CentOS, Mint, Fedora with Cassandra 1.1.

Please help us with any pointers on how to investigate this further; this is a critical production issue for us.

Much appreciated

jacek99 avatar Apr 10 '14 16:04 jacek99

Sorry, forgot to mention we tried it with Cassandra 2.0.6 as well, same result

jacek99 avatar Apr 10 '14 16:04 jacek99

@jacek99 Apologies for not getting to this earlier. Yeah, the node discovery process is a bit clunky. It's probably a configuration issue that needs better documentation, or it may be an Astyanax bug that we haven't discovered ourselves yet. In any case, I'll set up the same context on my local cluster and debug this.

Thanks for reporting, will get back to you soon.

opuneet avatar May 01 '14 16:05 opuneet

much appreciated, thanks

jacek99 avatar May 01 '14 16:05 jacek99

@jacek99 I noticed that you set your local datacenter. Are you sure that this isn't filtering out the rest of the hosts? Note that the local datacenter feature is not equipped with fallback behavior where Astyanax switches to the non-local datacenter nodes when the local nodes go away.

Make sense?

opuneet avatar May 01 '14 16:05 opuneet

FWIW, I consider auto-fallback to a different DC to be a bit of an anti-pattern, since it violates the consistency guarantees of LOCAL_QUORUM and LOCAL_ONE.

If you fail over to another DC deliberately in client code, that can be fine, but the driver failing over without the explicit permission of the app is a bad idea.


tupshin avatar May 01 '14 16:05 tupshin

No, we removed the local datacenter and tested it without that line. As I mentioned in the initial report, we have tried all sorts of different combinations.

Whatever we use, unless it is specified in the seeds, Astyanax never seems to auto-discover the other nodes.

jacek99 avatar May 01 '14 16:05 jacek99

.buildCluster internally won't issue a ring describe. You have to use .buildKeyspace.
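A minimal sketch of what that might look like (the keyspace name and seed host here are placeholders, and the pool settings are carried over from the original report), built with buildKeyspace() instead of buildCluster():

```java
import com.netflix.astyanax.AstyanaxContext;
import com.netflix.astyanax.Keyspace;
import com.netflix.astyanax.connectionpool.NodeDiscoveryType;
import com.netflix.astyanax.connectionpool.impl.ConnectionPoolConfigurationImpl;
import com.netflix.astyanax.connectionpool.impl.ConnectionPoolType;
import com.netflix.astyanax.connectionpool.impl.CountingConnectionPoolMonitor;
import com.netflix.astyanax.impl.AstyanaxConfigurationImpl;
import com.netflix.astyanax.thrift.ThriftFamilyFactory;

public class KeyspaceContextExample {
    public static Keyspace buildKeyspaceClient() {
        AstyanaxContext<Keyspace> context = new AstyanaxContext.Builder()
                .forCluster("MyCluster")
                .forKeyspace("my_keyspace")              // placeholder keyspace name
                .withAstyanaxConfiguration(new AstyanaxConfigurationImpl()
                        .setDiscoveryType(NodeDiscoveryType.RING_DESCRIBE)
                        .setConnectionPoolType(ConnectionPoolType.TOKEN_AWARE))
                .withConnectionPoolConfiguration(new ConnectionPoolConfigurationImpl("MyCluster")
                        .setPort(9160)
                        .setSeeds("node1:9160"))         // placeholder seed
                .withConnectionPoolMonitor(new CountingConnectionPoolMonitor())
                // buildKeyspace() (not buildCluster()) enables the ring-describe
                // discovery path discussed above
                .buildKeyspace(ThriftFamilyFactory.getInstance());
        context.start();
        return context.getClient();
    }
}
```

This is a configuration sketch against the Astyanax API, not a verified fix; it simply shows the buildKeyspace() call path this comment refers to.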

tsteinmaurer avatar May 06 '14 12:05 tsteinmaurer

We get the keyspace via

cluster.getKeyspace();

shouldn't that do a ring describe as well?

jacek99 avatar May 26 '14 18:05 jacek99

I don't see that in the source code. Have a look at AstyanaxContext.buildCluster() vs. buildKeyspace() and their differences with respect to a ring describe.

tsteinmaurer avatar May 27 '14 05:05 tsteinmaurer

I agree with @jacek99

cluster.getKeyspace() should do the same thing as context.buildKeyspace(), right?

The AstyanaxContext.Builder can be used to build a cluster or a keyspace, and if you can build a keyspace from the cluster, it should construct the keyspace in exactly the same way. I feel this is a bug in Astyanax, and unfortunately the differences are not captured/documented (from what I can find).

Obsidion avatar Jul 01 '14 17:07 Obsidion

I tried using the buildKeyspace() method and it still was iffy with node discovery. We had put that on hold due to other issues, but I should go back to testing it later this week.

It still was not quite working as expected.

jacek99 avatar Jul 01 '14 17:07 jacek99

Our system uses the context.builder.buildCluster().getClient().getKeyspace() exclusively (we need some of the operations available on the cluster object). I hacked in the context.builder.buildKeyspace() and saw that auto discovery worked correctly (though it replaced domain names with ip, but that didn't seem to be an issue for my system). Unfortunately to make this change in our internal library, I'd have to hold onto both the keyspace and cluster obj, and I'm afraid of the other inconsistencies that might crop up by doing this.

Obsidion avatar Jul 01 '14 17:07 Obsidion

@opuneet Any progress on this? I came across this issue when configuring a client in production and was unaware that buildCluster does not issue a ring describe. At the very least, can this be documented so that future users are not blindsided by this difference between buildCluster and buildKeyspace?

markreddy avatar Sep 29 '14 21:09 markreddy