
K8SSAND-912 ⁃ Review errors returned by management API

Open Miles-Garnsey opened this issue 4 years ago • 1 comments

It might prove profitable to do a full review of the errors returned by the management API. At present most errors returned tend to be 500s, with no error message included.

For example, if keyspace creation fails because we've sent a bad request we get the following in the management API (or Cassandra?) logs:

Caused by: org.apache.cassandra.exceptions.ConfigurationException: Unrecognized strategy option {reaper-test} passed to NetworkTopologyStrategy for keyspace reaper_db
at org.apache.cassandra.locator.AbstractReplicationStrategy.validateExpectedOptions(AbstractReplicationStrategy.java:457)
at org.apache.cassandra.locator.NetworkTopologyStrategy.validateExpectedOptions(NetworkTopologyStrategy.java:303)
at org.apache.cassandra.locator.AbstractReplicationStrategy.validateReplicationStrategy(AbstractReplicationStrategy.java:402)
at org.apache.cassandra.schema.ReplicationParams.validate(ReplicationParams.java:78)
at org.apache.cassandra.schema.KeyspaceParams.validate(KeyspaceParams.java:94)
at org.apache.cassandra.cql3.statements.schema.CreateKeyspaceStatement.apply(CreateKeyspaceStatement.java:87)
at org.apache.cassandra.schema.Schema.transform(Schema.java:588)
at org.apache.cassandra.schema.MigrationManager.lambda$announce$2(MigrationManager.java:226)
at java.base/java.util.concurrent.FutureTask.run(Unknown Source)
at java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(Unknown Source)
at java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(Unknown Source)
... 2 common frames omitted

But the client gets back this:

ERROR controllers.Reaper failed to create keyspace {"reaper": "default/reaper-test", "error": "incorrect status code of 500 when calling endpoint"} 

This seems not quite right, because we haven't encountered an internal error. We've sent the management API a bad request, which should register as a 400 error, not a 500. In addition, troubleshooting would be assisted by actually including the specifics of the error.
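As a rough illustration of the mapping being asked for, here is a minimal sketch in plain Java. It is not the management API's actual code: `ErrorMapping`, `ErrorResponse`, and `InvalidRequestException` are hypothetical names, and `InvalidRequestException` merely stands in for a validation failure such as the `ConfigurationException` in the trace above. The idea is to unwrap the failure to its root cause, return 400 when the root cause is a validation error, and always carry the message back to the client.

```java
final class ErrorMapping {

    /** Hypothetical stand-in for a validation failure such as Cassandra's ConfigurationException. */
    static final class InvalidRequestException extends RuntimeException {
        InvalidRequestException(String message) {
            super(message);
        }
    }

    /** HTTP status code plus a human-readable message for the response body. */
    static final class ErrorResponse {
        final int status;
        final String message;

        ErrorResponse(int status, String message) {
            this.status = status;
            this.message = message;
        }
    }

    /** Walks the cause chain down to the innermost exception. */
    static Throwable rootCause(Throwable failure) {
        Throwable current = failure;
        while (current.getCause() != null && current.getCause() != current) {
            current = current.getCause();
        }
        return current;
    }

    /** Maps a failure to 400 if the root cause is a validation error, otherwise to 500. */
    static ErrorResponse toResponse(Throwable failure) {
        Throwable root = rootCause(failure);
        String message = root.getMessage() != null
                ? root.getMessage()
                : root.getClass().getSimpleName();
        if (root instanceof InvalidRequestException || root instanceof IllegalArgumentException) {
            return new ErrorResponse(400, message);
        }
        return new ErrorResponse(500, message);
    }
}
```

With something along these lines, the failed keyspace creation above would surface to the client as a 400 whose body contains the "Unrecognized strategy option" text, rather than a bare 500.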

A similar thing happens when the management API is waiting for Cassandra to come up: it returns 500 errors, when it should probably be handing back a 100-series code to indicate a continuing process, i.e. neither success nor failure. (There may be other options here, like a 503, if we are unable to distinguish between Cassandra still bootstrapping and Cassandra having bootstrapped and subsequently gone down.)
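To make the 503 option concrete, here is a small sketch under stated assumptions. `ReadinessStatus` and `CassandraState` are hypothetical names, not part of the management API, and the ten-second retry hint is an arbitrary placeholder. It treats "still starting" and "unreachable" the same way, which matches the case where the two cannot be told apart, and suggests a value for the standard `Retry-After` header.

```java
final class ReadinessStatus {

    /** Hypothetical view of the Cassandra process as seen by the management API. */
    enum CassandraState { STARTING, READY, UNREACHABLE }

    /** Arbitrary placeholder for the Retry-After header, in seconds. */
    static final int RETRY_AFTER_SECONDS = 10;

    /** Returns 200 when Cassandra is ready and 503 while it is starting or unreachable. */
    static int statusFor(CassandraState state) {
        switch (state) {
            case READY:
                return 200;
            case STARTING:
            case UNREACHABLE:
                return 503;
            default:
                return 500;
        }
    }

    /** Returns a Retry-After hint for 503 responses, or -1 when no hint applies. */
    static int retryAfterFor(CassandraState state) {
        return statusFor(state) == 503 ? RETRY_AFTER_SECONDS : -1;
    }
}
```

A client that sees a 503 with a retry hint can back off and poll again, instead of treating the response as a hard failure the way it would a 500.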

HTTP status codes can convey quite rich information; it would be nice to leverage them more!

┆Issue is synchronized with this Jira Story by Unito

Miles-Garnsey, Sep 16 '21 04:09

Diving deeper into this codebase, I've found that a blocker for this issue might be the way that errors are communicated back from the agents to the management API server.

The particular CALL CQL commands are held in the server's resources package (e.g. here). These functions map HTTP endpoints and verbs to the agent's methods via cqlService.executePreparedStatement() calls.

It appears that executePreparedStatement() only throws ConnectionClosedException, which is not the richest type of error that we can return. I'm wondering if this is where we need to focus attention so that the CALL commands issued to CQL return an error type and string which can be deserialised out of an Object or ResultSet.
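One possible shape for this, sketched with plain Java collections rather than the real driver types: the agent would include an error type and message in the row it returns, and the server would deserialise them before deciding on a status code. Everything here is an assumption for illustration. `AgentError`, the `error_type` and `error_message` column names, and the use of a `Map` in place of a driver row are all hypothetical.

```java
import java.util.Map;
import java.util.Optional;

final class AgentError {
    final String type;
    final String message;

    AgentError(String type, String message) {
        this.type = type;
        this.message = message;
    }

    /**
     * Reads an error out of a result row, modelled here as a Map.
     * The "error_type" and "error_message" keys are hypothetical column names.
     * Returns Optional.empty() when the row carries no error.
     */
    static Optional<AgentError> fromRow(Map<String, Object> row) {
        Object type = row.get("error_type");
        if (type == null) {
            return Optional.empty();
        }
        Object message = row.get("error_message");
        return Optional.of(new AgentError(
                type.toString(),
                message == null ? "" : message.toString()));
    }
}
```

The server could then branch on `type` (for example, treating a configuration error as a client error) and pass `message` through to the HTTP response.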

Miles-Garnsey, Sep 20 '21 05:09