athena-datasource

Validate that Error Statuses from AWS are not obfuscated

Open skuzzle opened this issue 1 year ago • 4 comments

What happened: We're often seeing random Athena errors like this one:

error querying the database: GENERIC_INTERNAL_ERROR: Cannot invoke "io.trino.execution.SqlTaskExecution$DriverSplitRunnerFactory.enqueueSplits(java.util.Set, boolean)" because "factory" is null

Those errors are impossible to predict and apparently also unavoidable. However, when such an error happens, it often comes back with a 4xx error code from the /api/ds/query endpoint. We are using this endpoint for some automated tests and rely on the response code to decide whether to retry the test. Retrying something that really is a client error (e.g. a query with a syntax error) doesn't make sense. So it would be nice to be able to distinguish between real client errors and internal errors.

What you expected to happen: I'd expect internal Athena errors to be returned with a 5xx response code from the /api/ds/query endpoint. A 4xx response code makes sense for real client errors like sending a query with a syntax error.
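The retry rule described above can be sketched roughly as follows. This is only an illustration of the intended client-side behavior, not code from the plugin or from the reporter's test suite; `should_retry`, `run_with_retry`, and the `send_query` callable are hypothetical names.

```python
def should_retry(status_code: int) -> bool:
    """Retry only server-side (5xx) failures; a 4xx means the request itself is wrong."""
    return 500 <= status_code <= 599


def run_with_retry(send_query, max_attempts: int = 3):
    """Call send_query() until it succeeds, fails in a non-retryable way,
    or the attempts are used up.

    send_query is a hypothetical callable returning (status_code, body).
    """
    if max_attempts < 1:
        raise ValueError("max_attempts must be at least 1")
    for attempt in range(1, max_attempts + 1):
        status_code, body = send_query()
        # Stop on success, on a client error, or on the last attempt.
        if not should_retry(status_code) or attempt == max_attempts:
            return status_code, body
```

With this rule, a syntax error reported as 4xx fails the test immediately, while an internal error reported as 5xx is retried. That only works if the status code reflects the real cause, which is the point of this issue.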

How to reproduce it (as minimally and precisely as possible): Not really reproducible in a reliable way

Environment:

  • Grafana version: 10.2.3
  • Plugin version: 2.14.1

skuzzle avatar May 06 '24 06:05 skuzzle

Hi @skuzzle, hmm, this is a tough one. I would think that we would want to forward whatever status codes AWS returns to us and not get too opinionated with our error handling here, but I am not familiar with that specific error. Do you have any more information on it? It does seem like many of these "GENERIC_INTERNAL_ERROR" cases are syntax/client related, which makes me a bit hesitant to add something: https://repost.aws/knowledge-center/athena-generic-internal-error

I think for us to implement something like this we'd need more information about the error so we feel confident that it should in fact be treated like a 5xx in all instances. It sounds like in your case it probably makes sense, but I'm not confident that is true in all cases since I don't understand what it means. We can ask our contacts at AWS to see if they are more familiar. It's also possible they do respond with a 5xx and there's a bug in our code where we do not forward that along properly.

In the meantime, it could be helpful if you also reach out to AWS, and do let us know if you learn anything more about this error. Thanks for reporting!

sarahzinger avatar May 06 '24 13:05 sarahzinger

it's also possible they do respond with a 5xx and there's a bug in our code where we do not forward that along properly

This is what I suspected is happening. I'm not expecting you to fix Athena error handling on your side. So if you are already forwarding 5xx as 5xx and 4xx as 4xx then you probably should not change a thing and it's up to Athena folks to fix this.

We spoke to AWS, or rather to our cloud contractor, about some of the odd failures, and the gist was that these kinds of errors can happen in a distributed system and should be handled with a retry on the client side. Now this obviously only makes sense if those errors do come back as 5xx.

skuzzle avatar May 06 '24 15:05 skuzzle

Ah, I see what you mean @skuzzle. It's going to be hard for us to trigger the specific error you're talking about for testing since we don't have repro steps, but we certainly can stub out different error status codes from AWS and see what gets returned.

I just tried hard coding a fake error with a status of 500, and noticed that while we returned that in the response object, the actual status of the response was a 400. I'm guessing we probably have a bug here. I updated the title to have someone double check/update any places they can find where we might not be forwarding the error status code.

Thanks for talking it through and making the issue! I'm going to put this into our backlog for now since we don't have a specific timeline yet for fixing it and don't want to overpromise a delivery date until we can figure it out, but I will be sure to bring it up to the team. Also, if you have any interest in contributing, we'd be happy to review any PRs :)

sarahzinger avatar May 08 '24 14:05 sarahzinger

Thx for picking this up. For the sake of completeness, here are two further errors we encountered that came back with a 4xx status from Grafana but that clearly look like internal Athena errors:

error querying the database: [ErrorCode: INTERNAL_ERROR_QUERY_ENGINE] Amazon Athena experienced an internal error while executing this query. Please contact AWS support for further assistance. You will not be charged for this query. We apologize for the inconvenience.
error querying the database: Query exhausted resources at this scale factor
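As long as the HTTP status cannot be trusted, one possible stopgap is to match on the error text itself. The sketch below uses only the messages quoted in this thread. Treating them as retryable is an assumption rather than documented Athena behavior, and as noted earlier in the thread, GENERIC_INTERNAL_ERROR can also have client-side causes. The names `INTERNAL_ERROR_MARKERS` and `looks_like_internal_error` are hypothetical.

```python
# Substrings taken from the error messages quoted in this thread.
# Assumption: these indicate transient, server-side failures worth retrying.
INTERNAL_ERROR_MARKERS = (
    "GENERIC_INTERNAL_ERROR",
    "INTERNAL_ERROR_QUERY_ENGINE",
    "Query exhausted resources",
)


def looks_like_internal_error(message: str) -> bool:
    """Return True if the error text contains one of the known markers."""
    return any(marker in message for marker in INTERNAL_ERROR_MARKERS)
```

String matching like this is brittle, since the wording of the messages can change, so it is better seen as a fallback than as a replacement for a correct status code.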

skuzzle avatar May 10 '24 06:05 skuzzle

Hi @skuzzle, thank you for your patience on this. Unfortunately, due to how our query service works, we're unable to return the same HTTP status from the queryData endpoint as the one Athena sends back to us (we always return 400 if the query is unsuccessful). However, we've just released an improvement (v3.1.0 of the Athena plugin) in how we return the error object and the status field within it. You might have found a workaround in the meantime, but this might help your use case if you're able to access the response body.
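A client that can read the response body could base its retry decision on the status field inside the error object rather than on the HTTP status. The sketch below assumes the body has the shape `{"results": {"<refId>": {"error": "...", "status": 500}}}`. That shape and the helper names are assumptions for illustration and should be checked against an actual response from your Grafana and plugin versions.

```python
import json


def query_statuses(body_text: str) -> dict:
    """Map each refId to the status reported inside the response body (None if absent).

    Assumes a body of the form {"results": {"A": {"error": "...", "status": 500}}}.
    """
    results = json.loads(body_text).get("results", {})
    return {ref_id: result.get("status") for ref_id, result in results.items()}


def has_server_error(body_text: str) -> bool:
    """Return True if any query in the body reports a 5xx status."""
    return any(
        isinstance(status, int) and 500 <= status <= 599
        for status in query_statuses(body_text).values()
    )
```

For example, a body whose query `A` carries `"status": 500` would make `has_server_error` return `True` even though the HTTP response itself came back as 400.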

idastambuk avatar Feb 28 '25 13:02 idastambuk