BigQuery-Python
BigQuery-Python copied to clipboard
Unable to get failure reason for jobs that have failed
Happy to write this myself, but just wanted to validate direction:
Use Case: I've submitted a query asynchronously to BQ. The query fails for any number of reasons. The job will be marked as "done" but I do not have the ability to get the reason for the failure and the current check_job implementation will just say 0 rows.
Solutions, in no particular order:
- Leave it alone, it's good enough
- Hijack check_job and have it return richer job status information.
- Add a new get_job_status method that's responsible for returning the status, including failure reasons if any.
I'm guessing number 2 is probably the best option, but wanted to pass it by before making a pull request.
Whatdya think?
I'm in favor of 2 as well. It might make sense to have check_job
return a "job status" object which encapsulates the various job properties: id, complete, total rows, failure reason, etc. It's a breaking API change obviously, but I think it's probably justified here.
One super annoying thing about the BQ API is that the status code returned for getQueryResults on failed jobs represents the reason why the job failed, not for the getQueryResults request.
For instance, if I try to submit a query against a table that doesn't exist, getQueryResults returns a 404. If I submit syntactically incorrect SQL to a batch job, then getQueryResults return a 400. Personally, I think that's an odd API design as I expect the status code to be in context of the current request, not the job I submitted and am now checking on the status. Anyways, not a big deal; it just means that the logic in check_job will have to catch HttpError and just involve a lot more logic than I think it should have to.
I'm guessing we probably want our own shape for that job status object, just to reduce the number of breaking changes? Probably model off the current structure? If it changes in the future, we can just continue to use the same shape or decide to make another breaking API change. Something like this?
{
"job_id": "ABCD123",
"state": "DONE", //done, pending, running
"total_rows": 0, //valid values 0-n, will be missing if in error state, mutually exclusive with 'error'
"error": { //if errors, otherwise missing
"message": "Encountered bad SQL at line 1, column 53. Was expecting: <EOF>", //top-level message as reported by BQ
"errors": [
{
"reason": "invalidQuery",
"message": "Encountered bad SQL at line 1, column 53. Was expecting: <EOF>"
},
{
"reason": "invalidQuery",
"message": "Encountered bad SQL at line 4, column 22. Was expecting: SELECT"
}
]
}
}
Yeah, that seems reasonable. I think I would be in favor of making that an actual class instead of using a dict. Otherwise the user has to know what keys are set as opposed to operating on a rigid object. This also makes it a little easier to change things without breaking the API down the road.