Non-insertion errors in BigQueryIO.Write cause infinite loop
When streaming inserts into BigQuery using BigQueryIO.Write, if an error occurs that is not a row insertion error (e.g. an IOException), BigQueryIO assumes it must be a rate limit error and enters an infinite loop of retries. That behaviour is reasonable for genuine rate limit errors, but other kinds of error reach the same code path.
One example is a "Not found" error for the BigQuery table. This can happen if the table was originally created by BigQueryIO.Write (with CreateDisposition set to CREATE_IF_NEEDED) but has since been deleted: because created tables are cached, BigQueryIO will not recreate it. This is more likely to happen in a long-running streaming job. The infinite loop of retries does not help, because only row insertion is retried, never table creation. The table either needs to be created by an external process, or the pipeline needs to be restarted (thereby clearing the cache).
This happens regardless of the configured InsertRetryPolicy, because these are not insertion errors. As a result, we see logs such as "INFO: BigQuery insertAll error, retrying: Not found: Table <project>:<dataset>.<table>" even with InsertRetryPolicy set to neverRetry(), which is confusing behaviour. I expect similar issues to occur for other types of error (e.g. no response from the BigQuery API).
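For reference, the write in question is configured roughly as follows (a minimal sketch, not a complete pipeline; the table reference and schema are placeholders):

```java
// Minimal sketch of a streaming write. The InsertRetryPolicy below only
// governs per-row insertion failures; a non-insertion error such as
// "Not found: Table ..." is retried indefinitely regardless of it.
rows.apply(
    BigQueryIO.writeTableRows()
        .to("<project>:<dataset>.<table>")
        .withCreateDisposition(BigQueryIO.Write.CreateDisposition.CREATE_IF_NEEDED)
        .withSchema(schema)
        .withMethod(BigQueryIO.Write.Method.STREAMING_INSERTS)
        // Has no effect on the "Not found" retry loop described above.
        .withFailedInsertRetryPolicy(InsertRetryPolicy.neverRetry()));
```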
To recreate this issue, create a pipeline which inserts into BigQuery using BigQueryIO, where the table does not exist beforehand but should be created by BigQueryIO (i.e. CreateDisposition = CREATE_IF_NEEDED). Then mock BigQueryServicesImpl's call to create the BigQuery table, so that no table is actually created (I did this in a brute-force way by writing my own BigQueryServicesImpl and injecting it with ".withTestServices()"). The pipeline will then enter an infinite loop, logging "INFO: BigQuery insertAll error, retrying: Not found: Table <project>:<dataset>.<table>".
One suggestion to avoid this is to add a second retry policy, controlling retries for non-insertion errors. This policy could be optional for users and would address this case. An alternative (or additional) option would be to check for "table not found" errors at this point and, if encountered, retry table creation before the next insertion retry.
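The suggestion above could look roughly like the following. None of these class or method names exist in Beam; this is only a hypothetical sketch of the proposed control flow, assuming the error message is available as a plain String:

```java
// Hypothetical sketch of a secondary retry policy for non-insertion
// errors, combining both suggestions: a bounded retry budget instead of
// an infinite loop, plus special handling of "table not found".
public class NonInsertionRetryPolicySketch {

    /** Cap retries instead of looping forever (hypothetical knob). */
    public static boolean shouldRetry(int attempt, int maxAttempts) {
        return attempt < maxAttempts;
    }

    /** Detect the "table not found" case described in this issue. */
    public static boolean isTableNotFound(String errorMessage) {
        return errorMessage.contains("Not found: Table");
    }

    /**
     * On "table not found", re-attempt table creation before the next
     * insertion retry; otherwise consult the retry budget.
     */
    public static String nextAction(String errorMessage, int attempt, int maxAttempts) {
        if (isTableNotFound(errorMessage)) {
            return "recreate-table-then-retry";
        }
        return shouldRetry(attempt, maxAttempts) ? "retry" : "fail";
    }
}
```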
Imported from Jira BEAM-9492. Original Jira may contain additional context. Reported by: JoeC_SE.
I actually think the current behavior is correct/reasonable. When we hit a record that can't be inserted (for this reason or others), it is reasonable for BigQueryIO to fail the work item (which is what we do in both batch and streaming mode). We could skip the local retries here, which would slightly improve the experience, though it doesn't help much; it just fails slightly faster.
At that point, the behavior of the full pipeline depends on the runner and on batch vs. streaming mode. Most runners fail pipelines after a few retries in batch mode and retry continuously in streaming mode. This is because the expectation for a streaming pipeline is that it's more feasible to update the pipeline (or in this case create the table) than to relaunch a new pipeline. A streaming pipeline will never stop retrying on most runners, and that is intentional, regardless of the error.
So I don't think we should make a change here.