[Bug]: MongoDB to BigQuery Template works with NONE but not FLATTEN Option
Related Template(s)
MongoDB to BigQuery Batch Job
What happened?
I have a Dataflow job using a default template to transfer data from MongoDB to BigQuery (Batch).
The job works with Option=NONE but not Option=FLATTEN.
With the FLATTEN option, the 'Write to BigQuery' step fails. Could you please help?
Beam Version
Newer than 2.46.0
Relevant log output
Error message from worker: java.lang.RuntimeException: Failed to create job with prefix beam_bq_job_LOAD_back4appuserflatten_b59573e2cbdf456b8483b8e8799a82a3_10959096681912228204186652e04301_00001_00000, reached max retries: 3, last failed job: { "configuration" : { "jobType" : "LOAD", "labels" : { "beam_job_id" : "2023-07-20_13_22_38-14520935356309780491" }, "load" : { "createDisposition" : "CREATE_IF_NEEDED", "destinationTable" : { "datasetId" : "back4app_collections", "projectId" : "sober-sidekick", "tableId" : "user" }, "ignoreUnknownValues" : false, "sourceFormat" : "NEWLINE_DELIMITED_JSON", "useAvroLogicalTypes" : false, "writeDisposition" : "WRITE_APPEND" } }, "etag" : "A3T1le5NNrZrr4R+0l4vqg==", "id" : "sober-sidekick:US.beam_bq_job_LOAD_back4appuserflatten_b59573e2cbdf456b8483b8e8799a82a3_10959096681912228204186652e04301_00001_00000-8", "jobReference" : { "jobId" : "beam_bq_job_LOAD_back4appuserflatten_b59573e2cbdf456b8483b8e8799a82a3_10959096681912228204186652e04301_00001_00000-8", "location" : "US", "projectId" : "sober-sidekick" }, "kind" : "bigquery#job", "selfLink" : "https://bigquery.googleapis.com/bigquery/v2/projects/sober-sidekick/jobs/beam_bq_job_LOAD_back4appuserflatten_b59573e2cbdf456b8483b8e8799a82a3_10959096681912228204186652e04301_00001_00000-8?location=US", "statistics" : { "completionRatio" : 0.0, "creationTime" : "1689884936937", "endTime" : "1689884938478", "reservation_id" : "default-pipeline", "startTime" : "1689884937190", "totalSlotMs" : "598" }, "status" : { "errorResult" : { "location" : "gs://dataflow-staging-us-west3-97140448785/tmp/BigQueryWriteTemp/beam_bq_job_LOAD_back4appuserflatten_b59573e2cbdf456b8483b8e8799a82a3/436c79da-9b3d-45fa-85e4-505fd30d00f3", "message" : "Error while reading data, error message: JSON table encountered too many errors, giving up. Rows: 2; errors: 1. Please look into the errors[] collection for more details. File: gs://dataflow-staging-us-west3-97140448785/tmp/BigQueryWriteTemp/beam_bq_job_LOAD_back4appuserflatten_b59573e2cbdf456b8483b8e8799a82a3/436c79da-9b3d-45fa-85e4-505fd30d00f3", "reason" : "invalid" }, "errors" : [ { "location" : "gs://dataflow-staging-us-west3-97140448785/tmp/BigQueryWriteTemp/beam_bq_job_LOAD_back4appuserflatten_b59573e2cbdf456b8483b8e8799a82a3/436c79da-9b3d-45fa-85e4-505fd30d00f3", "message" : "Error while reading data, error message: JSON table encountered too many errors, giving up. Rows: 2; errors: 1. Please look into the errors[] collection for more details. File: gs://dataflow-staging-us-west3-97140448785/tmp/BigQueryWriteTemp/beam_bq_job_LOAD_back4appuserflatten_b59573e2cbdf456b8483b8e8799a82a3/436c79da-9b3d-45fa-85e4-505fd30d00f3", "reason" : "invalid" }, { "message" : "Error while reading data, error message: JSON processing encountered too many errors, giving up. Rows: 2; errors: 1; max bad: 0; error percent: 0", "reason" : "invalid" }, { "location" : "gs://dataflow-staging-us-west3-97140448785/tmp/BigQueryWriteTemp/beam_bq_job_LOAD_back4appuserflatten_b59573e2cbdf456b8483b8e8799a82a3/436c79da-9b3d-45fa-85e4-505fd30d00f3", "message" : "Error while reading data, error message: JSON parsing error in row starting at position 1018: No such field: lowercased. File: gs://dataflow-staging-us-west3-97140448785/tmp/BigQueryWriteTemp/beam_bq_job_LOAD_back4appuserflatten_b59573e2cbdf456b8483b8e8799a82a3/436c79da-9b3d-45fa-85e4-505fd30d00f3", "reason" : "invalid" } ], "state" : "DONE" }, "user_email" : "[email protected]", "principal_subject" : "serviceAccount:[email protected]" }. 
org.apache.beam.sdk.io.gcp.bigquery.BigQueryHelpers$PendingJob.runJob(BigQueryHelpers.java:201)
org.apache.beam.sdk.io.gcp.bigquery.BigQueryHelpers$PendingJobManager.waitForDone(BigQueryHelpers.java:154)
org.apache.beam.sdk.io.gcp.bigquery.WriteTables$WriteTablesDoFn.finishBundle(WriteTables.java:381)
I think this fails when the Mongo collection has optional fields: flattening assumes every field is present in every document.
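For illustration (hypothetical documents; the field name mirrors the "No such field: lowercased" error in the log above), here is the mismatch I mean:

```javascript
// Hypothetical documents illustrating the suspected failure mode:
// the FLATTEN schema appears to be inferred from one sampled
// document, so optional fields absent from it break later rows.
const sampledDoc = { _id: "1", username: "alice" };                // no "lowercased"
const laterDoc = { _id: "2", username: "Bob", lowercased: "bob" }; // has it

// Columns the destination table ends up with:
const inferredFields = new Set(Object.keys(sampledDoc));

// Fields in a later row that have no matching column; BigQuery's
// JSON load then rejects the row with "No such field: lowercased".
const unknownFields = Object.keys(laterDoc).filter((k) => !inferredFields.has(k));
console.log(unknownFields); // -> [ 'lowercased' ]
```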
Did you find a solution or a workaround? I'm facing the same issue :/
@gaurikapse No, I wrote my own solution for it.
Hi! I just ran into this exact problem. Could you be so kind as to share the solution you wrote to solve it, @elaamrani @ashishjh-bst @gaurikapse?
I wrote this, which fits my use case; it won't work for collections with a lot of data. https://github.com/ashishjh-bst/MongoToBigQueryETL
Thank you! I am actually new to JavaScript, so forgive my noob question. I looked at the code and am wondering whether it will work as a UDF on Dataflow, because I see several inputs are needed as environment variables.
Yes, I think the problem is this: the schema used for the BigQuery destination table is just the schema of the first document in the collection, which may be missing properties that later documents have.
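A possible workaround sketch (the `process(inJson)` signature and the `KNOWN_FIELDS` list are assumptions for illustration, not the template's documented contract): normalize every document to a fixed field list before the write, padding missing optional fields with null and dropping anything the table doesn't know about.

```javascript
// Sketch of a normalization pass, written in the shape of a
// Dataflow JavaScript UDF. The function name, I/O convention,
// and field list are illustrative assumptions only.
const KNOWN_FIELDS = ["_id", "username", "lowercased"]; // illustrative

function process(inJson) {
  const doc = JSON.parse(inJson);
  const out = {};
  for (const field of KNOWN_FIELDS) {
    // Pad optional fields this document lacks so every row
    // matches the destination table schema.
    out[field] = field in doc ? doc[field] : null;
  }
  // Extra fields not in KNOWN_FIELDS are dropped here, which
  // avoids the "No such field" load error at the cost of losing them.
  return JSON.stringify(out);
}
```

The real fix would be for the template to infer the schema from a union of sampled documents rather than just the first one.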