[Bug]: MongoDB to BigQuery Template works with NONE but not FLATTEN Option

Open · elaamrani opened this issue on Jul 20 '23 · 7 comments

Related Template(s)

MongoDB to BigQuery Batch Job

What happened?

I have a Dataflow job using a default template to transfer data from MongoDB to BigQuery (Batch).

The job works with the NONE option but fails with the FLATTEN option.

With the FLATTEN option, the 'Write to BigQuery' step fails. Could you please help?

Beam Version

Newer than 2.46.0

Relevant log output

Error message from worker: java.lang.RuntimeException: Failed to create job with prefix beam_bq_job_LOAD_back4appuserflatten_b59573e2cbdf456b8483b8e8799a82a3_10959096681912228204186652e04301_00001_00000, reached max retries: 3, last failed job: { "configuration" : { "jobType" : "LOAD", "labels" : { "beam_job_id" : "2023-07-20_13_22_38-14520935356309780491" }, "load" : { "createDisposition" : "CREATE_IF_NEEDED", "destinationTable" : { "datasetId" : "back4app_collections", "projectId" : "sober-sidekick", "tableId" : "user" }, "ignoreUnknownValues" : false, "sourceFormat" : "NEWLINE_DELIMITED_JSON", "useAvroLogicalTypes" : false, "writeDisposition" : "WRITE_APPEND" } }, "etag" : "A3T1le5NNrZrr4R+0l4vqg==", "id" : "sober-sidekick:US.beam_bq_job_LOAD_back4appuserflatten_b59573e2cbdf456b8483b8e8799a82a3_10959096681912228204186652e04301_00001_00000-8", "jobReference" : { "jobId" : "beam_bq_job_LOAD_back4appuserflatten_b59573e2cbdf456b8483b8e8799a82a3_10959096681912228204186652e04301_00001_00000-8", "location" : "US", "projectId" : "sober-sidekick" }, "kind" : "bigquery#job", "selfLink" : "https://bigquery.googleapis.com/bigquery/v2/projects/sober-sidekick/jobs/beam_bq_job_LOAD_back4appuserflatten_b59573e2cbdf456b8483b8e8799a82a3_10959096681912228204186652e04301_00001_00000-8?location=US", "statistics" : { "completionRatio" : 0.0, "creationTime" : "1689884936937", "endTime" : "1689884938478", "reservation_id" : "default-pipeline", "startTime" : "1689884937190", "totalSlotMs" : "598" }, "status" : { "errorResult" : { "location" : "gs://dataflow-staging-us-west3-97140448785/tmp/BigQueryWriteTemp/beam_bq_job_LOAD_back4appuserflatten_b59573e2cbdf456b8483b8e8799a82a3/436c79da-9b3d-45fa-85e4-505fd30d00f3", "message" : "Error while reading data, error message: JSON table encountered too many errors, giving up. Rows: 2; errors: 1. Please look into the errors[] collection for more details. File: gs://dataflow-staging-us-west3-97140448785/tmp/BigQueryWriteTemp/beam_bq_job_LOAD_back4appuserflatten_b59573e2cbdf456b8483b8e8799a82a3/436c79da-9b3d-45fa-85e4-505fd30d00f3", "reason" : "invalid" }, "errors" : [ { "location" : "gs://dataflow-staging-us-west3-97140448785/tmp/BigQueryWriteTemp/beam_bq_job_LOAD_back4appuserflatten_b59573e2cbdf456b8483b8e8799a82a3/436c79da-9b3d-45fa-85e4-505fd30d00f3", "message" : "Error while reading data, error message: JSON table encountered too many errors, giving up. Rows: 2; errors: 1. Please look into the errors[] collection for more details. File: gs://dataflow-staging-us-west3-97140448785/tmp/BigQueryWriteTemp/beam_bq_job_LOAD_back4appuserflatten_b59573e2cbdf456b8483b8e8799a82a3/436c79da-9b3d-45fa-85e4-505fd30d00f3", "reason" : "invalid" }, { "message" : "Error while reading data, error message: JSON processing encountered too many errors, giving up. Rows: 2; errors: 1; max bad: 0; error percent: 0", "reason" : "invalid" }, { "location" : "gs://dataflow-staging-us-west3-97140448785/tmp/BigQueryWriteTemp/beam_bq_job_LOAD_back4appuserflatten_b59573e2cbdf456b8483b8e8799a82a3/436c79da-9b3d-45fa-85e4-505fd30d00f3", "message" : "Error while reading data, error message: JSON parsing error in row starting at position 1018: No such field: lowercased. File: gs://dataflow-staging-us-west3-97140448785/tmp/BigQueryWriteTemp/beam_bq_job_LOAD_back4appuserflatten_b59573e2cbdf456b8483b8e8799a82a3/436c79da-9b3d-45fa-85e4-505fd30d00f3", "reason" : "invalid" } ], "state" : "DONE" }, "user_email" : "[email protected]", "principal_subject" : "serviceAccount:[email protected]" }. 
org.apache.beam.sdk.io.gcp.bigquery.BigQueryHelpers$PendingJob.runJob(BigQueryHelpers.java:201)
org.apache.beam.sdk.io.gcp.bigquery.BigQueryHelpers$PendingJobManager.waitForDone(BigQueryHelpers.java:154)
org.apache.beam.sdk.io.gcp.bigquery.WriteTables$WriteTablesDoFn.finishBundle(WriteTables.java:381)

elaamrani avatar Jul 20 '23 20:07 elaamrani
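The key line in the log is `JSON parsing error in row starting at position 1018: No such field: lowercased`: the load job's schema does not contain a field that appears in some rows, and the job runs with `ignoreUnknownValues: false`, so the whole load fails. A minimal sketch of that mismatch, using hypothetical sample documents (only the `lowercased` field name is taken from the error above):

```javascript
// Minimal sketch (plain Node.js, no dependencies): why a schema inferred from
// one document can reject later documents. Both sample documents are
// hypothetical; "lowercased" is borrowed from the error message above.
const first = { _id: "1", username: "Alice" };
const later = { _id: "2", username: "Bob", lowercased: "bob" }; // extra optional field

// Columns inferred from the first document only (what the FLATTEN path
// appears to do; see the later comments in this thread).
const inferredColumns = new Set(Object.keys(first));

// Fields in a later document that the inferred schema does not know about.
// BigQuery's JSON load rejects these with "No such field: <name>" when
// ignoreUnknownValues is false, as in the failed job config above.
const unknown = Object.keys(later).filter((k) => !inferredColumns.has(k));
console.log(unknown); // ["lowercased"]
```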

I think this fails when the Mongo collection has optional fields: the schema used for flattening assumes every field is present in every document.

ashishjh-bst avatar Jul 26 '23 12:07 ashishjh-bst

Did you find a solution or a workaround? I'm facing the same issue :/

gaurikapse avatar Dec 18 '23 12:12 gaurikapse

> Did you find a solution or a workaround? I'm facing the same issue :/

@gaurikapse No, I wrote my own solution for it.

ashishjh-bst avatar Dec 18 '23 13:12 ashishjh-bst

Hi! I just ran into this exact problem. Could you be so kind as to share what kind of solution you wrote to solve it? @elaamrani @ashishjh-bst @gaurikapse

raymsimhi avatar Jan 08 '24 14:01 raymsimhi

> Hi! I just ran into this exact problem. Could you be so kind as to share what kind of solution you wrote to solve it? @elaamrani @ashishjh-bst @gaurikapse

I wrote this; it fits my use case but won't work for collections with a lot of data: https://github.com/ashishjh-bst/MongoToBigQueryETL

ashishjh-bst avatar Jan 08 '24 14:01 ashishjh-bst

> Hi! I just ran into this exact problem. Could you be so kind as to share what kind of solution you wrote to solve it? @elaamrani @ashishjh-bst @gaurikapse

> I wrote this; it fits my use case but won't work for collections with a lot of data: https://github.com/ashishjh-bst/MongoToBigQueryETL

Thank you. I am actually new to JavaScript, so forgive my noob question. I looked at the code and am wondering whether it will work as a UDF on Dataflow, because I see that several inputs are needed as environment variables.

raymsimhi avatar Jan 08 '24 14:01 raymsimhi
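On the UDF question: a standalone Node.js script that reads environment variables and connects to MongoDB itself cannot run as a Dataflow UDF. Where a template accepts a UDF at all, it is a single self-contained JavaScript function that receives one serialized element and returns the transformed element, with no environment variables or network access; whether this particular template version accepts a UDF is something to verify against its parameter list. A sketch of the general shape (the `transform` name and the dropped field are hypothetical):

```javascript
// General shape of a Dataflow template UDF, for contrast with a standalone
// script: one pure function, JSON string in, JSON string out. The engine that
// runs template UDFs is conservative, so stick to plain ES5-style JavaScript.
function transform(inJson) {
  var obj = JSON.parse(inJson);
  // Hypothetical tweak: drop a property the destination schema lacks.
  delete obj.lowercased;
  return JSON.stringify(obj);
}
```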

Yes, I think the problem is this

That is used as the BigQuery destination table schema, but it is just the schema of the first document in the collection, which may be missing some properties that appear in later documents.

ggprod avatar Jun 30 '24 02:06 ggprod
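If the destination schema really is derived from a single document, one possible workaround is to derive the column set from a sample of the collection instead, so that optional fields are covered. A sketch using the official `mongodb` Node.js driver (the connection details are placeholders; note that a sample can still miss rare fields, and it says nothing about type conflicts between documents):

```javascript
// Hedged workaround sketch: collect the union of top-level field names across
// a sample of documents, rather than trusting the first document alone.
// Assumes the official `mongodb` Node.js driver; URI and names are placeholders.
const { MongoClient } = require("mongodb");

async function unionOfTopLevelFields(uri, dbName, collName, sampleSize = 1000) {
  const client = new MongoClient(uri);
  try {
    await client.connect();
    const coll = client.db(dbName).collection(collName);
    const fields = new Set();
    // Union the top-level keys of up to `sampleSize` documents.
    for await (const doc of coll.find({}, { limit: sampleSize })) {
      Object.keys(doc).forEach((k) => fields.add(k));
    }
    return [...fields];
  } finally {
    await client.close();
  }
}

// Example (placeholder values):
// unionOfTopLevelFields("mongodb://localhost:27017", "back4app", "user")
//   .then((fields) => console.log(fields));
```

The resulting field list could then feed a manually supplied BigQuery schema, sidestepping the single-document inference.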