[Bug]: UDF does not get applied in the 'MongoDB to BigQuery' Batch Template

Open elaamrani opened this issue 1 year ago • 6 comments

Related Template(s)

MongoDB to BigQuery Batch Job

What happened?

I created a "MongoDB to BigQuery" job using Google's default template with the NONE option (instead of FLATTEN) and added a UDF with the following function:

function process(inJson) {
  const obj = JSON.parse(inJson);

  obj.newField = 1;

  return JSON.stringify(obj);
}

I expected to have 4 columns (id, source_data, timestamp and newField). However, I only got 3 without the newField column.

Thank you for your help!

Beam Version

Newer than 2.46.0

Relevant log output

NA

elaamrani avatar Jul 20 '23 22:07 elaamrani

This is because they expect you to give a document instead of stringified JSON text. This isn't documented anywhere except in the README in this repo for that template.

Additionally, in the tutorial that covers this, the example UDF is as follows:


function transform(inputDoc) {
   var outputDoc = new Object();
   inputDoc["City"] = inputDoc["Address"]["City"];
   delete doc.Address;
   outputDoc = doc;
   return returnObj;
}

I am not sure what is going on in this:

  inputDoc["City"] = inputDoc["Address"]["City"];

All the modifications happen on inputDoc, but the delete is on doc, which isn't defined anywhere. Then doc is copied to outputDoc, but outputDoc is never returned; instead, another undefined variable, returnObj, is returned from the UDF.

@theshanbhag can you provide a better, working example here, since you are a co-author of the tutorial?

ashishjh-bst avatar Jul 26 '23 16:07 ashishjh-bst

I think UDF is broken for (at least) the batch template. I can't get it to work.

wieringen avatar Aug 25 '23 13:08 wieringen

@wieringen

I think UDF is broken for (at least) the batch template. I can't get it to work.

I was able to get it to work, but it is too inflexible and breaks if your collection has documents with optional fields. I ended up writing my own solution to do this ETL.

ashishjh-bst avatar Aug 25 '23 13:08 ashishjh-bst

Luckily I don't have optional fields. How did you get it to work? The tutorial doesn't make any sense (like you already pointed out). The snippet of code in the README is also strange.

https://github.com/GoogleCloudPlatform/DataflowTemplates/blob/b272af112ef40a0ac75ccc4cccf2a422921f3514/v2/mongodb-to-googlecloud/docs/MongoDbToBigQuery/README.md?plain=1#L165

/**
 * A transform which adds a field to the incoming data.
 * @param {Document} doc
 * @return {Document} returnObj
 */
 function transform(doc) {
    var obj = doc;
    var returnObj = new Object();
    return returnObj;
  }

returnObj is not a BSON document and obj is not referenced anywhere.

/**
 * A transform which adds a field to the incoming data.
 * @param {Document} doc
 * @return {Document} returnObj
 */
 function transform(doc) {
    delete doc.name;
    return doc;
  }

The code above "works" in the sense that it runs, but no transform is applied.
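
One possible explanation, assuming (as a later comment in this thread reports) that the template hands the UDF a JSON string rather than an object: `delete doc.name` on a string is a silent no-op, so the input comes back unchanged. The sketch below only illustrates that JavaScript behavior outside of Dataflow; the function names and the sample document are made up for the demonstration.

```javascript
// Assumption: the UDF receives each MongoDB document as a JSON string.

// Mirrors the snippet above: no parsing, so `doc` is a string primitive.
function transformNoParse(doc) {
  delete doc.name; // no-op on a string; does not throw
  return doc;
}

// Same intent, but parses first and stringifies on the way out.
function transformWithParse(doc) {
  var parsed = JSON.parse(doc);
  delete parsed.name;
  return JSON.stringify(parsed);
}

// Hypothetical sample document, serialized the way the UDF is assumed to see it.
var input = JSON.stringify({ _id: 1, name: "a", city: "b" });

var unchanged = transformNoParse(input);
var changed = transformWithParse(input);

console.log(unchanged === input); // the string is returned as-is
console.log(changed);             // the "name" key is gone
```

If the template actually passes an object in some versions, this explanation does not apply, so it is worth logging `typeof doc` inside the UDF to check.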

@theshanbhag Do you have an idea how to implement this correctly?

wieringen avatar Aug 25 '23 13:08 wieringen

/**
 * A transform which adds a field to the incoming data.
 * @param {string} doc the incoming document as a JSON string
 * @return {string} the modified document, stringified again
 */
 function transform(doc) {
    var parsedDoc = JSON.parse(doc);
    // do stuff here
    parsedDoc.z = "test";

    // return after stringifying
    return JSON.stringify(parsedDoc);
  }

Something like this worked for me, @wieringen. You have to stringify the doc before returning it, otherwise it doesn't work.
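
For reference, here is a sketch of how the tutorial's Address/City example quoted earlier might look with the same parse-then-stringify pattern. It assumes string in, string out, as described above; the guard for a missing Address and the sample document are additions for illustration and are not taken from the tutorial or the template docs.

```javascript
// Hypothetical rewrite of the tutorial's example.
// Assumption: the UDF receives a JSON string and must return a JSON string.
function transform(inputDoc) {
  var doc = JSON.parse(inputDoc);

  // Copy the nested city to the top level only when Address is present,
  // so documents without an Address do not throw.
  if (doc.Address && doc.Address.City !== undefined) {
    doc.City = doc.Address.City;
  }
  delete doc.Address;

  return JSON.stringify(doc);
}

// Local check with made-up sample documents (run with Node, not Dataflow).
var withAddress = JSON.stringify({ _id: "1", Address: { City: "Pune", Zip: "411001" } });
var withoutAddress = JSON.stringify({ _id: "2" });

var out = JSON.parse(transform(withAddress));
var outNoAddress = JSON.parse(transform(withoutAddress));

console.log(out);
console.log(outNoAddress);
```

Running the function locally like this before uploading it to Cloud Storage at least catches undefined-variable mistakes such as the ones in the tutorial snippet. It does not say anything about how the resulting fields end up in the BigQuery schema.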

In case you end up getting frustrated and dropping it, this is my solution: https://github.com/ashishjh-bst/MongoToBigQueryETL

ashishjh-bst avatar Aug 25 '23 13:08 ashishjh-bst

Thanks, I will try it out!

https://github.com/GoogleCloudPlatform/DataflowTemplates/blob/b272af112ef40a0ac75ccc4cccf2a422921f3514/v2/mongodb-to-googlecloud/src/test/java/com/google/cloud/teleport/v2/mongodb/templates/MongoDbToBigQueryIT.java#L128

Looks like the same pattern is being used in the unit tests.

wieringen avatar Aug 25 '23 13:08 wieringen

This issue has been marked as stale due to 180 days of inactivity. It will be closed in 1 week if no further activity occurs. If you think that’s incorrect or this pull request requires a review, please simply write any comment. If closed, you can revive the issue at any time. Thank you for your contributions.

github-actions[bot] avatar May 20 '24 14:05 github-actions[bot]

This issue has been closed due to lack of activity. If you think that is incorrect, or the pull request requires review, you can revive the PR at any time.

github-actions[bot] avatar May 28 '24 02:05 github-actions[bot]