DataflowTemplates
[Bug]: UDF does not get applied in the 'MongoDB to BigQuery' Batch Template
Related Template(s)
MongoDB to BigQuery Batch Job
What happened?
I created a "MongoDB to BigQuery" job using Google's default template with the NONE option (instead of FLATTEN) and added a UDF with the following function:
function process(inJson) {
  const obj = JSON.parse(inJson);
  obj.newField = 1;
  return JSON.stringify(obj);
}
I expected to have 4 columns (id, source_data, timestamp and newField). However, I only got 3 without the newField column.
Thank you for your help!
Beam Version
Newer than 2.46.0
Relevant log output
NA
This is because the template expects you to work with a document instead of stringified JSON text. This isn't documented anywhere except in the README for that template in this repo.
Additionally, in the tutorial that covers this, the example UDF is as follows:
function transform(inputDoc) {
var outputDoc = new Object();
inputDoc["City"] = inputDoc["Address"]["City"];
delete doc.Address;
outputDoc = doc;
return returnObj;
}
I am not sure what is going on here. The assignment inputDoc["City"] = inputDoc["Address"]["City"]; modifies inputDoc, but the delete is applied to doc, which isn't defined anywhere. Then doc is copied to outputDoc, but outputDoc is never returned; instead, another undefined variable, returnObj, is returned from the UDF.
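For reference, here is a minimal sketch of what the tutorial example appears to be aiming for, assuming the intent is to lift Address.City to a top-level City field and then drop Address. This is a guess at the intent, not the tutorial authors' code, and the sample document is made up. It only exercises the logic on a plain JavaScript object; whether the template hands the UDF a document object or a JSON string is exactly the open question in this thread.

```javascript
// Assumed intent of the tutorial's UDF: promote Address.City to a
// top-level City field, then remove the nested Address object.
function transform(inputDoc) {
  // Leave documents without an Address untouched.
  if (inputDoc.Address !== undefined && inputDoc.Address !== null) {
    inputDoc.City = inputDoc.Address.City;
    delete inputDoc.Address;
  }
  return inputDoc;
}

// Local check with a hypothetical sample document (not from the tutorial).
const sample = { Name: "Alice", Address: { City: "Pune", Zip: "411001" } };
console.log(JSON.stringify(transform(sample))); // {"Name":"Alice","City":"Pune"}
```

Every variable used here is defined, and the same object that is modified is the one returned, which is what the tutorial snippet gets wrong.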
@theshanbhag can you provide a better and a working example here, since you are a co-author on the Tutorial.
I think UDF is broken for (at least) the batch template. I can't get it to work.
@wieringen
> I think UDF is broken for (at least) the batch template. I can't get it to work.
I was able to get it to work, but it is too inflexible and breaks if your collection has documents with optional fields. I ended up writing my own solution to do this ETL.
Luckily I don't have optional fields, how did you get it to work? The tutorial doesn't make any sense (like you already pointed out). The snippet of code in the readme is also strange.
https://github.com/GoogleCloudPlatform/DataflowTemplates/blob/b272af112ef40a0ac75ccc4cccf2a422921f3514/v2/mongodb-to-googlecloud/docs/MongoDbToBigQuery/README.md?plain=1#L165
/**
* A transform which adds a field to the incoming data.
* @param {Document} doc
* @return {Document} returnObj
*/
function transform(doc) {
var obj = doc;
var returnObj = new Object();
return returnObj;
}
returnObj is not a BSON document and obj is not referenced anywhere.
/**
* A transform which adds a field to the incoming data.
* @param {Document} doc
* @return {Document} returnObj
*/
function transform(doc) {
delete doc.name;
return doc;
}
The code above "works" in the sense that it runs, but no transform is applied.
@theshanbhag Do you have an idea how to implement this correctly?
/**
 * A transform which adds a field to the incoming data.
 * @param {Document} doc
 * @return {Document} returnObj
 */
function transform(doc) {
  // Parse the incoming JSON text into an object.
  var parsedDoc = JSON.parse(doc);
  // Do stuff here.
  parsedDoc.z = "test";
  // Return after stringifying.
  return JSON.stringify(parsedDoc);
}
Something like this worked for me, @wieringen. You have to stringify the doc before returning it; otherwise it doesn't work.
In case you end up getting frustrated and dropping it, this is my solution: https://github.com/ashishjh-bst/MongoToBigQueryETL
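A quick way to sanity-check a UDF like this before deploying is to run it locally, for example in Node.js. The sketch below assumes the UDF receives the document as a JSON string, as reported above. The sample input is hypothetical, and passing this check only confirms the JavaScript logic, not how the template invokes the function.

```javascript
// UDF using the parse / modify / stringify pattern reported in this thread.
function transform(doc) {
  const parsedDoc = JSON.parse(doc);
  parsedDoc.z = "test";
  return JSON.stringify(parsedDoc);
}

// Hypothetical sample input; the template would supply the real document.
const input = JSON.stringify({ _id: "abc123", name: "example" });
const output = transform(input);
console.log(output); // {"_id":"abc123","name":"example","z":"test"}
```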
Thanks, I will try it out!
https://github.com/GoogleCloudPlatform/DataflowTemplates/blob/b272af112ef40a0ac75ccc4cccf2a422921f3514/v2/mongodb-to-googlecloud/src/test/java/com/google/cloud/teleport/v2/mongodb/templates/MongoDbToBigQueryIT.java#L128
Looks like the same pattern is being used in the unit tests.
This issue has been marked as stale due to 180 days of inactivity. It will be closed in 1 week if no further activity occurs. If you think that’s incorrect or this pull request requires a review, please simply write any comment. If closed, you can revive the issue at any time. Thank you for your contributions.
This issue has been closed due to lack of activity. If you think that is incorrect, or the pull request requires review, you can revive the PR at any time.