
NIFI-13843 - use record.isDropUnknownFields() to honor JSON Writer Schema

Open pvillard31 opened this issue 1 year ago • 2 comments

Summary

NIFI-13843

Consider the following use case:

  • GenerateFlowFile (GFF) processor, generating a JSON with 3 fields: a, b, and c
  • ConvertRecord with JSON Reader / JSON Writer
    • Both reader and writer are configured with a schema only specifying fields a and b

The expected result is a JSON that only contains fields a and b.

We follow this path in the code:

  • AbstractRecordProcessor (L131)
Record firstRecord = reader.nextRecord(); 

In this case, the default method for nextRecord() is defined in RecordReader (L50)

default Record nextRecord() throws IOException, MalformedRecordException {
    return nextRecord(true, false);
} 

where we are NOT dropping the unknown fields (the Javadoc needs fixing here, as it states the opposite)

We get to

writer.write(firstRecord); 

which gets us to

  • WriteJsonResult (L206)

Here, we do a check

isUseSerializeForm(record, writeSchema) 

which currently returns true when it should not. Because of this, we write the serialized form, which ignores the writer schema.

Inside isUseSerializeForm(), we check

record.getSchema().equals(writeSchema) 

But at this point record.getSchema() returns the schema defined in the reader, which is equal to the one defined in the writer, even though the record holds additional fields not declared in that schema.
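To illustrate why that equality check passes, here is a simplified sketch: schemas are modeled as plain field-name lists rather than the real NiFi RecordSchema objects, and the record is just a map. Schema equality compares the declared fields, not the values the record actually holds.

```java
import java.util.List;
import java.util.Map;
import java.util.Set;

public class SchemaEqualityDemo {
    public static void main(String[] args) {
        // Both reader and writer declare only fields a and b
        List<String> readerSchema = List.of("a", "b");
        List<String> writerSchema = List.of("a", "b");

        // A record read with dropUnknownFields = false still carries field "c"
        Map<String, Object> recordValues = Map.of("a", 1, "b", 2, "c", 3);

        // Schema equality compares declared fields, not the values held
        System.out.println(readerSchema.equals(writerSchema));                      // true
        System.out.println(recordValues.keySet().equals(Set.copyOf(writerSchema))); // false
    }
}
```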

The suggested fix is to also add a check on

record.isDropUnknownFields() 

If dropUnknownFields is false, then we do not use the serialized form.
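A minimal sketch of the suggested guard, using simplified stand-ins for the NiFi Record API (SimpleRecord and RecordSchema here are hypothetical, not the real org.apache.nifi.serialization classes):

```java
import java.util.List;
import java.util.Map;

public class SerializedFormCheck {

    // Simplified stand-in for the NiFi RecordSchema
    record RecordSchema(List<String> fieldNames) {}

    // Simplified stand-in for a NiFi Record
    static final class SimpleRecord {
        final RecordSchema schema;
        final Map<String, Object> values;
        final boolean dropUnknownFields;
        final Object serializedForm;

        SimpleRecord(RecordSchema schema, Map<String, Object> values,
                     boolean dropUnknownFields, Object serializedForm) {
            this.schema = schema;
            this.values = values;
            this.dropUnknownFields = dropUnknownFields;
            this.serializedForm = serializedForm;
        }

        RecordSchema getSchema() { return schema; }
        boolean isDropUnknownFields() { return dropUnknownFields; }
        Object getSerializedForm() { return serializedForm; }
    }

    // The suggested guard: only reuse the serialized form when the schemas match
    // AND the record was read with dropUnknownFields = true; otherwise the record
    // may still carry fields that the write schema does not declare.
    static boolean isUseSerializedForm(SimpleRecord record, RecordSchema writeSchema) {
        return record.getSerializedForm() != null
                && record.getSchema().equals(writeSchema)
                && record.isDropUnknownFields();
    }

    public static void main(String[] args) {
        RecordSchema schema = new RecordSchema(List.of("a", "b"));

        // Read with dropUnknownFields = false: field "c" survived the reader
        SimpleRecord withUnknowns =
                new SimpleRecord(schema, Map.of("a", 1, "b", 2, "c", 3), false, "raw");
        // Read with dropUnknownFields = true: only declared fields remain
        SimpleRecord dropped =
                new SimpleRecord(schema, Map.of("a", 1, "b", 2), true, "raw");

        System.out.println(isUseSerializedForm(withUnknowns, schema)); // false
        System.out.println(isUseSerializedForm(dropped, schema));      // true
    }
}
```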

While this approach does solve the problem, I suspect it may have a performance impact, given that the default nextRecord() currently keeps the unknown fields.

Tracking

Please complete the following tracking steps prior to pull request creation.

Issue Tracking

Pull Request Tracking

  • [ ] Pull Request title starts with Apache NiFi Jira issue number, such as NIFI-00000
  • [ ] Pull Request commit message starts with Apache NiFi Jira issue number, such as NIFI-00000

Pull Request Formatting

  • [ ] Pull Request based on current revision of the main branch
  • [ ] Pull Request refers to a feature branch with one commit containing changes

Verification

Please indicate the verification steps performed prior to pull request creation.

Build

  • [ ] Build completed using mvn clean install -P contrib-check
    • [ ] JDK 21

Licensing

  • [ ] New dependencies are compatible with the Apache License 2.0 according to the License Policy
  • [ ] New dependencies are documented in applicable LICENSE and NOTICE files

Documentation

  • [ ] Documentation formatting appears as expected in rendered files

pvillard31 avatar Oct 04 '24 15:10 pvillard31

While this does solve the issue, I'm a bit conflicted about the current approach. Not only could this have a performance impact (we would likely use the serialized form less often), but it also feels like the default should be to ignore unknown fields when reading the record.

If we consider the below scenario:

  • GenerateFlowFile (GFF) processor, generating a JSON with 3 fields: a, b, and c
  • ConvertRecord with JSON Reader / JSON Writer
    • JSON reader with a schema only specifying fields a and b
    • JSON writer with a schema specifying fields a, b, and c (c defaulting to null)

It feels like the expected result should be a JSON with field c set to a null value, because the reader would drop the field when converting the JSON into a record before passing it to the writer.
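Under that expectation, the write step would behave like the following sketch. This is a hypothetical simplification of how a record writer applies defaults, not the actual WriteJsonResult logic: the writer iterates over its own schema and emits null for fields the record no longer holds.

```java
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

public class WriterDefaultsDemo {
    public static void main(String[] args) {
        // Writer schema declares a, b, and c (c defaulting to null)
        List<String> writerSchema = List.of("a", "b", "c");

        // Record as produced by a reader with schema (a, b) and dropUnknownFields = true
        Map<String, Object> record = Map.of("a", 1, "b", 2);

        // The writer emits every field in its schema, defaulting missing ones to null
        Map<String, Object> out = new LinkedHashMap<>();
        for (String field : writerSchema) {
            out.put(field, record.getOrDefault(field, null));
        }
        System.out.println(out); // {a=1, b=2, c=null}
    }
}
```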

If we agree on the above, then it may be easier to just override nextRecord() in AbstractJsonRowRecordReader and default to nextRecord(true, true).
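A minimal sketch of that override, using simplified stand-ins for the NiFi interfaces (the real RecordReader and AbstractJsonRowRecordReader live in the NiFi codebase; this stub only illustrates the one-line change):

```java
import java.io.IOException;

public class NextRecordDefault {

    interface Record {}

    // Simplified stand-in for org.apache.nifi.serialization.RecordReader
    interface RecordReader {
        Record nextRecord(boolean coerceTypes, boolean dropUnknownFields) throws IOException;

        // Current default: unknown fields are kept (dropUnknownFields = false)
        default Record nextRecord() throws IOException {
            return nextRecord(true, false);
        }
    }

    // Proposed: the JSON reader overrides the default so unknown fields are dropped
    abstract static class AbstractJsonRowRecordReader implements RecordReader {
        @Override
        public Record nextRecord() throws IOException {
            return nextRecord(true, true);
        }
    }

    public static void main(String[] args) throws IOException {
        // Capture which arguments the no-arg nextRecord() forwards
        boolean[] captured = new boolean[2];
        AbstractJsonRowRecordReader reader = new AbstractJsonRowRecordReader() {
            @Override
            public Record nextRecord(boolean coerceTypes, boolean dropUnknownFields) {
                captured[0] = coerceTypes;
                captured[1] = dropUnknownFields;
                return null;
            }
        };
        reader.nextRecord();
        System.out.println(captured[0] + " " + captured[1]); // true true
    }
}
```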

pvillard31 avatar Oct 04 '24 17:10 pvillard31

Thanks for summarizing the options and highlighting this alternative approach @pvillard31.

Based on your summary, I agree that changing the behavior of nextRecord() to set dropUnknownFields to true in AbstractJsonRowRecordReader seems like a better way forward.

exceptionfactory avatar Oct 21 '24 12:10 exceptionfactory

Yeah, I think I would also agree that defaulting dropUnknownFields to true makes the most sense. One place I can think of where we need to be careful that we're doing the right thing is ValidateRecord, but that processor explicitly passes a value for the field:

while ((record = reader.nextRecord(coerceTypes, false)) != null) {
    ...
}

And any case in which we infer the schema, we should have an entry for every field that is found because we infer the schema based on scanning all records in the flowfile first.

So I think it makes sense to drop unknown fields. It also is consistent with what the JavaDocs state will happen :) So I'm a +1 on that approach.

markap14 avatar Oct 28 '24 19:10 markap14

Thanks @markap14 - will update the PR shortly

pvillard31 avatar Oct 28 '24 19:10 pvillard31

Changing the default has impacts across many unit tests and I'll need to spend more time looking into this. I just pushed what I currently have and I switched this PR to draft mode.

pvillard31 avatar Oct 28 '24 21:10 pvillard31

Thanks for moving this forward @pvillard31. Reviewing the changes locally, I found the dropUnknownFields check in MapRecord was blocking the record field from being updated. Removing that check seems to have yielded the expected behavior, but I am still evaluating the impact. I pushed a commit with that change, along with an optimization to the field removal handling to avoid creating a deep copy of the JsonNode when it is not necessary.

exceptionfactory avatar Oct 29 '24 13:10 exceptionfactory

Thanks @exceptionfactory @pvillard31 I think the changes here make a lot of sense. I did some testing locally with ConvertRecord and UpdateRecord and everything appears to work as expected. Also ran a quick performance comparison before/after these changes and saw roughly the same numbers so all good there. I'm a +1 on these changes.

markap14 avatar Oct 29 '24 19:10 markap14