NIFI-13843 - use record.isDropUnknownFields() to honor JSON Writer Schema
Summary
Consider the following use case:

- GFF Processor, generating a JSON with 3 fields: `a`, `b`, and `c`
- ConvertRecord with JSON Reader / JSON Writer
- Both reader and writer are configured with a schema only specifying fields `a` and `b`

The expected result is a JSON that only contains fields `a` and `b`.
We're following the below path in the code, starting in `AbstractRecordProcessor` (L131):

```java
Record firstRecord = reader.nextRecord();
```

In this case, the default method for `nextRecord()` is defined in `RecordReader` (L50):

```java
default Record nextRecord() throws IOException, MalformedRecordException {
    return nextRecord(true, false);
}
```
where we are NOT dropping the unknown fields (the Javadoc needs some fixing here, as it is saying the opposite).
We then get to

```java
writer.write(firstRecord);
```

which takes us to `WriteJsonResult` (L206). Here, we do a check:

```java
isUseSerializeForm(record, writeSchema)
```

which currently returns `true` when it should not. Because of this, we write the serialised form, which ignores the writer schema.

In `isUseSerializeForm()`, we do check

```java
record.getSchema().equals(writeSchema)
```

but at this point `record.getSchema()` returns the schema defined in the reader, which is equal to the one defined in the writer - even though the record has additional fields compared to the defined schema.
The suggested fix is to also add a check on `record.isDropUnknownFields()`: if `dropUnknownFields` is `false`, then we do not use the serialised form.
While this approach does solve the problem, I suspect it may have a performance impact, given that the default `nextRecord()` currently keeps the unknown fields.
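To make the suggested fix concrete, here is a minimal, hypothetical sketch of the proposed condition. `SimpleSchema`, `SimpleRecord`, and `isUseSerializedForm` are simplified stand-ins, not the actual NiFi API; the point is only that schema equality alone cannot prove the serialised form is safe to reuse.

```java
import java.util.List;
import java.util.Map;

// Simplified stand-in for NiFi's RecordSchema (illustrative only).
record SimpleSchema(List<String> fieldNames) { }

// Simplified stand-in for NiFi's Record (illustrative only).
class SimpleRecord {
    private final SimpleSchema schema;
    private final Map<String, Object> values;
    private final boolean dropUnknownFields;
    private final String serializedForm;

    SimpleRecord(SimpleSchema schema, Map<String, Object> values,
                 boolean dropUnknownFields, String serializedForm) {
        this.schema = schema;
        this.values = values;
        this.dropUnknownFields = dropUnknownFields;
        this.serializedForm = serializedForm;
    }

    SimpleSchema getSchema() { return schema; }
    boolean isDropUnknownFields() { return dropUnknownFields; }
    String getSerializedForm() { return serializedForm; }
}

public class SerializedFormCheck {
    // Proposed condition: reuse the pre-serialized form only when the schemas
    // match AND unknown fields were dropped on read, so the serialized form
    // cannot contain fields outside the writer schema.
    static boolean isUseSerializedForm(SimpleRecord record, SimpleSchema writeSchema) {
        return record.getSerializedForm() != null
                && record.getSchema().equals(writeSchema)
                && record.isDropUnknownFields();
    }

    public static void main(String[] args) {
        SimpleSchema schema = new SimpleSchema(List.of("a", "b"));

        // Read with dropUnknownFields=false: the record still carries field "c",
        // so reusing its serialized form would ignore the writer schema.
        SimpleRecord kept = new SimpleRecord(schema, Map.of("a", 1, "b", 2, "c", 3),
                false, "{\"a\":1,\"b\":2,\"c\":3}");

        // Read with dropUnknownFields=true: the serialized form is safe to reuse.
        SimpleRecord dropped = new SimpleRecord(schema, Map.of("a", 1, "b", 2),
                true, "{\"a\":1,\"b\":2}");

        System.out.println(isUseSerializedForm(kept, schema));    // false
        System.out.println(isUseSerializedForm(dropped, schema)); // true
    }
}
```

The `kept` record is exactly the failure mode described above: its schema `equals()` the writer schema, yet its serialised form still contains `c`.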
Tracking
Please complete the following tracking steps prior to pull request creation.
Issue Tracking
- [ ] Apache NiFi Jira issue created
Pull Request Tracking
- [ ] Pull Request title starts with Apache NiFi Jira issue number, such as `NIFI-00000`
- [ ] Pull Request commit message starts with Apache NiFi Jira issue number, such as `NIFI-00000`
Pull Request Formatting
- [ ] Pull Request based on current revision of the `main` branch
- [ ] Pull Request refers to a feature branch with one commit containing changes
Verification
Please indicate the verification steps performed prior to pull request creation.
Build
- [ ] Build completed using `mvn clean install -P contrib-check`
- [ ] JDK 21
Licensing
- [ ] New dependencies are compatible with the Apache License 2.0 according to the License Policy
- [ ] New dependencies are documented in applicable `LICENSE` and `NOTICE` files
Documentation
- [ ] Documentation formatting appears as expected in rendered files
While this does solve the issue, I'm a bit conflicted about the current approach. Not only could this have a performance impact (we are likely to use the serialized form less often), but it also feels like the default should be to ignore the unknown fields when reading the record.
If we consider the below scenario:

- GFF Processor, generating a JSON with 3 fields: `a`, `b`, and `c`
- ConvertRecord with JSON Reader / JSON Writer
- JSON reader with a schema only specifying fields `a` and `b`
- JSON writer with a schema specifying fields `a`, `b`, and `c` (`c` defaulting to `null`)
It feels like the expected result should be a JSON with the field `c` and a `null` value, because the reader would drop the field when reading the JSON, convert it into a record, and pass it to the writer.
If we agree on the above, then it may be easier to just override `nextRecord()` in `AbstractJsonRowRecordReader` and default to `nextRecord(true, true)`.
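The override idea can be sketched as follows. This is a hypothetical illustration with stand-in types (`RecordStub`, `AbstractReaderStub`, `JsonReaderStub`), not the real NiFi class hierarchy; it only shows how a JSON-specific no-arg `nextRecord()` would flip the `dropUnknownFields` default while leaving the two-argument method untouched.

```java
import java.io.IOException;

// Minimal stand-in for NiFi's Record (illustrative only).
class RecordStub { }

// Minimal stand-in for the RecordReader default behavior.
abstract class AbstractReaderStub {
    // Mirrors RecordReader.nextRecord(coerceTypes, dropUnknownFields).
    abstract RecordStub nextRecord(boolean coerceTypes, boolean dropUnknownFields)
            throws IOException;

    // Current framework-wide default keeps unknown fields.
    RecordStub nextRecord() throws IOException {
        return nextRecord(true, false);
    }
}

// Suggested change: the JSON reader overrides the no-arg method so that
// its configured schema is honored by default.
class JsonReaderStub extends AbstractReaderStub {
    boolean lastDropUnknownFields;

    @Override
    RecordStub nextRecord() throws IOException {
        return nextRecord(true, true);
    }

    @Override
    RecordStub nextRecord(boolean coerceTypes, boolean dropUnknownFields) {
        lastDropUnknownFields = dropUnknownFields;
        return new RecordStub();
    }
}

public class OverrideSketch {
    public static void main(String[] args) throws IOException {
        JsonReaderStub reader = new JsonReaderStub();
        reader.nextRecord();
        System.out.println(reader.lastDropUnknownFields); // true
    }
}
```

Callers that explicitly pass both arguments (as ValidateRecord does below) would be unaffected, since only the no-arg default changes.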
Thanks for summarizing the options and highlighting this alternative approach @pvillard31.
Based on your summary, I agree that changing the behavior of `nextRecord()` to set `dropUnknownFields` to `true` in `AbstractJsonRowRecordReader` seems like a better way forward.
Yeah, I think I would also agree that defaulting `dropUnknownFields` to `true` makes the most sense. One place I can think of where we need to be careful that we're doing the right thing is ValidateRecord, but that Processor is explicitly passing a value for the field:

```java
while ((record = reader.nextRecord(coerceTypes, false)) != null) {
    ...
}
```
And in any case in which we infer the schema, we should have an entry for every field that is found, because we infer the schema by scanning all records in the FlowFile first.
So I think it makes sense to drop unknown fields. It also is consistent with what the JavaDocs state will happen :) So I'm a +1 on that approach.
Thanks @markap14 - will update the PR shortly
Changing the default has impacts across many unit tests and I'll need to spend more time looking into this. I just pushed what I currently have and I switched this PR to draft mode.
Thanks for moving this forward @pvillard31. Reviewing the changes locally, I found that the `dropUnknownFields` check in `MapRecord` was blocking the record field from being updated. Removing that check seems to have yielded the expected behavior, but I am still evaluating the impact. I pushed a commit with that change, along with optimizing the field removal handling to avoid creating a deep copy of the `JsonNode` when that is not necessary.
Thanks @exceptionfactory @pvillard31 I think the changes here make a lot of sense. I did some testing locally with ConvertRecord and UpdateRecord and everything appears to work as expected. Also ran a quick performance comparison before/after these changes and saw roughly the same numbers so all good there. I'm a +1 on these changes.