nifi
nifi copied to clipboard
NIFI-12674 Modified ValidateCSV to make the schema optional if a head…
…er is provided. Added validate on attribute option.
Summary
Made the schema optional for the ValidateCSV processor if a header is provided. In this case, only the structure of the CSV will validated, using the header to determine how many fields each line should have.
Additionally, a new validation strategy was implemented, Validate on Attribute. This works similar to the way ValidateXML works in which the value of a given attribute of a FlowFile will be treated as the contents of a CSV file. Validation will be done on that attribute and not on the content of the FlowFile.
I would also like to note that I made the stream a variable that can be assigned to either the FlowFile content or the value of the attribute. In doing this, I removed the need for an inner method, and thus all of the variables that previously needed to be Atomic References could now be regular variables.
Tracking
Please complete the following tracking steps prior to pull request creation.
Issue Tracking
- [X] Apache NiFi Jira issue created
Pull Request Tracking
- [X] Pull Request title starts with Apache NiFi Jira issue number, such as
NIFI-00000 - [X] Pull Request commit message starts with Apache NiFi Jira issue number, as such
NIFI-00000
Pull Request Formatting
- [X] Pull Request based on current revision of the
mainbranch - [ ] Pull Request refers to a feature branch with one commit containing changes
Verification
Please indicate the verification steps performed prior to pull request creation.
Build
- [X] Build completed using
mvn clean install -P contrib-check- [X] JDK 21
Licensing
- [ ] New dependencies are compatible with the Apache License 2.0 according to the License Policy
- [ ] New dependencies are documented in applicable
LICENSEandNOTICEfiles
Documentation
- [X] Documentation formatting appears as expected in rendered files
Any reason for not using ValidateRecord processor if the only requirement is to confirm that the data is valid CSV without any specific constraint?
@pvillard31 The ValidateRecord processor splits the input FlowFile into 2 Flowfiles, one for valid and another for invalid records. With the change to ValidateCSV, the whole file will be routed to either valid or invalid.
@exceptionfactory Can you please restart the failed job? It does not seem related to the changes. Thanks!
Any updates on reviewing this change?
There are merge conflicts that need to be resolved
@mattyb149 I've rebased against main. Thank You
Automated review is marking this PR as stale due to lack of updates in the past four months. This PR will be closed in 15 days if the stale label is not removed. This stale label and automated closure does not indicate a judgement of the PR, just lack of reviewer bandwidth and helps us keep the PR queue more manageable. If you would like this PR re-opened you can do so and a committer can remove the stale label. Or you can open a new PR. Try to help review other PRs to increase PR review bandwidth which in turn helps yours.
@mosermw noting that you removed the stale label, are you planning to review this pull request? Although this has had some activity, the lack involvement over four months seems to indicate that it should be closed, with the option to revisit if there is a positive comment on the Jira issue that a committer is planning to review and guide this to completion.
@mosermw noting that you removed the
stalelabel, are you planning to review this pull request?
It looks like Matt B. made some comments that should be addressed, along with the merge conflicts. I contacted the PR author to ask if they still want this change, and will be happy to work with them to finish this.
Thanks @mosermw!
@mosermw @exceptionfactory @mattyb149 Sorry about losing track of this. I've rebased and addressed the comments provided.
Reviewing ...
Thanks for making these changes @Freedom9339. One or two last things that I noticed.
When validating a CSV Source Attribute, we can set the "Validation Strategy = Line by line". When we do this, the flowfile content is replaced with the value of the CSV attribute. This is clearly not good! I cannot think of a use case where we want to allow "Validation Strategy = Line by line" on an attribute ... it should always validate the whole attribute and not change anything.
We can make the CSV Source Attribute depend on "Validation Strategy = FlowFile" but then we would have to defend ourselves from the case when someone enters a CSV Source Attribute and then changes the Validation Strategy. The CSV Source Attribute will disappear and I'm worried that it will still be set and we will still try to use it.
So, if using dependsOn() gets too complicated, just add something to customValidate to disallow CSV Source Attribute with "Validation Strategy = Line by line"
@mosermw I've added the depends on and made the validate attribute only show when using whole ff validation. I've added a check to make sure whole ff validation is being used before using the attribute as a source.
Thanks for sticking with this @Freedom9339. Everything looks goo now.
+1 merging.