Fix(yaml): Handle missing optional fields in JSON parsing
Fixes #35179
When using ReadFromPubSub with a schema in Beam YAML, the pipeline would fail with a KeyError if a field specified in the schema was missing from the incoming JSON message.
This commit fixes the issue by modifying the json_to_row function in apache_beam/yaml/json_utils.py. The direct dictionary access value[name] is replaced with value.get(name) to safely handle missing keys, returning None instead of raising an error.
The converters for array, map, and row types have also been made robust to handle None values, which can occur for missing optional fields of these complex types.
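The change described above can be illustrated with a small standalone sketch. It does not reproduce the actual code in `apache_beam/yaml/json_utils.py`; the helper names (`nullable`, `make_row_converter`, `make_array_converter`) are hypothetical, and rows are modelled as plain dicts purely for illustration.

```python
from typing import Any, Callable, Dict

Converter = Callable[[Any], Any]


def nullable(convert: Converter) -> Converter:
    """Wrap a converter so that None passes through instead of being converted."""
    return lambda value: None if value is None else convert(value)


def make_array_converter(element: Converter) -> Converter:
    # Hypothetical array converter: a missing (None) array stays None.
    return nullable(lambda values: [element(v) for v in values])


def make_row_converter(field_converters: Dict[str, Converter]) -> Converter:
    def convert(value: Dict[str, Any]) -> Dict[str, Any]:
        # value.get(name) yields None for an absent key, where value[name]
        # would raise KeyError.
        return {
            name: conv(value.get(name))
            for name, conv in field_converters.items()
        }

    return convert


to_row = make_row_converter({
    "id": nullable(int),
    "tags": make_array_converter(str),
})

print(to_row({"id": "7"}))  # "tags" is absent from the message
```

With direct indexing, the same message would fail on the `tags` lookup before any converter ran.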
@jonathaningram possible to validate this PR from your side? Feel free to review it as well.
Assigning reviewers:
R: @claudevdm for label python.
Note: If you would like to opt out of this review, comment assign to next reviewer.
Available commands:
- `stop reviewer notifications` - opt out of the automated review tooling
- `remind me after tests pass` - tag the comment author after tests pass
- `waiting on author` - shift the attention set back to the author (any comment or push by the author will return the attention set to the reviewers)
The PR bot will only process comments in the main thread (not review comments).
The code looks right to me and I tested a Dataflow Beam YAML pipeline without this change and with it. I can confirm the key error goes away with this change. I tested a missing Pub/Sub message field but not a missing attribute. I assume it works for both though.
Maybe there could be a corresponding docs update to go with this PR, e.g., tell users what happens if they have missing fields, but leave that with you to decide on.
Good idea. Added this to CHANGES.md.
@robertwb is it even possible to define an optional or non-optional field? As described in the original issue #35179, I couldn't work out how to specify "required-ness" on my schema.
I did notice in the tests in this PR that nullable was being used and I was going to comment on whether that's something that external pipeline authors are meant to be able to configure, but I removed my comment because I decided that maybe the nullable was just to help set up a schema for the tests (and I could prove the fix worked e2e in Dataflow).
By default, all properties in a json schema are optional; to declare them otherwise one uses the required field: https://json-schema.org/understanding-json-schema/reference/object#required which we respect in Beam: https://github.com/apache/beam/blob/release-2.65/sdks/python/apache_beam/yaml/json_utils.py#L67 .
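As a rough illustration of the `required` keyword, the sketch below classifies each property of a JSON schema as required or optional. The schema and the `field_requiredness` helper are made up for this example and are not taken from Beam.

```python
from typing import Any, Dict


def field_requiredness(json_schema: Dict[str, Any]) -> Dict[str, bool]:
    """Map each property name to True if the schema lists it under "required"."""
    required = set(json_schema.get("required", []))
    return {
        name: name in required
        for name in json_schema.get("properties", {})
    }


# Hypothetical schema: "id" must be present, "label" may be omitted.
schema = {
    "type": "object",
    "properties": {
        "id": {"type": "integer"},
        "label": {"type": "string"},
    },
    "required": ["id"],
}

print(field_requiredness(schema))
```

A schema with no `required` list treats every property as optional, which matches the default described above.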
This function takes as input a schema_pb2.FieldType and should respect whether the types in question are optional (though I'm not saying it might not be too strict now).
Drive-by comment: it'd be nice if there was a test ensuring we still fail for non-optional fields.
Good point. The original PR indeed did not enforce this requirement. I updated the code to check the required fields.
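A check of this kind might look like the sketch below: required fields must be present, while optional ones fall back to None. This is an assumption about the general shape only; the function name `message_to_row` and the choice of `ValueError` are hypothetical and do not reflect what the PR actually raises.

```python
from typing import Any, Dict


def message_to_row(message: Dict[str, Any],
                   json_schema: Dict[str, Any]) -> Dict[str, Any]:
    """Build a row dict, failing only when a required field is absent."""
    required = set(json_schema.get("required", []))
    missing = sorted(required - set(message))
    if missing:
        # Hypothetical error type and wording.
        raise ValueError(f"Missing required field(s): {', '.join(missing)}")
    # Optional fields that are absent become None.
    return {
        name: message.get(name)
        for name in json_schema.get("properties", {})
    }


schema = {
    "type": "object",
    "properties": {"id": {"type": "integer"}, "label": {"type": "string"}},
    "required": ["id"],
}

print(message_to_row({"id": 1}, schema))  # optional "label" is absent

try:
    message_to_row({"label": "x"}, schema)  # required "id" is absent
except ValueError as err:
    print(err)
```

A test along the lines suggested in the drive-by comment would assert both branches: the optional case returns None and the required case still fails.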
Codecov Report
All modified and coverable lines are covered by tests :white_check_mark:
Project coverage is 54.51%. Comparing base (cecfa61) to head (c2f2553). Report is 61 commits behind head on master.
Additional details and impacted files
@@ Coverage Diff @@
## master #35288 +/- ##
==========================================
Coverage 54.50% 54.51%
Complexity 1559 1559
==========================================
Files 1035 1036 +1
Lines 161595 161782 +187
Branches 1139 1139
==========================================
+ Hits 88084 88189 +105
- Misses 71380 71462 +82
Partials 2131 2131
| Flag | Coverage Δ | |
|---|---|---|
| python | 80.82% <100.00%> (-0.07%) | :arrow_down: |