Fix(yaml): Handle missing optional fields in JSON parsing

Open liferoad opened this issue 5 months ago • 2 comments

Fixes #35179

When using ReadFromPubSub with a schema in Beam YAML, the pipeline would fail with a KeyError if a field specified in the schema was missing from the incoming JSON message.

This commit fixes the issue by modifying the json_to_row function in apache_beam/yaml/json_utils.py. The direct dictionary access value[name] is replaced with value.get(name) to safely handle missing keys, returning None instead of raising an error.
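The difference between the two access patterns can be sketched in plain Python. This is a minimal illustration under stated assumptions, not the code in json_utils.py: the helper names `strict_converter` and `lenient_converter` and the field names are hypothetical.

```python
from typing import Any, Callable, Dict, List

Converter = Callable[[Dict[str, Any]], Dict[str, Any]]


def strict_converter(field_names: List[str]) -> Converter:
    # Hypothetical helper: direct indexing raises KeyError when a
    # schema field is absent from the decoded JSON message.
    return lambda value: {name: value[name] for name in field_names}


def lenient_converter(field_names: List[str]) -> Converter:
    # Hypothetical helper: dict.get returns None for an absent key
    # instead of raising.
    return lambda value: {name: value.get(name) for name in field_names}


fields = ["id", "label"]
message = {"id": 7}  # "label" is missing from the incoming JSON

try:
    strict_converter(fields)(message)
    strict_failed = False
except KeyError:
    strict_failed = True

lenient_result = lenient_converter(fields)(message)
```

With the lenient variant, the missing field simply becomes None in the resulting record.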

The converters for array, map, and row types have also been made robust to handle None values, which can occur for missing optional fields of these complex types.
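One way to make nested converters tolerate None is to short-circuit before touching the value. The sketch below assumes a converter-factory style; `array_converter` and `map_converter` are hypothetical names and not necessarily how json_utils.py is structured.

```python
from typing import Any, Callable, Dict, List, Optional


def array_converter(
    element_converter: Callable[[Any], Any]
) -> Callable[[Optional[List[Any]]], Optional[List[Any]]]:
    def convert(value: Optional[List[Any]]) -> Optional[List[Any]]:
        # A missing optional array arrives as None; pass it through
        # rather than trying to iterate over it.
        if value is None:
            return None
        return [element_converter(element) for element in value]

    return convert


def map_converter(
    value_converter: Callable[[Any], Any]
) -> Callable[[Optional[Dict[str, Any]]], Optional[Dict[str, Any]]]:
    def convert(value: Optional[Dict[str, Any]]) -> Optional[Dict[str, Any]]:
        # Same guard for maps: None has no .items() to iterate.
        if value is None:
            return None
        return {key: value_converter(item) for key, item in value.items()}

    return convert


to_int_list = array_converter(int)
to_str_map = map_converter(str)
```

Without the guards, a missing array or map field would fail with a TypeError or AttributeError as soon as the converter tried to iterate over None.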


Thank you for your contribution! Follow this checklist to help us incorporate your contribution quickly and easily:

  • [ ] Mention the appropriate issue in your description (for example: addresses #123), if applicable. This will automatically add a link to the pull request in the issue. If you would like the issue to automatically close on merging the pull request, comment fixes #<ISSUE NUMBER> instead.
  • [ ] Update CHANGES.md with noteworthy changes.
  • [ ] If this contribution is large, please file an Apache Individual Contributor License Agreement.

See the Contributor Guide for more tips on how to make the review process smoother.

To check the build health, please visit https://github.com/apache/beam/blob/master/.test-infra/BUILD_STATUS.md

See CI.md for more information about GitHub Actions CI or the workflows README to see a list of phrases to trigger workflows.

liferoad avatar Jun 14 '25 19:06 liferoad

@jonathaningram possible to validate this PR from your side? Feel free to review it as well.

liferoad avatar Jun 14 '25 19:06 liferoad

Assigning reviewers:

R: @claudevdm for label python.

Note: If you would like to opt out of this review, comment assign to next reviewer.

Available commands:

  • stop reviewer notifications - opt out of the automated review tooling
  • remind me after tests pass - tag the comment author after tests pass
  • waiting on author - shift the attention set back to the author (any comment or push by the author will return the attention set to the reviewers)

The PR bot will only process comments in the main thread (not review comments).

github-actions[bot] avatar Jun 14 '25 21:06 github-actions[bot]

The code looks right to me. I tested a Dataflow Beam YAML pipeline both without this change and with it, and I can confirm the KeyError goes away with this change. I tested a missing Pub/Sub message field but not a missing attribute; I assume it works for both.

Maybe there could be a corresponding docs update to go with this PR, e.g., telling users what happens when fields are missing, but I'll leave that for you to decide.

Good idea. Added this to CHANGES.md.

liferoad avatar Jun 16 '25 14:06 liferoad

@robertwb is it even possible to define an optional or non-optional field? As described in the original issue #35179, I couldn't work out how to specify "required-ness" on my schema.

I did notice in the tests in this PR that nullable was being used and I was going to comment on whether that's something that external pipeline authors are meant to be able to configure, but I removed my comment because I decided that maybe the nullable was just to help set up a schema for the tests (and I could prove the fix worked e2e in Dataflow).

jonathaningram avatar Jun 16 '25 23:06 jonathaningram

By default, all properties in a JSON Schema are optional; to declare them otherwise, one uses the required field: https://json-schema.org/understanding-json-schema/reference/object#required which we respect in Beam: https://github.com/apache/beam/blob/release-2.65/sdks/python/apache_beam/yaml/json_utils.py#L67 .

This function takes a schema_pb2.FieldType as input and should respect whether the types in question are optional (though I'm not saying it might not be too strict now).
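To illustrate the JSON Schema convention described here, the sketch below derives the optional properties of an object schema from its `required` list. The schema and the helper `optional_properties` are hypothetical examples and do not reproduce Beam's own conversion logic.

```python
from typing import Any, Dict, Set

# Hypothetical schema: "id" is required, "label" is optional because
# it is not listed under "required".
schema: Dict[str, Any] = {
    "type": "object",
    "properties": {
        "id": {"type": "integer"},
        "label": {"type": "string"},
    },
    "required": ["id"],
}


def optional_properties(json_schema: Dict[str, Any]) -> Set[str]:
    # Properties absent from "required" are optional; if "required"
    # is omitted entirely, every property is optional.
    required = set(json_schema.get("required", []))
    return set(json_schema.get("properties", {})) - required


optional = optional_properties(schema)
```

Under this convention, a pipeline author would control "required-ness" by listing field names in `required` rather than by setting a per-field flag.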

robertwb avatar Jun 16 '25 23:06 robertwb

Drive-by comment: it'd be nice if there were a test ensuring we still fail for non-optional fields.

Good point. The original PR indeed did not enforce this requirement. I updated the code to check the required fields.
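A check of this kind, together with the test the reviewer asked for, might look like the sketch below. It is an assumption-laden illustration: `make_row_converter` is a hypothetical helper, and raising ValueError is a choice made for the example, not necessarily what the PR does.

```python
from typing import Any, Callable, Dict, List, Set


def make_row_converter(
    field_names: List[str], required: Set[str]
) -> Callable[[Dict[str, Any]], Dict[str, Any]]:
    def convert(value: Dict[str, Any]) -> Dict[str, Any]:
        # Fail loudly when a required field is absent from the message.
        missing = [n for n in field_names if n in required and n not in value]
        if missing:
            raise ValueError(f"Missing required fields: {missing}")
        # Optional fields that are absent fall back to None.
        return {name: value.get(name) for name in field_names}

    return convert


convert = make_row_converter(["id", "label"], required={"id"})

# Missing optional field: tolerated.
row = convert({"id": 1})

# Missing required field: still an error.
try:
    convert({"label": "x"})
    required_failed = False
except ValueError:
    required_failed = True
```

The two calls at the end correspond to the two behaviors worth covering in tests: optional fields may be absent, required ones may not.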

liferoad avatar Jun 16 '25 23:06 liferoad

Codecov Report

All modified and coverable lines are covered by tests :white_check_mark:

Project coverage is 54.51%. Comparing base (cecfa61) to head (c2f2553). Report is 61 commits behind head on master.

Additional details and impacted files
@@            Coverage Diff             @@
##             master   #35288    +/-   ##
==========================================
  Coverage     54.50%   54.51%            
  Complexity     1559     1559            
==========================================
  Files          1035     1036     +1     
  Lines        161595   161782   +187     
  Branches       1139     1139            
==========================================
+ Hits          88084    88189   +105     
- Misses        71380    71462    +82     
  Partials       2131     2131            
Flag Coverage Δ
python 80.82% <100.00%> (-0.07%) :arrow_down:

codecov[bot] avatar Jun 19 '25 20:06 codecov[bot]