
Beam SQL Extension raised an error when the input row contained iterable fields

Open yyfhust opened this issue 1 year ago • 1 comments


Related to: https://github.com/apache/beam/issues/31118

While using Beam SQL in our pipeline, we hit an exception. When the input row contains fields of iterable types, the query fails whether or not those fields appear in the SQL filter condition. The cause is that the Beam SQL extension builds the output row schema from the input schema and does not currently support iterable types.

Consider the following example:

Given an inputRow in the schema:

field1: String
field2: Integer
field3: Array<String>
field4: ITERABLE

And any Beam SQL condition, such as field2 > 1 or even 1 = 1

The pipeline always fails with the following error: Exception in thread "main" java.lang.UnsupportedOperationException: Unable to get ITERABLE at org.apache.beam.sdk.extensions.sql.impl.rel.BeamCalcRel$InputGetterImpl.getBeamField(BeamCalcRel.java:603).
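
The failure mode can be modelled in plain Java. This is a simplified sketch and not Beam's code: the class, enum, and method names are hypothetical, and it assumes, as described above, that every input field is read through a type switch even when the condition never refers to it. Under that assumption, a single ITERABLE field is enough to fail any query.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Simplified model, NOT Beam's actual code: all names here are hypothetical.
class UnusedFieldFailureSketch {
  enum TypeName { STRING, INT32, ARRAY, ITERABLE }

  // Reads one field through a type switch that has no ITERABLE branch.
  static Object getField(TypeName type, Object value) {
    switch (type) {
      case STRING:
      case INT32:
        return value;
      case ARRAY:
        return new ArrayList<Object>((List<?>) value);
      default:
        throw new UnsupportedOperationException("Unable to get " + type);
    }
  }

  // Copies every input field to the output row, whether or not the
  // filter condition refers to it (the assumption stated above).
  static List<Object> project(List<TypeName> schema, List<Object> row) {
    List<Object> out = new ArrayList<>();
    for (int i = 0; i < schema.size(); i++) {
      out.add(getField(schema.get(i), row.get(i)));
    }
    return out;
  }

  public static void main(String[] args) {
    List<TypeName> schema =
        Arrays.asList(TypeName.STRING, TypeName.INT32, TypeName.ARRAY, TypeName.ITERABLE);
    List<Object> row = Arrays.asList("a", 2, Arrays.asList("x"), Arrays.asList("y"));
    try {
      project(schema, row);
    } catch (UnsupportedOperationException e) {
      System.out.println(e.getMessage()); // prints: Unable to get ITERABLE
    }
  }
}
```

Dropping the ITERABLE field from the schema lets the same projection succeed, which matches the observation that the condition itself (field2 > 1 or 1 = 1) is not what triggers the error.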

(My first contribution to Beam, kindly advise how to test lol)




yyfhust, Apr 26 '24 14:04

Checks are failing. Will not request review until checks are succeeding. If you'd like to override that behavior, comment assign set of reviewers

github-actions[bot], Apr 26 '24 15:04

Thanks for the fix. While this sounds like a valid fix, would you mind sharing the full stack trace? My understanding is that there are two issues here:

(1) Beam SQL does not filter out fields that are not used

(2) Iterable isn't supported by Beam SQL

A full stack trace will be helpful for investigating (1) and a possible optimization.

The Iterable field type, introduced in #10003, was meant to be different from ARRAY. However, the fix here treats it the same as ARRAY. That may have performance implications and may not work for large iterables. Maybe add a comment or a TODO here.

Until it's optimized for Iterable, one can just write

case ARRAY:
case ITERABLE:
    return ....

so there is no need to duplicate the line.
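
A minimal, self-contained sketch of that shared-branch pattern, using hypothetical names (`SharedBranchSketch`, `TypeName`, `convert`) rather than Beam's actual classes. It assumes the shared handling simply copies the elements into a list, which is the kind of eager materialization the TODO is meant to flag for large iterables.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical stand-in for a field-type switch; not Beam's actual code.
class SharedBranchSketch {
  enum TypeName { STRING, ARRAY, ITERABLE }

  static Object convert(TypeName type, Object value) {
    switch (type) {
      case ARRAY:
      case ITERABLE: {
        // TODO: this copies every element into memory, which may not
        // suit large iterables; revisit once ITERABLE gets its own path.
        List<Object> copy = new ArrayList<>();
        for (Object element : (Iterable<?>) value) {
          copy.add(element);
        }
        return copy;
      }
      case STRING:
        return value;
      default:
        throw new UnsupportedOperationException("Unable to get " + type);
    }
  }

  public static void main(String[] args) {
    // An Iterable that is not a List, to show both labels share one branch.
    Iterable<String> values = () -> Arrays.asList("a", "b").iterator();
    System.out.println(convert(TypeName.ITERABLE, values)); // prints [a, b]
  }
}
```

Stacking the two case labels keeps a single code path to maintain, and the TODO marks where a streaming-friendly ITERABLE branch could later be split out.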

Also, there are switch (fieldType.getTypeName()) branches in several places in BeamCalcRel. Could they all be fixed for consistency?

Abacn, May 08 '24 19:05

@Abacn Sorry, I missed your comments. I was busy with work, completely forgot about this issue, and did not follow up. I just saw that https://github.com/apache/beam/pull/31588/files fixed this issue.

I will close this PR.

yyfhust, Jun 13 '24 13:06