adding examples in schema transforms section of programming guide for python (changes for issue #21022)

Open smeet07 opened this issue 2 years ago • 21 comments

Addresses #21022. In the section "Using Schema Transforms" of the Python programming guide, there are missing examples. I've written the examples for top-level fields, nested fields, and wildcards.

smeet07 avatar Sep 13 '22 22:09 smeet07

I was actually thinking about the same thing. If we could represent it in a kind of pretty JSON format, it would make the input easier to understand.

smeet07 avatar Sep 14 '22 20:09 smeet07

Minor note: please update this PR's title to properly reflect what it's supposed to fix/improve. Thanks!

aromanenko-dev avatar Sep 15 '22 15:09 aromanenko-dev

Hm, the Select transform works differently in Python. I think we should actually have different language for this entire section for Python.

TheNeuralBit avatar Sep 15 '22 16:09 TheNeuralBit

In general, our Python schema transforms are different. We might want an entirely new "Using Schema Transforms" section specific to Python that discusses the GroupBy and Select transforms.

@smeet07 would you want to take that on? Alternatively, a shorter-term solution could be to just hide this section for Python to avoid confusing users.

TheNeuralBit avatar Sep 15 '22 16:09 TheNeuralBit

Hm, the Select transform works differently in Python. I think we should actually have different language for this entire section for Python.

Could you elaborate, please? Is the syntax different, or does it work differently altogether? Also, could we show how top-level fields, nested fields, and wildcards work using Map instead of Select?

smeet07 avatar Sep 15 '22 18:09 smeet07

In general, our Python schema transforms are different. We might want an entirely new "Using Schema Transforms" section specific to Python that discusses the GroupBy and Select transforms.

@smeet07 would you want to take that on? Alternatively, a shorter-term solution could be to just hide this section for Python to avoid confusing users.

I would like to take that on, but I am confused about how the Python schema transforms differ from the Java schema transforms. Also, are the syntax and use of the Select transform in section 6.5.1 (inferring schemas) correct? https://beam.apache.org/documentation/programming-guide/#schemas

smeet07 avatar Sep 15 '22 18:09 smeet07

Could you elaborate, please? Is the syntax different, or does it work differently altogether?

Java's Select transform just allows projecting fields - it allows users to select fields by name or ID, possibly with nested fields separated by '.'.

In Python, the Select transform does allow this style:

beam.Select("userId", "eventId")

But it also allows users to declare new fields with arbitrary expressions using lambdas (the style you've used in your examples):

beam.Select(computedField=lambda row: row.userId + row.eventId)

The styles can be mixed and matched too:

beam.Select("userId", computedField=lambda row: row.userId + row.eventId)

Perhaps we could have common documentation for both Java and Python that just discusses the field name selection style (and uses it in Python examples, rather than lambdas). Then we could add additional documentation for the lambda style at a later date.

The only gotcha is that I'm pretty sure the nested field syntax is not implemented in Python today. We could file an issue for that and link to it from the Programming Guide though.

TheNeuralBit avatar Sep 16 '22 13:09 TheNeuralBit

Oh okay, got it. That's what I was wondering: whether it was nested fields and wildcards that didn't work in Python. I guess we could remove this section for Python for now (Java examples are already present) and create a new section that explains how arbitrary expressions work in Python using lambdas.

smeet07 avatar Sep 16 '22 14:09 smeet07

Oh okay, got it. That's what I was wondering: whether it was nested fields and wildcards that didn't work in Python. I guess we could remove this section for Python for now (Java examples are already present) and create a new section that explains how arbitrary expressions work in Python using lambdas.

That sounds good to me. It would be nice to have parity in Python's Select with a common syntax at some point too. I filed #23275 to track this.

TheNeuralBit avatar Sep 16 '22 15:09 TheNeuralBit

Thanks for your patience @smeet07! I appreciate the contribution :)

TheNeuralBit avatar Sep 16 '22 15:09 TheNeuralBit

No problem! Should I remove this section from Python for now, or should we wait for the new feature and just add the examples later?

smeet07 avatar Sep 16 '22 15:09 smeet07

Let's hide it for now to avoid confusing users

TheNeuralBit avatar Sep 16 '22 16:09 TheNeuralBit

alright on it

smeet07 avatar Sep 16 '22 17:09 smeet07

@TheNeuralBit every example in section 6.6 (maps, grouping aggregations, joins, complex joins) uses nested fields. Should I keep the whole thing specific to Java?

smeet07 avatar Sep 16 '22 18:09 smeet07

There was no language tag on each paragraph, so the paragraph was shown irrespective of the language selected. I thought that by adding {{< paragraph class="language-java" >}} the paragraph would only be visible when the selected language is Java. I don't know why it didn't work, because the paragraphs above that were only visible for Java had Java tags in them.

Also, yes, we'll add Java tags for now, and when the Python syntax is ready we'll add the Python tags.

smeet07 avatar Sep 19 '22 19:09 smeet07

I'll look more into paragraph tags and see why it didn't work

smeet07 avatar Sep 19 '22 19:09 smeet07

@yeandy I've added paragraphs for Python saying the support hasn't been developed yet. Could you check whether there are visual changes or not?

smeet07 avatar Sep 19 '22 21:09 smeet07

@yeandy I've added paragraphs for Python saying the support hasn't been developed yet. Could you check whether there are visual changes or not?

The Website_Stage_GCS precommit builds the website and stages it so we can preview the results. You can find a link to the site when you click "Details" next to that check. It is here: http://apache-beam-website-pull-requests.storage.googleapis.com/23224/index.html

The section you're editing looks like: image

Can you also add text for the Go SDK? It shows up with empty headings right now.

TheNeuralBit avatar Sep 20 '22 18:09 TheNeuralBit

Oh thanks, I didn't know that. I'll add text for Python and Go wherever the examples can't be added yet (which is the case for most examples in section 6.6).

smeet07 avatar Sep 21 '22 13:09 smeet07

@TheNeuralBit in section 6.6.2 of the programming guide, under grouping aggregations, the Group transform is used in Java for selecting multiple fields. Should we show how Select in Python can be used here for multiple fields?

smeet07 avatar Sep 21 '22 14:09 smeet07

Also, has support for joins and complex joins been developed yet for Python and Go?

smeet07 avatar Sep 21 '22 14:09 smeet07

@TheNeuralBit in section 6.6.2 of the programming guide, under grouping aggregations, the Group transform is used in Java for selecting multiple fields. Should we show how Select in Python can be used here for multiple fields?

I think the GroupBy transform is the best analogue in Python. There are some usage examples here

Also, has support for joins and complex joins been developed yet for Python and Go?

No, neither Python nor Go has a high-level Join transform.

TheNeuralBit avatar Sep 23 '22 22:09 TheNeuralBit

@TheNeuralBit Among the combine functions for aggregating fields, there are sum and MeanCombineFn. Are there any others I should know of?

smeet07 avatar Sep 30 '22 20:09 smeet07

@TheNeuralBit Among the combine functions for aggregating fields, there are sum and MeanCombineFn. Are there any others I should know of?

Any CombineFn implementation, like you would use in the Combine transform, can be used to aggregate fields.

TheNeuralBit avatar Sep 30 '22 21:09 TheNeuralBit

@TheNeuralBit I added some Python code for the Select and GroupBy transforms, but it doesn't show up when Python is selected; instead it shows up after selecting Java. Could you help me identify where I might have gone wrong?

smeet07 avatar Sep 30 '22 22:09 smeet07

@TheNeuralBit what changes should I do next?

smeet07 avatar Oct 15 '22 18:10 smeet07

R: @TheNeuralBit

smeet07 avatar Oct 25 '22 15:10 smeet07

Stopping reviewer notifications for this pull request: review requested by someone other than the bot, ceding control

github-actions[bot] avatar Oct 25 '22 17:10 github-actions[bot]

retest this please

TheNeuralBit avatar Oct 27 '22 23:10 TheNeuralBit

The console output of the whitespace check shows this: image. But as far as I can tell, there is no blank line or trailing whitespace there: image

smeet07 avatar Oct 27 '22 23:10 smeet07