Add examples to the schema transforms section of the Python programming guide (issue #21022)
Addresses #21022. The "Using Schema Transforms" section of the Python programming guide is missing examples. I've written examples for top-level fields, nested fields, and wildcards.
I was actually thinking the same: if we could represent it in a kind of pretty JSON format, it would make the input easier to understand.
Minor note: please update this PR's title to properly reflect what it's supposed to fix or improve. Thanks!
Hm, the Select transform works differently in Python. I think we should actually have different language for this entire section for Python.
In general our Python schema transforms are different, we might want an entirely new "Using Schema Transforms" section specific to Python that discusses the GroupBy and Select transforms.
@smeet07 would you want to take that on? Alternatively, a shorter-term solution could be to just hide this section for Python to avoid confusing users.
Hm the Select transform works differently in Python, I think we should actually have different language for this entire section for Python.
Could you elaborate, please? Is the syntax different, or does the whole thing work differently? Also, could we show top-level fields, nested fields, and wildcards using Map instead of Select?
In general our Python schema transforms are different, we might want an entirely new "Using Schema Transforms" section specific to Python that discusses the GroupBy and Select transforms.
@smeet07 would you want to take that on? Instead a shorter-term solution could be to just hide this section for Python to avoid confusing users.
I would like to take that on, but I am confused about how the Python schema transforms differ from the Java schema transforms. Also, are the syntax and use of the Select transform in section 6.5.1 (inferring schemas, https://beam.apache.org/documentation/programming-guide/#schemas) correct?
Could you elaborate please? Is the syntax different or the whole working different?
Java's Select transform just allows projecting fields - it allows users to select fields by name or ID, possibly with nested fields separated by '.'.
In Python, the Select transform does allow this style:
beam.Select("userId", "eventId")
But it also allows users to declare new fields with arbitrary expressions using lambdas (the style you've used in your examples):
beam.Select(computedField=lambda row: row.userId + row.eventId)
The styles can be mixed and matched too:
beam.Select("userId", computedField=lambda row: row.userId + row.eventId)
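To make the three call styles concrete without requiring a Beam installation, here is a minimal plain-Python model of the projection they describe. The `select` helper and the sample rows are hypothetical and exist only for illustration; this is a sketch of the semantics as described in this comment, not Beam's implementation.

```python
from types import SimpleNamespace


def select(rows, *field_names, **computed_fields):
    """Plain-Python model of the two selection styles; not Beam's implementation."""
    projected = []
    for row in rows:
        # Style 1: copy the named fields through unchanged.
        out = {name: getattr(row, name) for name in field_names}
        # Style 2: evaluate each callable against the whole input row.
        for name, fn in computed_fields.items():
            out[name] = fn(row)
        projected.append(out)
    return projected


# Hypothetical input rows, reusing the field names from the comment above.
events = [
    SimpleNamespace(userId=1, eventId=10),
    SimpleNamespace(userId=2, eventId=20),
]

# Field-name style only.
by_name = select(events, "userId", "eventId")

# Mixed style: one field selected by name, one computed by a lambda.
mixed = select(
    events, "userId", computedField=lambda row: row.userId + row.eventId
)
```

In this model the name-based arguments are pure projection (the part shared with Java), while the keyword arguments are the Python-only extension.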
Perhaps we could have common documentation for both Java and Python that just discusses the field name selection style (and uses it in Python examples, rather than lambdas). Then we could add additional documentation for the lambda style at a later date.
The only gotcha is that I'm pretty sure the nested field syntax is not implemented in Python today. We could file an issue for that and link to it from the Programming Guide though.
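For reference, the dotted nested-field style mentioned above could be modeled as follows. This is only a plain-Python sketch: `get_nested` and `select_nested` are hypothetical helpers, the sample field names are made up, and naming each output field after the last path segment is an assumption made for the example. None of this is part of the Python SDK.

```python
from functools import reduce
from types import SimpleNamespace


def get_nested(row, path):
    """Resolve a dotted path such as 'a.b.c' by walking attributes left to right."""
    return reduce(getattr, path.split("."), row)


def select_nested(rows, *paths):
    """Project each row onto the given dotted paths (hypothetical helper)."""
    # Assumption: the output field is named after the last path segment.
    return [
        {path.rsplit(".", 1)[-1]: get_nested(row, path) for path in paths}
        for row in rows
    ]


# Hypothetical row with one nested record.
purchase = SimpleNamespace(
    userId="alice",
    shippingAddress=SimpleNamespace(postCode="94110", city="San Francisco"),
)

result = select_nested([purchase], "userId", "shippingAddress.postCode")
```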
Oh okay, got it. That's what I was wondering: whether it was nested fields and wildcards that didn't work in Python. I guess we could remove this section for Python for now (Java examples are already present) and create a new section that explains how arbitrary expressions work in Python using lambdas.
That sounds good to me. It would be nice to have parity in Python's Select with a common syntax at some point too. I filed #23275 to track this.
Thanks for your patience @smeet07! I appreciate the contribution :)
No problem! Should I remove this section from Python for now, or should we wait for the new feature and just add the examples later?
Let's hide it for now to avoid confusing users
alright on it
@TheNeuralBit every example in section 6.6 (maps, grouping aggregations, joins, complex joins) uses nested fields. Should I keep the whole thing specific to Java?
There was no language tag on each paragraph, so the paragraphs were shown regardless of the language selected. I thought that by adding {{< paragraph class="language-java" >}} a paragraph would be visible only when the selected language is Java. I don't know why it didn't work, because the paragraphs above that were visible only for Java had Java tags on them.
Also, yes, we'll add Java tags for now, and when the Python syntax is ready we'll add the Python tags.
I'll look more into paragraph tags and see why it didn't work
@yeandy I've added paragraphs for Python saying the support hasn't been developed yet. Could you check whether there are visual changes or not?
The Website_Stage_GCS precommit builds the website and stages it so we can preview the results. You can find a link to the site when you click "Details" next to that check. It is here: http://apache-beam-website-pull-requests.storage.googleapis.com/23224/index.html
The section you're editing looks like:
Can you also add text for the Go SDK? It shows up with empty headings right now.
Oh thanks, I didn't know that. I'll add text for Python and Go wherever the examples can't be added yet (which is the case for most examples in section 6.6).
@TheNeuralBit in section 6.6.2 of the programming guide, under grouping aggregations, the Java examples use the Group transform for selecting multiple fields. Should we show how Select can be used in Python for multiple fields here?
Also, has support for joins and complex joins been developed yet for Python and Go?
@TheNeuralBit in section 6.6.2 of programming guide under grouping aggregations, they have used GROUP transform in java for selecting multiple fields, should we show how select in python can be used here for multiple fields?
I think the GroupBy transform is the best analogue in Python. There are some usage examples here
Also has support for joins and complex joins been developed yet for python and GO?
No, neither Python nor Go has a high-level Join transform.
@TheNeuralBit among the combine functions for aggregating fields, there are sum and MeanCombineFn. Are there any others I should know of?
Any CombineFn implementation, like you would use in the Combine transform, can be used to aggregate fields.
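As a rough illustration of what "any CombineFn" means here: a CombineFn is defined by four methods (`create_accumulator`, `add_input`, `merge_accumulators`, `extract_output`). The sketch below follows that shape in plain Python so it runs without Beam installed. `MeanFn` deliberately does not subclass `beam.CombineFn`, and `aggregate_field` is a hypothetical single-process driver, not the GroupBy transform. The sample rows are made up.

```python
from collections import defaultdict


class MeanFn:
    """Follows the four-method CombineFn shape; deliberately not a Beam subclass."""

    def create_accumulator(self):
        return (0.0, 0)  # (running sum, count)

    def add_input(self, accumulator, value):
        total, count = accumulator
        return total + value, count + 1

    def merge_accumulators(self, accumulators):
        # A runner would use this to combine partial results from different workers.
        totals, counts = zip(*accumulators)
        return sum(totals), sum(counts)

    def extract_output(self, accumulator):
        total, count = accumulator
        return total / count if count else float("nan")


def aggregate_field(rows, key, field, combine_fn):
    """Group dict rows by `key` and fold `field` through `combine_fn` (hypothetical)."""
    accumulators = defaultdict(combine_fn.create_accumulator)
    for row in rows:
        accumulators[row[key]] = combine_fn.add_input(
            accumulators[row[key]], row[field]
        )
    return {k: combine_fn.extract_output(acc) for k, acc in accumulators.items()}


# Hypothetical sample rows.
purchases = [
    {"userId": "alice", "cost": 10.0},
    {"userId": "alice", "cost": 30.0},
    {"userId": "bob", "cost": 5.0},
]

mean_cost = aggregate_field(purchases, "userId", "cost", MeanFn())
```

Any other object with the same four methods (a max, a count, a top-N) could be passed to the driver in place of `MeanFn`.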
@TheNeuralBit I added some Python code for the Select and GroupBy transforms, but it doesn't show up when Python is selected; instead it shows up when Java is selected. Could you identify where I might have gone wrong?
@TheNeuralBit what changes should I make next?
R: @TheNeuralBit
Stopping reviewer notifications for this pull request: review requested by someone other than the bot, ceding control
retest this please
The console output of the whitespace check shows this
but as far as I can tell there is no blank line or any whitespace in