beam icon indicating copy to clipboard operation
beam copied to clipboard

Generate external transform wrappers using a script

Open ahmedabu98 opened this issue 1 year ago • 22 comments

Implementing a script that generates wrappers for external SchemaTransforms, according to Option #3 in the following design doc: https://s.apache.org/autogen-wrappers

The script's workflow takes place in setup.py, which can be invoked for local setup or for building the SDK for a Beam release. Files are generated in a subdirectory that is ignored by git. From there, the wrappers can be imported into relevant __init__.py files.

Wrappers are generated along with any documentation provided from the underlying SchemaTransform, and is in compliance with existing linting and pydoc rules.

Includes documentation for how to use the script. Includes unit tests for different parts of the script.

Also adding a gradle command ./gradlew generateExternalTransformWrappers to build the configs and generate wrappers from scratch. This takes care of building the relevant expansion jars beforehand.

Also adding a PreCommit that generates the transform config from scratch and compares it with the existing. This serves to indicate whether a change will render the existing config out of sync. To resolve, one would re-generate the config (with ./gradlew generateExternalTransformWrappers) and commit the changes.

With the expansion service YAML config, the following transform YAML config is generated:

Transform YAML config

- default_service: sdks:java:io:expansion-service:shadowJar
  description: 'Outputs a PCollection of Beam Rows, each containing a single INT64
    number called "value". The count is produced from the given "start"value and either
    up to the given "end" or until 2^63 - 1.

    To produce an unbounded PCollection, simply do not specify an "end" value. Unbounded
    sequences can specify a "rate" for output elements.

    In all cases, the sequence of numbers is generated in parallel, so there is no
    inherent ordering between the generated values'
  destinations:
    python: apache_beam/io
  fields:
    end:
      description: The maximum number to generate (exclusive). Will be an unbounded
        sequence if left unspecified.
      nullable: true
      type: numpy.int64
    rate:
      description: Specifies the rate to generate a given number of elements per a
        given number of seconds. Applicable only to unbounded sequences.
      nullable: true
      type: Row(seconds=typing.Union[numpy.int64, NoneType], elements=<class 'numpy.int64'>)
    start:
      description: The minimum number to generate (inclusive).
      nullable: false
      type: numpy.int64
  identifier: beam:schematransform:org.apache.beam:generate_sequence:v1
  name: GenerateSequence

From this config, external transform wrappers are created and stored in appropriate modules. For example, our config gives us the following apache_beam/transforms/xlang/io.py file:

GenerateSequence wrapper
# NOTE: This file contains autogenerated external transform(s)
# and should not be edited by hand.
# Refer to gen_xlang_wrappers.py for more info.

"""Cross-language transforms in this module can be imported from the
:py:mod:`apache_beam.io` package."""

# pylint:disable=line-too-long

from apache_beam.transforms.external import BeamJarExpansionService
from apache_beam.transforms.external_transform_provider import ExternalTransform


class GenerateSequence(ExternalTransform):
  """
  Outputs a PCollection of Beam Rows, each containing a single INT64 number
  called "value". The count is produced from the given "start" value and either
  up to the given "end" or until 2^63 - 1.
  To produce an unbounded PCollection, simply do not specify an "end" value.
  Unbounded sequences can specify a "rate" for output elements.
  In all cases, the sequence of numbers is generated in parallel, so there is no
  inherent ordering between the generated values
  """
  identifier = "beam:schematransform:org.apache.beam:generate_sequence:v1"

  def __init__(self, start, end=None, rate=None, expansion_service=None):
    """
    :param start: (numpy.int64)
      The minimum number to generate (inclusive). 
    :param end: (numpy.int64)
      The maximum number to generate (exclusive). Will be an unbounded
      sequence if left unspecified. 
    :param rate: (Row(seconds=typing.Union[numpy.int64, NoneType], elements=<class 'numpy.int64'>))
      Specifies the rate to generate a given number of elements per a given
      number of seconds. Applicable only to unbounded sequences. 
    """
    self.default_expansion_service = BeamJarExpansionService(
        "sdks:java:io:expansion-service:shadowJar")
    super().__init__(
        start=start, end=end, rate=rate, expansion_service=expansion_service)

ahmedabu98 avatar Dec 20 '23 12:12 ahmedabu98

Assigning reviewers. If you would like to opt out of this review, comment assign to next reviewer:

R: @jrmccluskey for label python. R: @damccorm for label build.

Available commands:

  • stop reviewer notifications - opt out of the automated review tooling
  • remind me after tests pass - tag the comment author after tests pass
  • waiting on author - shift the attention set back to the author (any comment or push by the author will return the attention set to the reviewers)

The PR bot will only process comments in the main thread (not review comments).

github-actions[bot] avatar Dec 22 '23 01:12 github-actions[bot]

R: @robertwb @tvalentyn @chamikaramj

(manually requesting to make the review bot happy)

damccorm avatar Dec 26 '23 14:12 damccorm

Stopping reviewer notifications for this pull request: review requested by someone other than the bot, ceding control

github-actions[bot] avatar Dec 26 '23 14:12 github-actions[bot]

friendly ping @chamikaramj @robertwb

ahmedabu98 avatar Jan 08 '24 15:01 ahmedabu98

Thanks for looking @chamikaramj. I addressed these comments, PTAL

ahmedabu98 avatar Jan 17 '24 19:01 ahmedabu98

You can make the generated transforms more discoverable by importing them in __init.py__ files. that might generating an additional file(s) _external_transforms.py, and importing _external_transforms.py in __init__.py . That would make it easier for users to reference particular transforms without importing them from _et modules.

tvalentyn avatar Jan 24 '24 02:01 tvalentyn

So would users be expected to import these transforms from the _et paths directly? One transform per module (which is not how the others are typically arranged, e.g. with the Read and the Write in the same place)? I am also a bit wary of scattering these files all over the place (for both tooling and human reasons).

What if instead they were in a single subdirectory (like with protos) and on import this would then dynamically insert any specifically placed transforms into the right modules (which could include modules that already exist)?

robertwb avatar Jan 25 '24 01:01 robertwb

What if instead they were in a single subdirectory (like with protos)

I actually also thought about it and hesitated to suggest only since it felt a bit late in the game to do so.

So would users be expected to import these transforms from the _et paths directly

Having generated files in one directory would also require updating / maintaining only one __init__.py file to facilitate the imports.

tvalentyn avatar Jan 25 '24 23:01 tvalentyn

This seems to be under fairly active work, and doesn't feel quite like a release blocker (new functionality, not a regression, not a bug fix etc).

Is there a compelling reason this should be in the 2.54.0 release instead of getting more time to bake and be in the 2.55.0 release instead? (There's still time, as I have other cherry picks to make tommorrow, and other issues to triage).

lostluck avatar Jan 26 '24 00:01 lostluck

Thank you @lostluck, yes best to let it bake and get it in the next release

ahmedabu98 avatar Jan 26 '24 12:01 ahmedabu98

What if instead they were in a single subdirectory (like with protos)

I like this alternative more actually. I've made some changes to instead generate wrappers inside the apache_beam.transforms._external_transforms package. Inside there, each destination gets its own file (e.g. transforms with destination apache_beam/io will be written to apache_beam/transforms/_external_transforms/io.py).

Then we can make these transforms available to the apache_beam/io package by importing everything from _external_transforms/io.py.

@robertwb @tvalentyn let me know if this is what you had in mind

Having generated files in one directory would also require updating / maintaining only one init.py file to facilitate the imports.

Yeah I'd rather do this manually because I'm wary about altering existing files with a script.

I actually also thought about it and hesitated to suggest only since it felt a bit late in the game to do so.

Not too late! The feedback helps to get this in a good shape before we release it to users

ahmedabu98 avatar Jan 30 '24 16:01 ahmedabu98

Successful PreCommit run: https://github.com/ahmedabu98/beam/actions/runs/7714357594/job/21026416076

ahmedabu98 avatar Jan 30 '24 16:01 ahmedabu98

R: @tvalentyn R: @robertwb

Ready for another review!

ahmedabu98 avatar Feb 06 '24 21:02 ahmedabu98

i think my comments were addressed. LGTM from my side as long as tests pass.

tvalentyn avatar Feb 09 '24 00:02 tvalentyn

actually, i tried running gradlew generateExternalTransformWrappers in a clean environment and it failed.

I think the intent of that command should be to regenerate the yaml portion only, but not generate the python wrappers. Wrappers should be generated from the yaml config, when setup.py is called.

tvalentyn avatar Feb 09 '24 00:02 tvalentyn

you could consider extracting the portion that generates the yaml into a separate script. If you keep the script in one file that can be fine too, but then generating the yaml portion (from gradle) shouldn't require installing jinja.

tvalentyn avatar Feb 09 '24 00:02 tvalentyn

Running ./gradlew generateExternalTransformWrappers fails on my side because of this line in apache_beam.io: from apache_beam.transforms.xlang.io import *. It fails because the script initially cleans up the generated modules, then imports apache_beam to rediscover + regenerate the config. At import though, apache_beam.transforms.xlang.io doesn't exist because it's been previously cleaned up.

I think we need to put the import back in the try:, except: pass block?

ahmedabu98 avatar Feb 12 '24 18:02 ahmedabu98

you could consider extracting the portion that generates the yaml

Makes sense to have the gradle command only generate the config. Then each remote SDK can have its own command to generate wrappers from there (in the future we can sense to combine them all in one command if it makes sense).

We can still have the script in one file, I'll just move imports around to the relevant portions.

ahmedabu98 avatar Feb 12 '24 18:02 ahmedabu98

Ah, yes, that's right. Hopefully the file generation itself isn't that slow.

On Mon, Feb 12, 2024 at 3:02 PM Ahmed Abualsaud @.***> wrote:

@.**** commented on this pull request.

In sdks/python/setup.py https://github.com/apache/beam/pull/29834#discussion_r1486898981:

  •  # if exists, this directory will have at least its __init__.py file
    
  •  if (not os.path.exists(generated_transforms_dir) or
    
  •          len(os.listdir(generated_transforms_dir)) <= 1):
    
  •    message = 'External transform wrappers have not been generated '
    
  •    if not script_exists:
    
  •      message += 'and the generation script `gen_xlang_wrappers.py`'
    
  •    if not config_exists:
    
  •      message += 'and the standard external transforms config'
    
  •    message += ' could not be found'
    
  •    warnings.warn(message)
    
  •  else:
    
  •    warnings.warn(
    
  •        'Skipping external transform wrapper generation as they '
    
  •        'are already generated.')
    
  •  return
    
  • out = subprocess.run([

This will make pretty much every run of setup.py slow, right?

Not anymore, since we decided to decouple the generation steps. setup.py expects the transform config to already exist, so no expansion + discovery is done here. All it does is generate files based on the existing config.

— Reply to this email directly, view it on GitHub https://github.com/apache/beam/pull/29834#discussion_r1486898981, or unsubscribe https://github.com/notifications/unsubscribe-auth/AADWVAJPRS6PTJHJGTPTHE3YTKNPZAVCNFSM6AAAAABA4ZG2QSVHI2DSMVQWIX3LMV43YUDVNRWFEZLROVSXG5CSMV3GSZLXHMYTQNZWGUYTSNZSGA . You are receiving this because you were mentioned.Message ID: @.***>

robertwb avatar Feb 12 '24 23:02 robertwb

I think we need to put the import back in the try:, except: pass block?

That will work, alternatively, you can have dedicated functions: to generate config, and to generate wrappers, and make necessary imports only in functions where imports are required; you might have to add # pylint: disable=g-import-not-at-top.

tvalentyn avatar Feb 13 '24 23:02 tvalentyn

@tvalentyn this is already the case, the steps are in different functions. e.g. we're only importing apache_beam when using the function to generate transform configs. But this function is unusable in a clean setup (one that doesn't have generated modules) unless we make imports like from apache_beam.transforms.xlang.io import * optional.

ahmedabu98 avatar Feb 14 '24 15:02 ahmedabu98

The newly added precommit that performs a transform config sync check (beam_PreCommit_Xlang_Generated_Transforms) is running green on my fork: https://github.com/ahmedabu98/beam/actions/runs/7933231703/

Will merge after current tests pass

ahmedabu98 avatar Feb 16 '24 16:02 ahmedabu98