
[Feature Request]: Develop a way to evolve the construction schema used by transforms safely

Open chamikaramj opened this issue 1 year ago • 4 comments

What would you like to happen?

We ran into several regressions where updates to the construction schema of I/O transforms broke transform upgrades via the TransformService (when upgrading from older Beam versions).

The pattern is: (1) We add a new field/property to the I/O transform (2) We update the schema in the corresponding IOTranslation class (for example, [1] [2]) (3) We update the schema.

If we do not take special mitigation actions, this might result in breakages during upgrade. This is because the Row objects we try to parse might actually come from older Beam versions where the new field does not exist.
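To illustrate the failure mode and one possible mitigation, here is a minimal sketch in plain Java. A `Map` stands in for the Beam `Row`, and `ConfigRowCompat`, `getOrDefault`, and the field names `topic` and `new_flag` are hypothetical names chosen for this example, not part of Beam. The idea is that the reader checks whether a field exists before reading it and falls back to a default when the row was written by an older version.

```java
import java.util.HashMap;
import java.util.Map;

/** Hypothetical helper; a plain Map plays the role of a construction config Row. */
class ConfigRowCompat {

  /**
   * Reads a field that may be missing from rows written by an older version.
   * Returns defaultValue when the field is absent or null.
   */
  static <T> T getOrDefault(
      Map<String, Object> row, String field, Class<T> type, T defaultValue) {
    Object value = row.get(field);
    if (value == null) {
      return defaultValue;
    }
    return type.cast(value);
  }

  public static void main(String[] args) {
    // Row as an older version would have written it: "new_flag" does not exist yet.
    Map<String, Object> oldRow = new HashMap<>();
    oldRow.put("topic", "events");

    // Row as the current version writes it.
    Map<String, Object> newRow = new HashMap<>(oldRow);
    newRow.put("new_flag", true);

    System.out.println(getOrDefault(oldRow, "new_flag", Boolean.class, false));
    System.out.println(getOrDefault(newRow, "new_flag", Boolean.class, false));
  }
}
```

A reader that accesses `new_flag` unconditionally would fail on `oldRow`; with the fallback, both rows parse.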

Example regressions/fixes: [3] [4]

Potential strategies to prevent similar issues in the future: (1) Develop utils to safely migrate the schema (2) More integration/compat tests (3) Move to proto
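As a rough idea of what strategy (1) could look like, the sketch below normalizes an incoming row to the current set of fields before any parsing happens. It again uses plain Java maps as a stand-in for Beam `Row`/`Schema`; `ConfigRowMigration`, `upgrade`, and the field names are hypothetical. Dropping fields that the current schema does not know is a simplification made for this sketch, not a recommendation.

```java
import java.util.LinkedHashMap;
import java.util.Map;

/** Hypothetical migration helper; Maps stand in for Beam Row and Schema. */
class ConfigRowMigration {

  /**
   * Returns a copy of the row that contains every field of the current schema.
   * Fields missing from the incoming row take the value from currentDefaults;
   * fields unknown to the current schema are dropped (a choice made for this sketch).
   */
  static Map<String, Object> upgrade(
      Map<String, Object> row, Map<String, Object> currentDefaults) {
    Map<String, Object> upgraded = new LinkedHashMap<>(currentDefaults);
    for (Map.Entry<String, Object> entry : row.entrySet()) {
      if (currentDefaults.containsKey(entry.getKey())) {
        upgraded.put(entry.getKey(), entry.getValue());
      }
    }
    return upgraded;
  }

  public static void main(String[] args) {
    // Current schema expressed as field name -> default value.
    Map<String, Object> currentDefaults = new LinkedHashMap<>();
    currentDefaults.put("topic", "");
    currentDefaults.put("new_flag", false);

    // Row produced by an older version that predates "new_flag".
    Map<String, Object> oldRow = new LinkedHashMap<>();
    oldRow.put("topic", "events");

    System.out.println(upgrade(oldRow, currentDefaults));
  }
}
```

A compatibility test in the spirit of strategy (2) could keep sample rows recorded from older releases and assert that `upgrade` followed by the current parser accepts each of them.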

[1] https://github.com/apache/beam/blob/master/sdks/java/io/google-cloud-platform/src/main/java/org/apache/beam/sdk/io/gcp/bigquery/BigQueryIOTranslation.java
[2] https://github.com/apache/beam/blob/911c525348b81afd48a414018ba579c3b78e3262/sdks/java/io/kafka/upgrade/src/main/java/org/apache/beam/sdk/io/kafka/upgrade/KafkaIOTranslation.java#L70
[3] https://github.com/apache/beam/pull/30551
[4] https://github.com/apache/beam/pull/31685

Issue Priority

Priority: 2 (default / most feature requests should be filed as P2)

Issue Components

  • [ ] Component: Python SDK
  • [X] Component: Java SDK
  • [ ] Component: Go SDK
  • [ ] Component: Typescript SDK
  • [ ] Component: IO connector
  • [ ] Component: Beam YAML
  • [ ] Component: Beam examples
  • [ ] Component: Beam playground
  • [ ] Component: Beam katas
  • [ ] Component: Website
  • [ ] Component: Spark Runner
  • [ ] Component: Flink Runner
  • [ ] Component: Samza Runner
  • [ ] Component: Twister2 Runner
  • [ ] Component: Hazelcast Jet Runner
  • [ ] Component: Google Cloud Dataflow Runner

chamikaramj, Jun 25 '24 21:06

I think we might run into similar issues when we try to migrate the schema for SchemaCoder (for streaming update operations).

chamikaramj, Jun 25 '24 21:06

cc: @kennknowles @ahmedabu98 @robertwb

chamikaramj, Jun 25 '24 21:06

This could even impact jobs just running as cronjobs, right?

kennknowles, Jun 25 '24 21:06

You mean as opposed to streaming updates? If so, yes. This gets hit when a pipeline tries to upgrade a single transform to a new Beam version (via the TransformService, for example).

https://beam.apache.org/documentation/programming-guide/#transform-service-usage-upgrade

chamikaramj, Jun 25 '24 22:06