[Feature Request]: Develop a way to evolve the construction schema used by transforms safely
What would you like to happen?
We ran into several regressions due to updates to construction schema of I/O transforms breaking transform upgrade via TransformService (from older Beam versions).
The pattern is: (1) We add a new field/property to the I/O transform (2) We update the schema in the corresponding IOTranslation class (for example, [1] [2]) (3) We update the schema.
If we do not take special mitigation actions, this might result in breakages during upgrade. This is because the Row objects we try to parse might actually come from older Beam versions where the new field does not exist.
Example regressions/fixes: [3] [4]
Potential strategies to prevent similar issues in the future: (1) Develop utils to safely migrate the schema (2) More integration/compat tests (3) Move to proto
[1] https://github.com/apache/beam/blob/master/sdks/java/io/google-cloud-platform/src/main/java/org/apache/beam/sdk/io/gcp/bigquery/BigQueryIOTranslation.java [2] https://github.com/apache/beam/blob/911c525348b81afd48a414018ba579c3b78e3262/sdks/java/io/kafka/upgrade/src/main/java/org/apache/beam/sdk/io/kafka/upgrade/KafkaIOTranslation.java#L70 [3] https://github.com/apache/beam/pull/30551 [4]https://github.com/apache/beam/pull/31685
Issue Priority
Priority: 2 (default / most feature requests should be filed as P2)
Issue Components
- [ ] Component: Python SDK
- [X] Component: Java SDK
- [ ] Component: Go SDK
- [ ] Component: Typescript SDK
- [ ] Component: IO connector
- [ ] Component: Beam YAML
- [ ] Component: Beam examples
- [ ] Component: Beam playground
- [ ] Component: Beam katas
- [ ] Component: Website
- [ ] Component: Spark Runner
- [ ] Component: Flink Runner
- [ ] Component: Samza Runner
- [ ] Component: Twister2 Runner
- [ ] Component: Hazelcast Jet Runner
- [ ] Component: Google Cloud Dataflow Runner
We might run into similar issues when we try to migrate the schema for SchemaCoder I think (for streaming update operations).
cc: @kennknowles @ahmedabu98 @robertwb
This could even impact jobs just running as cronjobs, right?
You mean as opposed to streaming updates ? If so yes. This gets hit when a pipeline tries to upgrade a single transform to a new Beam version (via TransformService for example).
https://beam.apache.org/documentation/programming-guide/#transform-service-usage-upgrade