[RFC] Unified PPL Data Type
Is your feature request related to a problem?
Current State: Fragmented Data Type Systems in PPL Engines
Query engines such as OpenSearch PPL and Spark PPL employ distinct data type systems, creating interoperability challenges in multi-engine environments. Key examples include:
- Type Name Mismatches: OpenSearch PPL defines INTEGER (string representation: integer), while Spark PPL uses IntegerType (string representation: int). Although both represent semantically equivalent 32-bit signed integers, the syntactic inconsistency disrupts cross-engine workflows.
- Engine-Specific Types: OpenSearch PPL introduces specialized types like IP and GEO_POINT, which lack native equivalents in other engines.
Impact:
- Integration Issues: Tools like OpenSearch Dashboards face parsing errors or misaligned visualizations when processing results from engines with mismatched type systems.
- Manual Overhead: Users must rewrite queries or cast types explicitly when migrating between engines.
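As a minimal illustration of the mismatch described above, the sketch below compares the schema that two engines might report for the same logical column. The column name `status_code` and the helper `schemas_match` are hypothetical; only the type strings `integer` and `int` come from the examples in this issue.

```python
# Schemas as each engine might report them for the same logical column.
# "status_code" is a hypothetical column name used only for illustration.
opensearch_schema = {"status_code": "integer"}  # OpenSearch PPL string representation
spark_schema = {"status_code": "int"}           # Spark PPL string representation


def schemas_match(left: dict, right: dict) -> bool:
    """Naive comparison a client tool might perform on type strings."""
    return left == right


# Both columns are 32-bit signed integers, yet the naive check rejects them.
print(schemas_match(opensearch_schema, spark_schema))  # False
```

A consumer that keys its rendering or validation logic on the type string has to special-case each engine to get past this check.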
What solution would you like?
To eliminate friction and ensure seamless interoperability, all PPL-compliant engines should adopt a common data type system with the following principles:
- Standardized Type Names: Universal type names and string representations (e.g., int instead of INTEGER or IntegerType).
- Semantic Consistency: Equivalent types (e.g., 32-bit integers) must behave identically in syntax, casting rules, and operations (e.g., arithmetic, comparisons). Engine-specific types (e.g., ip, geo_point) should be opt-in extensions with clear documentation.
- Interoperability Guarantee: Queries and schemas written for one engine should execute seamlessly on others without manual adjustments.
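One way the principles above could look in practice is a small normalization layer that maps engine-specific names to a canonical name and admits extension types only when explicitly enabled. This is a sketch under stated assumptions, not a proposed implementation: `CANONICAL_ALIASES`, `EXTENSION_TYPES`, `normalize_type`, and `normalize_schema` are hypothetical names, and the alias table covers only the types mentioned in this issue.

```python
# Hypothetical alias table: engine-specific spellings -> unified type name.
# Only the integer example from this issue is covered.
CANONICAL_ALIASES = {
    "integer": "int",      # OpenSearch PPL string representation
    "integertype": "int",  # Spark PPL type class name
    "int": "int",          # Spark PPL string representation
}

# Engine-specific types treated as opt-in extensions (assumption for this sketch).
EXTENSION_TYPES = {"ip", "geo_point"}


def normalize_type(name: str, enabled_extensions=frozenset()) -> str:
    """Return the unified name for a type, or raise if it is not allowed."""
    key = name.strip().lower()
    if key in CANONICAL_ALIASES:
        return CANONICAL_ALIASES[key]
    if key in EXTENSION_TYPES:
        if key in enabled_extensions:
            return key
        raise ValueError(f"extension type {key!r} is not enabled")
    raise ValueError(f"unknown type {name!r}")


def normalize_schema(schema: dict, enabled_extensions=frozenset()) -> dict:
    """Normalize every column type in a {column: type} mapping."""
    return {
        column: normalize_type(type_name, enabled_extensions)
        for column, type_name in schema.items()
    }


# The same logical schema from two engines now compares equal.
opensearch = normalize_schema({"status_code": "INTEGER"})
spark = normalize_schema({"status_code": "int"})
print(opensearch == spark)  # True

# Extension types must be requested explicitly.
print(normalize_type("GEO_POINT", enabled_extensions={"geo_point"}))  # geo_point
```

Rejecting extension types by default keeps the core set portable: a schema that normalizes without any enabled extensions uses only types every compliant engine is expected to support.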
Do you have any additional context?
- ZetaSQL common data type. https://github.com/google/zetasql/blob/master/docs/data-types.md
- OpenSearch PPL data type. https://github.com/opensearch-project/sql/blob/main/docs/user/ppl/general/datatypes.rst