[Core Feature] Logical types: static type checking for higher level user defined types.

Open kumare3 opened this issue 3 years ago • 5 comments

Motivation: Why do you think this is important?

Flytekit, and in the future other SDKs, support progressive typing and allow users to define their own types. Today, the TypeTransformers in flytekit effectively result in type erasure at runtime: higher-level types are converted to the underlying Flyte types, and on retrieval the information about the source type is lost. This works in theory, as long as the receiving SDK has the right types defined. It also makes it easy to cast a value into any of its derivative types. The same technique has been used successfully in Java on the JVM.

Example of type derivatives: convert from a Spark DataFrame -> Flyte.Schema -> pandas DataFrame.

However, it is desirable to keep the source type available so that we can recover it, even when it is not explicitly requested.

Example: remote.get().outputs.x -> can be correctly cast if the source type is available

Moreover, one problem with type erasure is the loss of static type checking across languages and between different tasks.

To overcome this problem, this issue proposes introducing a new type called LogicalType, which keeps information about both the source type and the associated transport type.

Goal: What should the final outcome look like, ideally?

Users can specify new types, and we can reverse engineer those types from the stored definition. This helps with debugging, static type assertions, optimizations, and extensibility.

Describe alternatives you've considered

What exists today: type erasure!

[Optional] Propose: Link/Inline OR Additional context

From @kanterov: A logical type is a type alias for an existing LiteralType, and values of logical types are represented with existing Literals. Logical types can correspond to built-in or user-defined types in an SDK. A logical type is defined as follows (this approach is inspired by the Apache Beam proto):

message LogicalType {
  // Required. Unique resource name for LogicalType.
  // There is a list of well-known logical types supported by SDKs, 
  // and users can add their own
  string urn = 1; 
  
  // Required. Existing LiteralType used to represent values of LogicalType
  LiteralType representation = 2;

  // Optional. Additional argument for logical type. May be used to serialize additional information
  Literal argument = 3;

  // Optional. Type of argument.
  LiteralType argument_type = 4;
}

Example of urn

pandas.DataFrame, pyspark.DataFrame

Semantics

Type t1 is a supertype of logical type t2 iff at least one of the following holds:

  • t1 is strictly equal to t2
  • t1 is a supertype of t3, and t3 is a supertype of t2
  • t1 is a supertype of t2.representation

This allows us to read unknown logical types using their representation. E.g. if task_1 produces output: LogicalType(representation=INTEGER) and task_2 has input of INTEGER, it’s possible to bind task_2.input to task_1.output. However, it isn’t possible to do the opposite: use any INTEGER as LogicalType(representation=INTEGER).
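The supertype rules and the INTEGER binding example can be sketched in a few lines of Python. This is only an illustration: `SimpleType`, the `LogicalType` dataclass, and `is_supertype` are hypothetical stand-ins, not flytekit or flyteidl APIs, and the sketch assumes that the only way two distinct types are related is through `representation`.

```python
from dataclasses import dataclass
from typing import Union


@dataclass(frozen=True)
class SimpleType:
    """Hypothetical stand-in for a plain LiteralType such as INTEGER."""
    name: str


@dataclass(frozen=True)
class LogicalType:
    """Hypothetical mirror of the proposed message (argument fields omitted)."""
    urn: str
    representation: Union[SimpleType, "LogicalType"]


FlyteType = Union[SimpleType, LogicalType]


def is_supertype(t1: FlyteType, t2: FlyteType) -> bool:
    """Return True if a value of type t2 may be bound to an input of type t1."""
    if t1 == t2:
        # Strict equality.
        return True
    if isinstance(t2, LogicalType):
        # t1 is a supertype of t2.representation. Recursing through nested
        # representations also covers the transitive rule.
        return is_supertype(t1, t2.representation)
    return False


INTEGER = SimpleType("INTEGER")
INT32 = LogicalType(urn="INT32", representation=INTEGER)

print(is_supertype(INTEGER, INT32))  # True: an INTEGER input can read an INT32 output
print(is_supertype(INT32, INTEGER))  # False: an arbitrary INTEGER is not an INT32
```

Because the check only ever unwraps `representation`, it never needs to know what a particular urn means.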

SDKs have a list of well-known logical types that are mapped to built-in or custom types. flyteconsole or flytectl can have a special behaviour for well-known logical types.

flytepropeller shouldn’t introduce special behaviour for well-known logical types when type-checking. This limitation allows new logical types to be introduced without every Flyte component being aware of them. When an implementation encounters an unknown logical type, it should be safe for it to fall back to the type's representation.
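As a rough sketch of that fallback on the SDK side: the registry, the `decode` helper, and the urn strings below are all hypothetical, and the raw value stands in for an already-decoded Literal.

```python
from datetime import date, datetime
from typing import Any, Callable, Dict

# Hypothetical SDK-side registry: urn -> function that turns the value of the
# underlying representation into a richer native value.
WELL_KNOWN_DECODERS: Dict[str, Callable[[Any], Any]] = {
    # Assumed urn spelling for the "LOCAL DATE" logical type (a DATETIME underneath).
    "LOCAL_DATE": lambda dt: dt.date(),
}


def decode(urn: str, raw_value: Any) -> Any:
    """Decode a value of a logical type, falling back to its representation."""
    decoder = WELL_KNOWN_DECODERS.get(urn)
    if decoder is None:
        # Unknown logical type: the representation value is still usable as-is.
        return raw_value
    return decoder(raw_value)


known = decode("LOCAL_DATE", datetime(2021, 8, 19, 20, 8))
unknown = decode("com.example.SomeNewType", 42)
print(known)    # 2021-08-19
print(unknown)  # 42
```

A component that has never heard of `com.example.SomeNewType` still handles the value, just without the richer native type.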

Examples of well-known logical types

  • INT32 (represented as INTEGER)
  • FIXEDBYTES(N) (represented as BINARY): argument_type is INTEGER, representing the length of the fixed byte array
  • LOCAL DATE (represented as DATETIME): date without timezone
  • LOCAL DATETIME (represented as DATETIME): datetime without timezone
  • DECIMAL(P, D) (represented as BINARY): argument_type is {p: INTEGER, d: INTEGER}, where p is the precision and d is the number of digits after the decimal point
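To illustrate how the `argument` and `argument_type` fields carry a parameter, here is a minimal sketch for FIXEDBYTES(N). The dataclass, the `fixed_bytes` constructor, the plain-string type names, and the length check are assumptions made for illustration; the proposal does not prescribe where such validation would live.

```python
from dataclasses import dataclass
from typing import Any, Optional


@dataclass(frozen=True)
class LogicalType:
    """Hypothetical Python mirror of the proposed LogicalType message."""
    urn: str
    representation: str                  # name of the underlying LiteralType
    argument: Any = None                 # optional parameter value
    argument_type: Optional[str] = None  # type of that parameter


def fixed_bytes(n: int) -> LogicalType:
    """FIXEDBYTES(N): a BINARY value whose length is fixed to n bytes."""
    return LogicalType(
        urn="FIXEDBYTES",
        representation="BINARY",
        argument=n,
        argument_type="INTEGER",
    )


def check_fixed_bytes(logical_type: LogicalType, value: bytes) -> bytes:
    """Reject a byte string whose length differs from the declared argument."""
    if len(value) != logical_type.argument:
        raise ValueError(
            f"expected {logical_type.argument} bytes, got {len(value)}"
        )
    return value


sha256_digest = fixed_bytes(32)
print(check_fixed_bytes(sha256_digest, bytes(32)) == bytes(32))  # True
```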

Example: introducing INT32

flyteidl has an INTEGER type that is a 64-bit integer. It’s natural for SDK users to use 32-bit integers unless they need 64 bits. In Java, there are two separate types, Integer and Long, representing 32-bit and 64-bit integers. However, this creates a problem, because a 64-bit value can overflow when it is read into a 32-bit integer. Introducing a logical type for INT32 allows tasks to read INT32 only if the input is bound to a literal that is known to be INT32.
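The overflow can be made concrete with a short sketch. Python integers are unbounded, so `wrap_to_int32` below emulates what a narrowing cast such as Java's `(int)` does by keeping only the low 32 bits. `narrow_to_int32` is the runtime check an SDK would otherwise need when the type system cannot guarantee the range. Both helper names are hypothetical.

```python
INT32_MIN = -(2**31)
INT32_MAX = 2**31 - 1


def wrap_to_int32(value: int) -> int:
    """Emulate an unchecked 64-bit -> 32-bit narrowing (keeps the low 32 bits)."""
    return (value + 2**31) % 2**32 - 2**31


def narrow_to_int32(value: int) -> int:
    """Checked narrowing: raise instead of silently wrapping around."""
    if not INT32_MIN <= value <= INT32_MAX:
        raise OverflowError(f"{value} does not fit in a 32-bit integer")
    return value


too_big = INT32_MAX + 1           # a valid 64-bit INTEGER
print(wrap_to_int32(too_big))     # -2147483648: silent corruption
print(narrow_to_int32(12345))     # 12345: in range, so the check passes
```

With an INT32 logical type, the binding check rejects such a value before the task runs, instead of failing or wrapping at read time.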

kumare3 • Aug 19 '21 20:08