datafusion feat(logical-types): add NativeType and LogicalType

This PR is part of #12622. It introduces the notion of DataFusion NativeType and the LogicalType trait.

The changes stem from various discussions of #11513 #12644 #12635 and other. The goal of this PR is to unify those discussions into a concrete code proposal to be merged to main [1] in order to unblock work on #12644 #12635.

[1] as opposed to the logical-types branch used currently for experimenting #12622

Oct 10 '24 14:10 notfilippo

cc @sadboy @milevin @wolfram-s @wizardxz

Oct 14 '24 07:10 findepi

@alamb @andygrove @comphead please take a look

Oct 18 '24 11:10 findepi

I dont remember the roots, so wondering, can we investigate and use a single type system which is Arrow Types and get rid of other types. At the end of the day it is based on Arrow Types.

Oct 18 '24 16:10 comphead

dont remember the roots, so wondering, can we investigate and use a single type system which is Arrow Types and get rid of other types. At the end of the day it is based on Arrow Types.

I can try to summarise the discussion that has happened in the last months regarding this proposal. Currently DataFusion:

Doesn’t support extension types, both in the logical sense (e.g. JSON) and in the physical sense (e.g. a logical string natively is Utf8, LargeUtf8, Utf8View but ideally a user might want to also define a special physical type like List(u8) to be logically equivalent to a string)
Has a lot of redundant code to handle logically equivalent values during logical planning (i.e. ScalarValue, function signatures)
Doesn’t support any form of runtime adaptability and assumes that all record batches as input are of the same exact schema while potentially having all the infrastructure needed to be able to support late coercion of logically equivalent values during physical execution (which seems to be a problem that comet is also looking to solve)

while “arrow datatype everywhere” is definitely working for DataFusion currently, my opinion is that this is a needed step towards extensibility and it will help enterprise users looking to migrate their existing engine, custom file format and types to DataFusion, efficiently.

Oct 19 '24 22:10 notfilippo

cc @goldmedal @ozankabak maybe you can take a look at this PR as well if you have bandwidth

Oct 24 '24 12:10 notfilippo

Has the logical type branch reached its final state, and are we starting the migration?

Oct 24 '24 12:10 berkaysynnada

Has the logical type branch reached its final state, and are we starting the migration?

Not really. But it seems like this part of the change is already needed to unblock other tasks.

Oct 24 '24 12:10 notfilippo

I am traveling currently but will have some bandwidth to take a look early next week if noone beats me to it

Oct 24 '24 13:10 ozankabak

I might take a look at it this weekend.

Oct 24 '24 14:10 goldmedal

@notfilippo I think we need a way to know the physical types that we are able to casted to given the Logical type.

For example, if we have signature which expect Logical::String, and it accepts not only Utf8, LargeUtf8, Utf8View string family but also numeric int, we need to cast the int to string so we need to know the physical types given a logical type. We don't know which DataType we should cast to if we only know there is LogicalType::String

Therefore, I think we need to add another function that returns the physical types for LogicalType. Note that the order matters if we would like the cast to Utf8View, LargeUtf8 and then Utf8 in order.

Something like

    fn decode_types(&self) -> Vec<DataType> {
        vec![DataType::Utf8View, DataType::LargeUtf8, DataType::Utf8]
    }

postgres=# select concat('string', 123);
  concat   
-----------
 string123
(1 row)

is one of the example in postgres that cast a non-string to the stirng

What do you think?

Oct 25 '24 05:10 jayzhan211

@jayzhan211 I think this approach goes against the problems that logical types try to solve. Specifically I think that we could (and should) define some kind of casting rules for a logical types (NativeType::* -> String, Int* <-> Float*, JSON -> String) via a method to be implemented in the LogicalType trait (can_cast_to(&self, rhs: &dyn LogicalType)). Then during physical planning when a function that takes a Logical string receives in input a Logical Int8 it can only check if the logical cast is valid. During logical planning (and/or execution if we implement runtime adaptiveness) you would have instead type information for all the function parameters and you could decide to which type it makes more sense to cast.

Oct 25 '24 07:10 notfilippo

During logical planning (and/or execution if we implement runtime adaptiveness) you would have instead type information for all the function parameters and you could decide to which type it makes more sense to cast.

I'm not clear about this part. I assume that we only have Logical::Int64 and DataType::Int64 as the function input, and we also know the the function accepts Logical::String. Is there any other information we know? Otherwise, I think we have a question that which DataType::Utf8 or DataType::Utf8View we need to cast to.

Oct 25 '24 07:10 jayzhan211

Otherwise, I think we have a question that which DataType::Utf8 or DataType::Utf8View we need to cast to.

Ah! Now I understand the question. I think it makes sense to me that if there are no hints in the environment that suggest what specific type we should cast to then there is no value in choosing a specific cast type and we could fallback to some generic mapping e.g. casting to Native::String -> casting to DataType::Utf8. Not sure if there is any added benefit of letting the logical type choose it's sequence of casts instead of providing just one fallback type but maybe I'm missing something.

Currently in DataFusion if a signature accepts DataType::Utf8, LargeUtf8 and Utf8View and we receive DataType::Int64 what do we do? And what if the function accepts DataType::Utf8, LargeUtf8 but not Utf8View?

Oct 25 '24 08:10 notfilippo

Not sure if there is any added benefit of letting the logical type choose it's sequence of casts instead of providing just one fallback type but maybe I'm missing something.

~I have no string opinion on this too, but since the mapping is strongly related to LogicalType itself, for the point of customization, I think it makes sense to let the user define the types and the preference they want. Although this maybe a premature optimization that no one really use it~. I think user defined types may need this, since we don't know what is the backing native type of it, so we can't build up a default map for them

Currently in DataFusion if a signature accepts DataType::Utf8, LargeUtf8 and Utf8View and we receive DataType::Int64 what do we do? And what if the function accepts DataType::Utf8, LargeUtf8 but not Utf8View?

The order is Utf8View -> LargeUtf8 -> Utf8. I think most of the function accepts those 3 together, some doesn't support Utf8View yet, but it is just not yet implemeneted.

Oct 25 '24 08:10 jayzhan211

since we don't know what is the backing native type of it

Isn't this the goal on having native() on every LogicalType (even user defined ones)? Anyway we could define a default mapping for NativeTypes and let user defined types override that behaviour via the trait.

Something like:

pub trait LogicalType: Sync + Send {
    fn native(&self) -> &NativeType;
    fn signature(&self) -> TypeSignature<'_>;
    
    fn default_cast(&self) -> &[DataType] {
        self.native().default_cast()
    }
}

and then:

impl LogicalType for NativeType {
    fn native(&self) -> &NativeType {
        self
    }

    fn signature(&self) -> TypeSignature<'_> {
        TypeSignature::Native(self)
    }

    fn default_cast(&self) -> &[DataType] {
        use NativeType::*;
        match self {
            Utf8 => &[DataType::Utf8View, DataType::LargeUtf8, DataType::Utf8],
            _ => todo!()
        }
    }
}

But maybe to support the fields of container types (Lists, ...) we might need to use a Vec<DataType> instead of &[DataType]. Anyhow are we sure we need a list?

The order is Utf8View -> LargeUtf8 -> Utf8.

So this means that if you received an Int64 then you first try to coerce to Utf8View and in what cases would you fall back to the LargeUtf8 and Utf8?

P.S. Thanks Jay for taking the time to reply to this, I really appreciate it 😊

Oct 25 '24 09:10 notfilippo

So this means that if you received an Int64 then you first try to coerce to Utf8View and in what cases would you fall back to the LargeUtf8 and Utf8?

Good question, maybe we just need a single type and cast to that one. If we can't, it probably mean they don't belong to the same logical type.

But that also means we will always cast to the large type like largeutf8 or largeutf8view?( I think we will have one in the future)

Oct 25 '24 10:10 jayzhan211

But that also means we will always cast to the large type like largeutf8 or largeutf8view?( I think we will have one in the future)

Yeah, for this particular edge case that is the only safe and sensible option I can think of.

edit: we could also use something like this

fn default_cast_for(&self, origin: &DataType) -> DataType {
      match (self, origin) {
          (Self::Utf8, DataType::Binary) => DataType::Utf8,
          (Self::Utf8, DataType::LargeBinary) => DataType::LargeUtf8,
          // Others...
          (Self::Utf8, _) => DataType::LargeUtf8,
          _ => todo!()
      }
  }

Essentially meaning that if we want to cast a Logical Int8 to a Logical Utf8 the best DataType we can use is Utf8. I can probably work a little more on the naming but the idea should be similar.

Oct 25 '24 11:10 notfilippo

@jayzhan211 I have a working prototype of the ideas we discussed above.

Oct 25 '24 14:10 notfilippo

I have no further concern so far, we can always add more function or modify it when we meet the use case.

Thank you for reviewing and helping this project along @jayzhan211 🙏

Oct 26 '24 11:10 alamb

Thanks a lot for the review @jayzhan211 . I've added all the suggestion mentioned :)

Oct 26 '24 16:10 notfilippo

Shall we merge this PR?

Oct 29 '24 11:10 alamb

Shall we merge this PR?

I think there is enough agreement to start working on the other pieces of the puzzle.

Oct 29 '24 12:10 notfilippo

@findepi - what do you think?

Oct 29 '24 13:10 ozankabak

If there are no more comments on this PR, I think we can merge it tomorrow. @alamb @jayzhan211 @ozankabak - what do you think?

Oct 31 '24 16:10 goldmedal

I think this is ready to merge. Additional improvements can be made in a follow-up

Nov 03 '24 03:11 jayzhan211

Epic work!

Nov 05 '24 17:11 alamb