iceberg-rust icon indicating copy to clipboard operation
iceberg-rust copied to clipboard

Discussion: Rethink `PrimitiveLiteral`.

Open liurenjie1024 opened this issue 1 year ago • 12 comments

In our discussion in #77, we decoupled type info from struct values. I'm also facing some challenge when implementing #153. In short, the PrimitiveLiteral is kind of weird:

  1. Sometimes it's self contained, for simple types such as int, date, string.
  2. Sometimes it's not self contained, for example decimal.

Since we have decided to decouple type info from actual values, I think we should further do it to primitive types. But sometimes it would be quite useful so associated type info with actuals, for example when we want to allow user to construct an expression a < 3.25, where 3.25 is a decimal.

I would suggest to do following refactoring:

  1. Further simplify PrimitiveLiteral to have only several physical variants: bool, i32, i64, i128, float, double, bytes, utf8. It's mainly used as in memory data structures for things like manifest, data files, etc.

  2. Add a new type Datum which is self contained primitive value, it can be implemented as following:

pub enum Datum {
   bool(bool),
    int(i32),
    ...
    datetime(DateTime<Utc>),
    decimal(Decimal),
}

Datum is mainly used in user facing apis, such as expression construction.

Alternatives

  1. Should we implement PrimitiveLiteral just as Vec<u8>?

It's possible, but I don't think it's a good idea. By using physical types, we can avoid a lot unsafe codes, and make things easier to maintain.

liurenjie1024 avatar Jan 09 '24 07:01 liurenjie1024

cc @Xuanwo @JanKaul @Fokko @ZENOTME

liurenjie1024 avatar Jan 09 '24 10:01 liurenjie1024

Add a new type Datum which is self contained primitive value, it can be implemented as following

This idea looks good to me. But I'm concerned about the conversion effort for two representations. e.g. The user uses expr Decimal(Decimal), and the partition value in the data file is Decimal(i128). Then we should make an effort to convert them and compare them. And we also should deal with the case that if Decimal(Decimal) is i256.

The Datum seems like Literal with type info, so I think a new representation🤔:

struct Datum {
  ty: DataType,
  value: Literal 
}

The benefit of this may be less compatible effort. To make user-facing APIs more simplified, maybe we can let the user provide Decimal and convert it into Literal + DataType in the inner.

ZENOTME avatar Jan 09 '24 11:01 ZENOTME

Sometimes it's self contained, for simple types such as int, date, string. Sometimes it's not self contained, for example decimal.

What do you mean by self-contained? For example, why is a date self-contained, and a decimal not?

Let me elaborate on how we do things in PyIceberg, because I think the approach we have there is quite similar, but I don't think we need to have two distinct types of literals. I just want to make sure that we make the right decision here, because this will be hard to undo later on. Please bear with me here, since Rust is not OOP, I might not be able to comprehend the whole issue here 😛 But I'm sure that this might help as some food for thought and help us to find the right solution.

Within PyIceberg we have Literal ABC that has all the primitive (str, int, bytes, float, double, decimal, etc), but also the logical ones (date, uuid, etc). When you first do the expression a < 3.25 we have the function literal(val: Any) that's called by the parser. This will result in LessThan(Ref('a'), FloatLiteral(3.25)) in this case because that's the representation in Python (type(3.25) => <class 'float'>). But the point is that we don't care at this moment because we don't know yet the type of the schema. Once we start planning the query, we'll look up the type of Ref('a') in the schema, and we find out that it is a Decimal. What we do then, is FloatLiteral(3.25).to(DecimalType()) and we try to parse it to a decimal. If it fails, we'll raise an exception.

I think this goes back to an earlier discussion:

pub enum Datum {
   bool(bool),
    int(i32),
    ...
    datetime(DateTime<Utc>),
    decimal(Decimal),
}

Internally, the DateTime in Iceberg is milliseconds since epoch. And I think that also should be the internal representation for simplicity and performance reasons. What I would suggest is to have a similar structure where you can just accept a DateTime<Utc>, i32, i64 that later can be mapped to the actual type. This way you give the user the freedom to use the type that they like, but also the flexibility to use the same structure when you parse SQL (where you can just assume that everything is a string, and you'll parse it to the correct type when binding).

And we also should deal with the case that if Decimal(Decimal) is i256.

With the approach mentioned above, I don't think this is a problem, since we know the size of the Decimal when binding to the type (we just take the size that's appropriate for the precision of the type). Looking ahead, also nanosecond timestamps are being added with just the physical type i64, but there is a fair chance that i128 might be added later.

Another similar situation that tackles this approach is schema evolution. Let's say that at some point you promote a float to double field. In that case, you want to bind to the file-schema, this way you'll convert the literal to a float when you have older datafiles where the field is still a float, and you'll convert it to a double in case the datafiles have been written with the new schema. Fun thing, if you then evaluate float.max + 1 against a float field, you can just return AlwaysFalse() without having to open the files at all :)

I hope this helps, and let me know if there are any other questions, of if you like me to elaborate on something in particular.

Fokko avatar Jan 09 '24 16:01 Fokko

What do you mean by self-contained? For example, why is a date self-contained, and a decimal not?

Sorry for the unclear description. Currently the decimal value is stored in an Decimal(i128), that means we lost the scale in type. (Think about the case we want to print it for error reporting, we can't correctly print it, but it's not the case the Date(i32).

In rust api, the Literal doesn't contains complete type info. Compared with python/java api, Literal is used in more places, for example, in the upper/lower bounds of partition field summary, rust uses Literal rather than bytes. This way we can do some check when manipulating and avoid unsafe codes.

But when implementing expressions, I realized that it's sometimes convenient and user friendly to associate type info with raw literal value. For example, when we need to print unbound expression.

I think this goes back to an earlier discussion:

pub enum Datum { bool(bool), int(i32), ... datetime(DateTime<Utc>), decimal(Decimal), }

Yes, but this Datum serves for different purpose, it's mainly user facing api to construct expression.

The Datum seems like Literal with type info, so I think a new representation🤔:

struct Datum { ty: DataType, value: Literal }

This sounds a good idea to me, which doesn't introduce extra maintain effort. The only concern is that it maybe not quite user friendly, but it seems that we can add some api to make it easier to use.

liurenjie1024 avatar Jan 10 '24 03:01 liurenjie1024

Following @Fokko's reasoning, Decimal is comparable to TimestampZ where the timezone is stored in the type. Similarly the scale of the Decimal is stored in the type.

I think it makes sense to think about the use cases for Literal. It is used for partition values and default values. Both require only the physical representation. The scale is actually not needed and Literal(i128) would suffice for these use cases.

@liurenjie1024 mentioned error messages as another use cases. That's the only time that the i128 representation might not be suitable. The question is whether the error messages warrant a more complex implementation.

Regarding @Fokko's example: Doesn't initially storing the Decimal as a LiteralFloat loose accuracy because the 3.25 is stored as something like 3.24999999987. If you then convert it to Decimal, it's inaccurate. Maybe you could use PrimitiveLiteral::String here.

JanKaul avatar Jan 10 '24 10:01 JanKaul

@liurenjie1024 mentioned error messages as another use cases. That's the only time that the i128 representation might not be suitable. The question is whether the error messages warrant a more complex implementation.

Regarding @Fokko's example: Doesn't initially storing the Decimal as a LiteralFloat loose accuracy because the 3.25 is stored as something like 3.24999999987. If you then convert it to Decimal, it's inaccurate. Maybe you could use PrimitiveLiteral::String here.

Error message is just an example, not all use cases. For example when we convert unbound expression to bound expression, how do we know its original scale?

String is enough for storing decimal, or everything, but it maybe weird in api, since with only a string we don't know its original type, e.g. user may write an unbound expression like a < "3.23", where a is a decimal and it's legal to compare it with string.

I do admit that introducing another enum will be difficult to maintain, maybe the solution suggested by @ZENOTME is great:

struct Datum {
  typ: PrimitiveType,
  literal: PrimitiveLiteral
}

cc @Fokko @JanKaul @ZENOTME @Xuanwo What do you think?

liurenjie1024 avatar Jan 11 '24 01:01 liurenjie1024

First of all, thanks for the explanation of the Literal.

This sounds a good idea to me, which doesn't introduce extra maintain effort. The only concern is that it maybe not quite user friendly, but it seems that we can add some api to make it easier to use.

I think the user should not care about the type of the field. We have the information once we bind to the schema. In extend to that, it can also be that a field has evolved from an int to a long, so you have to convert it anyway based om the file schema.

Regarding @Fokko's example: Doesn't initially storing the Decimal as a LiteralFloat loose accuracy because the 3.25 is stored as something like 3.24999999987. If you then convert it to Decimal, it's inaccurate. Maybe you could use PrimitiveLiteral::String here.

This should not happen in practice, because you would only do safe conversions. For example, the expression a > 3.25 should parse to an expression that has enough precision to capture the original value. If a would be of type float, then you would convert the literal to a float one once you bind to the schema.

Error message is just an example, not all use cases. For example when we convert unbound expression to bound expression, how do we know its original scale?

Also in the case of evolution, the precision can be widened. When we bind the decimal to the actual schema, we'll set the scale using set_scale.

String is enough for storing decimal, or everything, but it maybe weird in api, since with only a string we don't know its original type, e.g. user may write an unbound expression like a < "3.23", where a is a decimal and it's legal to compare it with string.

SQL is a perfect example of that API I would say. Once you resolve the type of a < "3.23", you'll call i128::from_str("3.23").

My suggestion:

struct BoundLiteral {
  typ: PrimitiveType,
  literal: PrimitiveLiteral
}

And it is that one is not part of the public facing API :)

Fokko avatar Jan 11 '24 20:01 Fokko

I think we have reached consensus that storing a BoundLiteral in UnboundExpression is acceptable. About the visibility of this struct, I would argue that we still need to expose it user, for following reason:

  1. When user construction UnboundExpression, it's not always from sql. For example when we integrate it with other sql engines, it may push filter to iceberg api to construct UnboundExpression.
  2. The parse string method implemented in pyiceberg is not a typical approach in rust. Rust has elegant support for macros, which is efficient and type safe.

So my proposal is that we still make BoundLiteral public, but we should be careful not exposing its internals in public api. For the first version, we'll only expose builder methods so that users can create it easily, and no internals will be exposed.

WDYT? cc @Fokko @ZENOTME @Xuanwo

liurenjie1024 avatar Jan 12 '24 07:01 liurenjie1024

When user construction UnboundExpression, it's not always from sql. For example when we integrate it with other sql engines, it may push filter to iceberg api to construct UnboundExpression. The parse string method implemented in pyiceberg is not a typical approach in rust. Rust has elegant support for macros, which is efficient and type safe.

This case looks reasonable to me.

So my proposal is that we still make BoundLiteral public, but we should be careful not exposing its internals in public api. For the first version, we'll only expose builder methods so that users can create it easily, and no internals will be exposed.

Does this proposal look like following?

let literal = BoundLiteral::builder.with_literal(...).build();
TableScan scan = scan.filter(Expressions.equal("id", literal));

ZENOTME avatar Jan 12 '24 07:01 ZENOTME

Does this proposal look like following?

let literal = BoundLiteral::builder.with_literal(...).build(); TableScan scan = scan.filter(Expressions.equal("id", literal));

Yes.

liurenjie1024 avatar Jan 12 '24 07:01 liurenjie1024

  1. The parse string method implemented in pyiceberg is not a typical approach in rust. Rust has elegant support for macros, which is efficient and type safe.

This was more of an example that you don't know the types upfront.

So my proposal is that we still make BoundLiteral public, but we should be careful not exposing its internals in public api. For the first version, we'll only expose builder methods so that users can create it easily, and no internals will be exposed.

I don't think the BoundExpression is something that a user would ever use as it is internal.

let literal = BoundLiteral::builder.with_literal(...).build();
TableScan scan = scan.filter(Expressions.equal("id", literal));

I think we shouldn't it call BoundLiteral then because it hasn't been bound, right? The binding will happen when you do an action. If you in the scan select a different snapshot or branch, the field that it points to could be different (hopefully not, because that would be confusing). For the public API, I would expect:

let literal = Literal::uuid_from_str("a1a2a3a4-b1b2-c1c2-d1d2-d3d4d5d6d7d8").unwrap();
TableScan scan = scan.filter(Expressions.equal("id", literal));

Once we know the schema, and we perform an action, we'll bind to it:

TableScan scan = scan.filter(Expressions.equal("id", literal)).snapshot(id=123).to_df();

Fokko avatar Jan 12 '24 21:01 Fokko

I don't think the BoundExpression is something that a user would ever use as it is internal.

let literal = BoundLiteral::builder.with_literal(...).build(); TableScan scan = scan.filter(Expressions.equal("id", literal));

I think we share same concern and idea about the api design, but I failed to clarify the problem🤪.

I'll submit a pr and let's discuss it in pr, which helps to clarify the problem.

liurenjie1024 avatar Jan 13 '24 11:01 liurenjie1024