
If a precision timestamp literal is encountered there is no way to know the precision

Open westonpace opened this issue 11 months ago • 8 comments

Currently precision timestamp literals are defined as:

      // If the precision is 6 or less then this is the microseconds since the UNIX epoch
      // If the precision is more than 6 then this is the nanoseconds since the UNIX epoch
      uint64 precision_timestamp = 34;
      uint64 precision_timestamp_tz = 35;

However, there is no way to know if the precision is 6 or less because the literal doesn't have to be associated with a type. For example, consider:

SELECT * FROM my_table WHERE my_ts_col < timestamp (3) '2004-10-19 10:23:54.123'

This will turn into a plan with a literal value in the "less than" expression:

"scalarFunction":{
  "functionReference":3,
  "outputType":{..},
  "arguments":[
    { ... },
   ...
   "literal":{ "precision_timestamp": 1098181434123000 }

However, there is no way of knowing what the precision of that timestamp is. I could infer it by tracing back and realizing that the other value in the expression has precision X but:

  • That's a lot of work, and I don't have to do this for other literals
  • They don't technically have to be the same, depending on what range of functions the producer / consumer supports

westonpace avatar Mar 19 '24 19:03 westonpace

This behavior is identical to what happens with fixed_char, var_char, and decimal literals. The way this is addressed in Isthmus is to immediately cast the literal to the appropriate precision. For instance, a fixed character would be represented with the string "fruit" and later cast to the actual desired width, such as fixed_char[20].
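
For illustration, the cast-around-the-literal approach would wrap the bare string literal in a cast expression. This is only a rough sketch, not actual Isthmus output; the field names follow Substrait's protobuf-JSON conventions approximately:

    "cast": {
      "type": { "fixedChar": { "length": 20 } },
      "input": { "literal": { "string": "fruit" } }
    }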

EpsilonPrime avatar Mar 21 '24 02:03 EpsilonPrime

First, I'm not sure that's true for varchar or decimal; the precision is part of the literal:

    message VarChar {
      string value = 1;
      uint32 length = 2;
    }
    message Decimal {
      // little-endian twos-complement integer representation of complete value
      // (ignoring precision) Always 16 bytes in length
      bytes value = 1;
      // The maximum number of digits allowed in the value.
      // the maximum precision is 38.
      int32 precision = 2;
      // declared scale of decimal literal
      int32 scale = 3;
    }

With fixed_char / fixed_binary I can at least form some kind of "minimum precision literal". E.g. given the literal "foo" I can turn it into a fixed_char<3>.

I can't do that with timestamp. Given the literal 1710989532 how am I supposed to know if this is:

  • Timestamp(0) - Thursday, March 21, 2024 2:52:12 AM
  • Timestamp(3) - Tuesday, January 20, 1970 7:16:29.532 PM
  • Timestamp(6) - Thursday, January 1, 1970 12:28:30.989532 AM
  • Timestamp(9) - Thursday, January 1, 1970 12:00:01.710989532 AM

All of those are valid timestamps.
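
A quick way to see the ambiguity is to decode the same raw integer under each candidate precision. A small sketch in Python (note that datetime only carries microseconds, so the precision-9 case is rounded away):

    from datetime import datetime, timezone

    raw = 1710989532  # the bare literal value, with no precision attached

    # Interpret the same integer as seconds, milli-, micro-, and nanoseconds
    # since the UNIX epoch.
    for precision in (0, 3, 6, 9):
        seconds = raw / 10**precision
        print(f"Timestamp({precision}):",
              datetime.fromtimestamp(seconds, tz=timezone.utc))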

westonpace avatar Mar 21 '24 02:03 westonpace

The core issue is that there isn't a fixed unit for the literal (it's either microseconds or nanoseconds but not both). I'd argue that the definition should just be nanoseconds since the epoch.

Having the precision in the type would certainly make things easier for consumers (by greatly reducing the number of casts where literals are present).

The precision of varchar and decimal might be related to an older specification version; I've got examples of explicit casting to add precision. It also might be that the casting is there because the SQL parser is recognizing string and integer values and needed to cast them after the fact.

EpsilonPrime avatar Mar 21 '24 04:03 EpsilonPrime

Ah, good point, there are really only two possibilities, not four.

Yes, if it were just nanoseconds then that would be easier. That was part of the original proposal. Unfortunately, the range of 64-bit nanosecond timestamps is only about 550 years. At the time, we had declared 10k years to be the range of our timestamp data type. I believe we ended up dropping any declaration of range as part of the PR.
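
For reference, the back-of-the-envelope arithmetic behind that range limit (my numbers, not from the thread, but in the same ballpark):

    # A signed 64-bit count of nanoseconds covers roughly +/- 292 years
    # around the epoch, i.e. a total span of under 600 years.
    NS_PER_YEAR = 1_000_000_000 * 60 * 60 * 24 * 365.25
    print(2**63 / NS_PER_YEAR)  # ~292 years on each side of the epoch
    print(2**64 / NS_PER_YEAR)  # ~584 years total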

Still, nanoseconds-only would be rather limiting (we support 10k years on date, interval_year and interval_day).

However, if we want to do nanoseconds-only, then we could make it a 128-bit integer.

It also might be that the casting is there because the SQL parser is recognizing string and integer values and needed to cast them after the fact.

That seems very possible. In SQL there are only a few literal types like strings (e.g. 'foo') and numbers (e.g. 11). Even timestamps are just "strings" in the beginning. Late inference is very much a necessity. The fact that Substrait doesn't require this is definitely convenient for consumers.

Although, some of that late inference is often still needed anyway. SQL parsers will often turn a numeric literal into either the smallest available integer type or just always use u64. Either way, a consumer will probably have to apply some kind of cast so that the literal fits the data type used in the operation, as in the example below.
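
As a hypothetical example (my_i64_col is a made-up column name here):

    SELECT * FROM my_table WHERE my_i64_col < 11

A parser may emit the literal 11 as an i8 or i32, and the consumer still has to cast it (or the column) to i64 before the comparison can be evaluated.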

westonpace avatar Mar 21 '24 11:03 westonpace

However, if we want to do nanoseconds-only, then we could make it a 128-bit integer.

That being said, I would still rather just declare the precision as part of the message. This is more consistent with our current definition of decimal / varchar and then I don't have to worry about dealing with bigint.
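
For illustration, declaring the precision as part of the literal message might look something like this. A sketch only; the field names and numbers are placeholders, not the final design:

    message PrecisionTimestamp {
      // Number of fractional-second digits (e.g. 0, 3, 6, or 9).
      int32 precision = 1;
      // Count of 10^(-precision) second units since the UNIX epoch.
      int64 value = 2;
    }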

westonpace avatar Mar 21 '24 11:03 westonpace

I'm on board with fully specifying the literal. We should also include the missing timezone part in the tz variant.

EpsilonPrime avatar Mar 21 '24 22:03 EpsilonPrime

We should also include the missing timezone part in the tz variant as well.

I'm not sure what you mean by this. The timestamp_tz data type does not store a timezone-per-cell. What is the missing timezone part?

westonpace avatar Mar 22 '24 15:03 westonpace

Done by #659

Blizzara avatar Jul 12 '24 13:07 Blizzara