
PARQUET-675: Add INTERVAL_YEAR_MONTH and INTERVAL_DAY_TIME types

Open julienledem opened this issue 9 years ago • 12 comments

julienledem avatar Oct 26 '16 18:10 julienledem

Ugh, we didn't realize that this hadn't been merged. We've been using the representations here with multiple customers for the last six months, using the annotations and formats above, so the reality is that there are files with the representation described in the PR in the wild at this point.

I'm happy to help get this over the finish line but we should wrap up discussion.

Michal is recommending a collapsed format for day. However, I think that is a bit surprising to users since it is lossy. A user may want to express 28 hours; the collapsed format would replace this with 1 day, 4 hours, so I'm inclined to stick with the separate, non-lossy representation. It is a trivial calculation if you want to canonicalize for internal processing, but the storage should be lossless.
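To make the lossiness concrete, here is a small illustrative sketch (not Parquet code, just the arithmetic being discussed) of why collapsing hours into days cannot be undone:

```python
# Illustrative sketch: collapsing hours into days is a lossy canonicalization.
def collapse(days, hours):
    """Roll every 24 hours into a day, keeping the remainder as hours."""
    return days + hours // 24, hours % 24

# A user writes "28 hours" (0 days, 28 hours)...
assert collapse(0, 28) == (1, 4)
# ...and "1 day, 4 hours" collapses to the very same value,
assert collapse(1, 4) == (1, 4)
# so once collapsed, the two original inputs can no longer be told apart.
```

Storing days and hours separately keeps both originals recoverable, while canonicalizing on read remains a one-line computation.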

@rdblue and @julienledem, what are your thoughts on the previous points? It seems like there was some discussion of a group, but Julien and I preferred a consolidated approach since we don't see any scenario where you'd only be interacting with one of the two items in the group. Additionally, since intervals are such a small part of every dataset I've seen in the real world, I'm not inclined to spend a lot of engineering work trying to optimize them too extensively. That's probably just laziness talking... but I like to think of it as pragmatism :)

jacques-n avatar Aug 12 '17 16:08 jacques-n

My suggestion to use the collapsed format was based on the assumption that the point of moving away from the old Interval format was to make it closer to the SQL standard definition of the interval data type. However the classic SQL implementations are usually far from implementing the standard in the first place. MySQL and MS SQL Server do not have interval as a storable data type at all. PostgreSQL allows literals and declaration as standard (with an exception of explicit precision, like interval hour to second(3)), but internally it is all one type holding 3 components (months, days and seconds; like Parquet now, but probably as floating point numbers). Even then it equates 1 day with 24 hours (and surprisingly also 1 month with 30 days) for comparison and order, and it is inconsistent with conversion between those fields:

> select interval '2' day
0 years 0 mons 2 days 0 hours 0 mins 0.00 secs
> select interval '2' day / 4 * 4
0 years 0 mons 0 days 48 hours 0 mins 0.00 secs
> select interval '1' month / 4 * 4
0 years 0 mons 28 days 48 hours 0 mins 0.00 secs

Of the common SQL implementations, Oracle seems to be the closest to the standard. The bottom line is that, given the current state of the SQL implementations, it is hard to guess what users expect when it comes to the semantics of the interval types.

Either way, I think the comparison semantics should be well defined within the Parquet format, as this influences filter-pushdown features like min/max statistics and hashing. Of course, if there is no linear order on intervals, then min/max cannot even be defined. This issue is important because it may affect the correctness of systems interfacing with Parquet, even if efficiency is not key for this data type.
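As a sketch of why the spec must pin this down: one possible comparison rule (using the same fixed conversions Postgres applies for ordering, 1 month = 30 days and 1 day = 24 hours, purely as an illustrative assumption) makes distinct stored values compare equal:

```python
# Sketch of one possible total order on (months, days, seconds) intervals,
# flattening to seconds under fixed conversions (1 month = 30 days, 1 day = 24 h).
def sort_key(months, days, seconds):
    return (months * 30 + days) * 86400 + seconds

# "28 hours" vs "1 day, 4 hours": distinct stored values...
a = sort_key(0, 0, 28 * 3600)
b = sort_key(0, 1, 4 * 3600)
assert a == b   # ...that compare equal under this order.
# Min/max statistics written by one engine are only safe for filter pushdown
# in another engine if both agree on this ordering, so it belongs in the spec.
```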

michal-databricks avatar Aug 14 '17 11:08 michal-databricks

Whatever decision is reached here, could we incorporate this into the related Arrow patch https://github.com/apache/arrow/pull/920 so that we can round-trip to Arrow memory without any data loss?

wesm avatar Aug 15 '17 04:08 wesm

@julianhyde, would love your thoughts on this as you know more than anybody on these things.

Oracle:

> SELECT TIMESTAMP '2017-03-12 0:00:00 US/Pacific' + INTERVAL '1' HOUR FROM DUAL
2017-03-12 01:00:00

> SELECT TIMESTAMP '2017-03-12 0:00:00 US/Pacific' + INTERVAL '2' HOUR FROM DUAL
2017-03-12 03:00:00

> SELECT TIMESTAMP '2017-03-12 0:00:00 US/Pacific' + INTERVAL '3' HOUR FROM DUAL
2017-03-12 04:00:00

> SELECT TIMESTAMP '2017-03-12 0:00:00 US/Pacific' + INTERVAL '24' HOUR FROM DUAL
2017-03-13 01:00:00

> SELECT TIMESTAMP '2017-03-12 0:00:00 US/Pacific' + INTERVAL '1' DAY FROM DUAL
2017-03-13 01:00:00

> SELECT TIMESTAMP '2017-03-12 0:00:00 US/Pacific' + INTERVAL '86400' SECOND(2,4) FROM dual
2017-03-13 01:00:00

Postgres:

> SELECT TIMESTAMP '2017-03-12 0:00:00 PST' + INTERVAL '1' HOUR
2017-03-12 01:00:00

> SELECT TIMESTAMP '2017-03-12 0:00:00 PST' + INTERVAL '2' HOUR
2017-03-12 03:00:00

> SELECT TIMESTAMP '2017-03-12 0:00:00 PST' + INTERVAL '3' HOUR
2017-03-12 03:00:00

> SELECT TIMESTAMP '2017-03-12 0:00:00 PST' + INTERVAL '24' HOUR
2017-03-13 00:00:00

> SELECT TIMESTAMP '2017-03-12 0:00:00 PST' + INTERVAL '1' day
2017-03-13 00:00:00

> SELECT TIMESTAMP '2017-03-12 0:00:00 US/Pacific' + INTERVAL '86400' SECOND
2017-03-13 00:00:00

So people don't agree on what adding intervals across DST shifts means. However, each system is internally consistent: the shift is handled the same whether we're using 1 day or 86400 seconds. This means that intervals should be comparable.

On the flip side, Postgres maintains a distinction between 25 hours and 1 day, 1 hour, whereas Oracle does not:

Oracle:

> SELECT INTERVAL '25' HOUR FROM dual
1 1:0:0.0

> SELECT INTERVAL '1 1' DAY TO HOUR FROM dual
1 1:0:0.0

Postgres:

> select INTERVAL '25' hour
0 years 0 mons 0 days 25 hours 0 mins 0.00 secs

> select INTERVAL '1 1' day to hour
0 years 0 mons 1 days 1 hours 0 mins 0.00 secs

jacques-n avatar Aug 17 '17 18:08 jacques-n

These examples are complicated. I don't recall whether the Oracle ones will use TIMESTAMP WITH TIME ZONE or TIMESTAMP WITH LOCAL TIME ZONE literals. And Postgres will use TIMESTAMP WITH TIME ZONE, but its semantics are non-standard.

Also, I don't know whether PST and US/Pacific are the same, and whether either of them time-shifts.

I'm not an expert on TIMESTAMP WITH TIME ZONE. Evidently its arithmetic is more complicated than TIMESTAMP, so someone should consult the SQL standard.

Could you punt on TIMESTAMP WITH TIME ZONE for now?

Regarding your second example: I'm pretty sure both systems will give the two literals different types, interval hour and interval day to hour. So they are distinct in both systems, regardless of how they are printed. Distinct in the same sense that integer 2 and double 2 are distinct: they have different types, may or may not be printed as the same string, but yield the same value if converted to each other's type.

IIRC calcite would print '1 1' and '25', but store 25 * 60 * 60 * 1000 for both. If arrow is being frugal it should store 25 for both.
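The point about storage flattening to a single count can be sketched as follows (milliseconds per the Calcite description, hours as the frugal option mentioned for Arrow):

```python
# Once a unit is fixed, INTERVAL '25' HOUR and INTERVAL '1 1' DAY TO HOUR
# collapse to the same stored count, even though their declared types differ.
MS_PER_HOUR = 60 * 60 * 1000

interval_25_hours = 25 * MS_PER_HOUR          # INTERVAL '25' HOUR
interval_1d_1h = (24 + 1) * MS_PER_HOUR       # INTERVAL '1 1' DAY TO HOUR

assert interval_25_hours == interval_1d_1h == 90_000_000  # ms, as in Calcite
assert interval_25_hours // MS_PER_HOUR == 25             # the frugal encoding
```

The type (and hence the display format, '1 1' vs '25') lives in metadata; the stored value is just the count.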

julianhyde avatar Aug 18 '17 08:08 julianhyde

@jacques-n, Postgres does actually make use of the difference between hours and days in some cases:

SET TIME ZONE 'US/Pacific';

SELECT TIMESTAMP WITH TIME ZONE '2017-03-12 0:00:00 PST' + INTERVAL '24' HOUR;
2017-03-13T08:00:00Z

SELECT TIMESTAMP WITH TIME ZONE '2017-03-12 0:00:00 PST' + INTERVAL '1' DAY;
2017-03-13T07:00:00Z

However, the docs clearly state this is not standard SQL: "(...) the SQL standard has an odd mix of date and time types and capabilities. Two obvious problems are: (...) The default time zone is specified as a constant numeric offset from UTC. It is therefore impossible to adapt to daylight-saving time when doing date/time arithmetic across DST boundaries. (...) To address these difficulties, (...) PostgreSQL assumes your local time zone for any type containing only date or time." (From 8.5.3. Time Zones). The literal syntax it uses is also non-standard: WITH TIME ZONE is neither necessary nor allowed; instead, the result type should be derived from the literal string:

         17)The data type of a <timestamp literal> that does not specify
            <time zone interval> is TIMESTAMP(P), where P is the number of
            digits in <seconds fraction>, if specified, and 0 otherwise.
            The data type of a <timestamp literal> that specifies <time zone
            interval> is TIMESTAMP(P) WITH TIME ZONE, where P is the number
            of digits in <seconds fraction>, if specified, and 0 otherwise.

Without a non-standard time zone implementation, the distinction between days and 24 hours wouldn't be very useful. Also, while for Postgres the DATETIME + INTERVAL semantics are as above, for DATETIME - DATETIME the result assumes 24h = 1 day, which leads to confusing results like timestamp1 + (timestamp2 - timestamp1) <> timestamp2. The assumption holds for comparisons as well, BTW:

SELECT INTERVAL '24:01' HOUR TO MINUTE > INTERVAL '1' DAY;
true

SELECT INTERVAL '24' HOUR = INTERVAL '1' DAY;
true

Either way, the current proposal in the PR cannot represent Postgres intervals because of the small range for seconds and the potential mixing of months and seconds:

select interval '20' hour * 10e7 + interval '1' month;
0 years 1 mons 0 days 2000000000 hours 0 mins 0.00 secs

@julianhyde, it is true that INTERVAL '1' DAY and INTERVAL '24' HOUR are, per the standard, of distinct types even when equal. But this can typically be handled at the metadata level.

From the Parquet point of view, the most flexible way would be to have a parametrized interval type (like decimal) with a least-significant-field/seconds scale, i.e. one of: DAY, HOUR, MINUTE, SECOND, SECOND(1), ... SECOND(9). As this may be excessive implementation-wise for such a minor type, a middle ground would be to have at least DAY, MILLISECOND (SECOND(3)), and MICROSECOND (SECOND(6)) scales. This is made even easier with the new logical types PR.
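A minimal sketch of what such parametrized metadata could look like; all names here are hypothetical illustrations, not actual Parquet metadata:

```python
# Hypothetical sketch of a scale-parametrized interval type, analogous to
# decimal's scale: the parameter names the least significant stored field.
from enum import Enum

class IntervalScale(Enum):
    DAY = 0
    MILLISECOND = 3   # i.e. SECOND(3)
    MICROSECOND = 6   # i.e. SECOND(6)

def units_per_day(scale: IntervalScale) -> int:
    """How many stored units make up one day at the given scale."""
    if scale is IntervalScale.DAY:
        return 1
    return 86_400 * 10 ** scale.value

assert units_per_day(IntervalScale.MILLISECOND) == 86_400_000
assert units_per_day(IntervalScale.MICROSECOND) == 86_400_000_000
```

With the scale in metadata, readers can convert values between scales losslessly in the widening direction, much like decimal rescaling.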

michal-databricks avatar Aug 18 '17 09:08 michal-databricks

There's a lot of behavior to consider for processing engines here, but the conclusion I think we can draw from all of this information for Parquet is that engines may have semantics where 24 hours and 1 day are not equal, so Parquet should not force engines to consider them equal by storing a single value.

That means we're back where we started: we have two values to store for these types and need to choose a good way to store them. I don't see a downside to separate columns, since we can easily construct a single column in Arrow from them when deserializing.
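For reference, the existing Parquet INTERVAL annotation already keeps components separate: a 12-byte fixed_len_byte_array holding three unsigned little-endian 32-bit integers for months, days, and milliseconds. A sketch of that layout, which shows how keeping days and milliseconds distinct preserves the "28 hours" vs "1 day, 4 hours" distinction:

```python
# Sketch of the 12-byte (months, days, millis) interval layout:
# three unsigned little-endian 32-bit integers, as in the legacy annotation.
import struct

def encode_interval(months: int, days: int, millis: int) -> bytes:
    return struct.pack("<III", months, days, millis)

def decode_interval(raw: bytes):
    return struct.unpack("<III", raw)

# "28 hours" kept entirely in the millis field, no day component:
raw = encode_interval(0, 0, 28 * 3600 * 1000)
assert len(raw) == 12
assert decode_interval(raw) == (0, 0, 100_800_000)
# "1 day, 4 hours" is a different byte string, so nothing is lost:
assert encode_interval(0, 1, 4 * 3600 * 1000) != raw
```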

rdblue avatar Aug 18 '17 16:08 rdblue

So, Postgres diverges from the SQL standard and allows time zones that don't have fixed offsets from UTC. This seems useful, but as Jacques has noted, it leads to madness. So, let's stick to the standard.

And, if I read you correctly, Postgres allows intervals with a mixture of hours and months. Again, madness.

At a minimum, Arrow needs 2 interval types: INTERVAL MONTH and INTERVAL SECOND. Optionally, we could also have finer-grained types. For instance, INTERVAL HOUR TO SECOND would be compatible with INTERVAL SECOND but would use a format string of 'HH:MM:SS'.

I don't feel strongly whether Arrow has 2 interval types or goes for the full list in standard SQL: YEAR, YEAR_TO_MONTH, MONTH, DAY, DAY_TO_HOUR, DAY_TO_MINUTE, DAY_TO_SECOND, HOUR, HOUR_TO_MINUTE, HOUR_TO_SECOND, MINUTE, MINUTE_TO_SECOND, SECOND. Postgres even has a few more: QUARTER, MILLENNIUM, MILLISECOND etc.

I think a reasonable level of support in Arrow is precision and scale, e.g. DAY(5) TO SECOND(6), which implies that the value holds microsecond values between -8,640,000,000,000,000 and 8,640,000,000,000,000 (the number of microseconds in 100,000 days). Thus, behind the scenes, it looks like a fixed decimal type.
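Checking that arithmetic: DAY(5) means up to five digits of days (values below 100,000 days), and SECOND(6) means microsecond resolution:

```python
# DAY(5) TO SECOND(6): five digits of days at microsecond resolution.
US_PER_DAY = 86_400 * 10**6          # microseconds in one day

bound = 100_000 * US_PER_DAY
assert bound == 8_640_000_000_000_000   # matches the stated range
assert bound < 2**63 - 1                # fits in a signed 64-bit integer
```

So the bound quoted above fits comfortably in 64 bits, consistent with treating the type as a fixed-point count behind the scenes.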

julianhyde avatar Aug 18 '17 16:08 julianhyde

@julianhyde my thinking in my patch https://github.com/apache/arrow/pull/920/files was to add a time resolution to the DAY_TIME interval type to address resolutions from seconds down to nanoseconds. This matches up with the timedelta64[UNIT] type found in some other systems. I wasn't sure whether this would be adequate (4 different 64-bit interval types, plus a YEAR_MONTH type) -- it's appealing from a time arithmetic point of view (because differences of timestamps yield these interval values quite gracefully). Do we need a 16-byte interval type or is 8 bytes enough?
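A rough back-of-the-envelope check on the 8-byte question, computing the span in years that a signed 64-bit count covers at each proposed resolution:

```python
# Span (in years) representable by a signed 64-bit count at each resolution
# of the proposed DAY_TIME interval, timedelta64[UNIT]-style.
SECONDS_PER_YEAR = 365.25 * 86_400

for name, units_per_second in [("seconds", 1), ("milliseconds", 10**3),
                               ("microseconds", 10**6), ("nanoseconds", 10**9)]:
    years = (2**63 - 1) / units_per_second / SECONDS_PER_YEAR
    print(f"{name}: ~{years:,.0f} years")

# Even at nanosecond resolution, 64 bits spans roughly +/- 292 years,
# which supports the view that 8 bytes covers most practical use cases.
```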

wesm avatar Aug 28 '17 21:08 wesm

Do whatever you have for timestamp. If you only have an 8 byte timestamp type, an 8 byte interval will be enough. But if you have 16 byte timestamp, it will generate 16 byte values if you subtract them.

julianhyde avatar Aug 28 '17 22:08 julianhyde

Understood. Should it become necessary in the future, we can augment the metadata to accommodate additional precision. I think 8 bytes will be adequate for the vast majority of use cases for Arrow

wesm avatar Aug 28 '17 23:08 wesm

Hi everyone,

I've opened #165 to implement a LogicalType for the Interval. From looking at the conversations on this thread, it looks like the Arrow-related concerns have either been addressed, or might not be issues as we have stabilised on an Interval type in Arrow.

Please let me know if we're missing any details or concerns still to address. I'm keen to get Arrow <> Parquet support for intervals; I'm working on the Parquet Rust implementation over at the Arrow repo.

Thanks

nevi-me avatar Jan 23 '21 11:01 nevi-me