pinot icon indicating copy to clipboard operation
pinot copied to clipboard

support TIMESTAMP WITH TIME ZONE type

Open agavra opened this issue 3 years ago • 6 comments

This PR supports TIMESTAMP WITH TIME ZONE type in Pinot modeled after #6719. All specifications are intended to follow https://www.postgresql.org/docs/current/datatype-datetime.html

The change required the following steps:

  1. add support to Calcite because it isn't yet supported (we will likely contribute this back, see https://issues.apache.org/jira/browse/CALCITE-5301)
  2. change usage of Timestamp to LocalDateTime so that we can introduce the zoned counterpart, OffsetDateTime
  3. propagate the change through the code base
  4. introduce PinotJsonTimeModule to support JSON SerDe with the semantics that we want
  5. See TimestampUtil for the custom handling of timestamps5.
  6. See TimestampQueriesTest for some end-to-end examples of usage

Additional manual testing done:

I followed https://docs.pinot.apache.org/basics/getting-started/pushing-your-data-to-pinot but instead added timestampTz column.

studentID,firstName,lastName,gender,subject,score,timestampInEpoch,timestampTz
200,Lucy,Smith,Female,Maths,3.8,1570863600000,2019-10-12T07:00:00.000Z
200,Lucy,Smith,Female,English,3.5,1571036400000,2019-10-11T23:00:00.000-08:00
201,Bob,King,Male,Maths,3.2,1571900400000,2019-10-24T00:00:00.000Z
202,Nick,Young,Male,Physics,3.6,1572418800000,2019-10-30T07:00:00.000Z

Here's the change to the schema file:

  "dateTimeFieldSpecs": [{
    "name": "timestampInEpoch",
    "dataType": "LONG",
    "format" : "1:MILLISECONDS:EPOCH",
    "granularity": "1:MILLISECONDS"
  },{
    "name": "timestampTz",
    "dataType": "TIMESTAMP_WITH_TIME_ZONE",
    "format" : "1:MILLISECONDS:EPOCH",
    "granularity": "1:MILLISECONDS"
  }]

And get the following results (notice that Lucy was correctly converted from offset -8:00 to Z)

select * from transcript limit 10;
{"resultTable":
  {"dataSchema":
    {
      "columnNames":["firstName","gender","lastName","score","studentID","subject","timestampInEpoch","timestampTz"],
      "columnDataTypes":["STRING","STRING","STRING","FLOAT","INT","STRING","LONG","TIMESTAMP_WITH_TIME_ZONE"]},
      "rows":
        [["Lucy","Female","Smith",3.8,200,"Maths",1570863600000,"2019-10-12 07:00:00Z"],
        ["Lucy","Female","Smith",3.5,200,"English",1571036400000,"2019-10-12 07:00:00Z"],
        ["Bob","Male","King",3.2,201,"Maths",1571900400000,"2019-10-24 00:00:00Z"],
        ["Nick","Male","Young",3.6,202,"Physics",1572418800000,"2019-10-30 07:00:00Z"]]
    }
  }
}

Note that the following are still TODO:

  • a few more pointed test cases
  • conversion between TIMESTAMP WITH TIME ZONE and TIMESTAMP WITHOUT TIME ZONE will be done as part of #9480
  • we will need a follow up to allow registering functions with the same name but different input types in the scalar function registry

agavra avatar Oct 06 '22 22:10 agavra

Codecov Report

Merging #9546 (b07c516) into master (1c202c7) will increase coverage by 7.06%. The diff coverage is 52.27%.

@@             Coverage Diff              @@
##             master    #9546      +/-   ##
============================================
+ Coverage     62.86%   69.92%   +7.06%     
- Complexity     5073     5214     +141     
============================================
  Files          1917     1933      +16     
  Lines        102523   103328     +805     
  Branches      15561    15668     +107     
============================================
+ Hits          64449    72252    +7803     
+ Misses        33257    25996    -7261     
- Partials       4817     5080     +263     
Flag Coverage Δ
integration1 25.98% <6.43%> (-0.04%) :arrow_down:
integration2 24.53% <7.57%> (-0.24%) :arrow_down:
unittests1 67.24% <51.51%> (-0.07%) :arrow_down:
unittests2 15.62% <0.00%> (?)

Flags with carried forward coverage won't be shown. Click here to find out more.

Impacted Files Coverage Δ
...ter/predicate/EqualsPredicateEvaluatorFactory.java 77.00% <0.00%> (-1.58%) :arrow_down:
.../filter/predicate/InPredicateEvaluatorFactory.java 74.43% <ø> (-4.52%) :arrow_down:
.../predicate/NotEqualsPredicateEvaluatorFactory.java 72.80% <0.00%> (-0.41%) :arrow_down:
...lter/predicate/NotInPredicateEvaluatorFactory.java 78.16% <ø> (ø)
...core/operator/filter/predicate/PredicateUtils.java 64.70% <0.00%> (-3.48%) :arrow_down:
...lter/predicate/RangePredicateEvaluatorFactory.java 85.26% <ø> (ø)
...m/function/JsonExtractScalarTransformFunction.java 54.06% <ø> (ø)
...not/core/query/reduce/GroupByDataTableReducer.java 81.53% <0.00%> (-0.37%) :arrow_down:
...rc/main/java/org/apache/pinot/spi/data/Schema.java 75.36% <ø> (ø)
...r/transform/function/LiteralTransformFunction.java 64.60% <16.66%> (-4.63%) :arrow_down:
... and 358 more

:mega: We’re building smart automated test selection to slash your CI/CD build times. Learn more

codecov-commenter avatar Oct 11 '22 23:10 codecov-commenter

possibly related: https://github.com/apache/pinot/issues/8045

walterddr avatar Oct 13 '22 01:10 walterddr

This is a great feature! I'd suggest starting a high level design on all the date and time related types, and their behaviors so that we are on the same page when implementing other types.

One main thing I want to discuss is whether to carry the time zone info with the value. Seems PostgreSQL doesn't carry time zone in timestamp with time zone, but does carry time zone in time with time zone which is quite confusing. Presto does carry the time zone with the value.

One benefit of carrying the time zone with the value is that we can re-construct the same value using the data_time_format and preserve the time zone info from the original input value, so I'm leaning towards the Presto convention.

cc @kishoreg @xiangfu0 @walterddr @siddharthteotia for further discussion

Jackie-Jiang avatar Oct 17 '22 22:10 Jackie-Jiang

This is a great feature! I'd suggest starting a high level design on all the date and time related types, and their behaviors so that we are on the same page when implementing other types.

my vote is to follow postgres (date/time)

One main thing I want to discuss is whether to carry the time zone info with the value. Seems PostgreSQL doesn't carry time zone in timestamp with time zone, but does carry time zone in time with time zone which is quite confusing. Presto does carry the time zone with the value.

the above observation is incorrect

  • postgres doesn't STORE timezone info, neither timestamp nor time type. it always stored as UTC.
  • however postgres does carry timezone while PROCESSING. this depends on:
    • with AT TIME ZONE and WITH TIME ZONE the result will be converted to the desired timezone
    • if no timezone is used in query, query engine wide configuration can be used to set for whether to convert all timestamp to local time, or without conversion and use timestamp at UTC as is.

One benefit of carrying the time zone with the value is that we can re-construct the same value using the data_time_format and preserve the time zone info from the original input value, so I'm leaning towards the Presto convention.

let's clarify on what the intent of "carrying" the timezone means, especially do we wan to support at storage time or compute time.

walterddr avatar Oct 18 '22 15:10 walterddr

@walterddr @Jackie-Jiang I did a survey of other systems and put the results in this doc: https://docs.google.com/document/d/1Dvz7NJ2eliIFm6O1Ra4Azs_WFWR98kHOyDCznIwaXow/edit#

My vote is also to follow the Postgres standard, though @Jackie-Jiang is right about the way that time with time zone is handled (the time zone is actually stored alongside the data).

I'll loop in others on the doc and we can continue the discussion there.

agavra avatar Oct 18 '22 18:10 agavra

@walterddr @Jackie-Jiang I did a survey of other systems and put the results in this doc: https://docs.google.com/document/d/1Dvz7NJ2eliIFm6O1Ra4Azs_WFWR98kHOyDCznIwaXow/edit#

My vote is also to follow the Postgres standard, though @Jackie-Jiang is right about the way that time with time zone is handled (the time zone is actually stored alongside the data).

I'll loop in others on the doc and we can continue the discussion there.

discussed offline and yet @Jackie-Jiang and @agavra is right. postgres doesn't carry timezone info during computation either, continuing discussion in https://github.com/apache/pinot/issues/9622

walterddr avatar Oct 18 '22 21:10 walterddr