pinot
pinot copied to clipboard
support TIMESTAMP WITH TIME ZONE type
This PR supports TIMESTAMP WITH TIME ZONE type in Pinot modeled after #6719. All specifications are intended to follow https://www.postgresql.org/docs/current/datatype-datetime.html
The change required the following steps:
- add support to Calcite because it isn't yet supported (we will likely contribute this back, see https://issues.apache.org/jira/browse/CALCITE-5301)
- change usage of
TimestamptoLocalDateTimeso that we can introduce the zoned counterpart,OffsetDateTime - propagate the change through the code base
- introduce
PinotJsonTimeModuleto support JSON SerDe with the semantics that we want - See
TimestampUtilfor the custom handling of timestamps5. - See
TimestampQueriesTestfor some end-to-end examples of usage
Additional manual testing done:
I followed https://docs.pinot.apache.org/basics/getting-started/pushing-your-data-to-pinot but instead added timestampTz column.
studentID,firstName,lastName,gender,subject,score,timestampInEpoch,timestampTz
200,Lucy,Smith,Female,Maths,3.8,1570863600000,2019-10-12T07:00:00.000Z
200,Lucy,Smith,Female,English,3.5,1571036400000,2019-10-11T23:00:00.000-08:00
201,Bob,King,Male,Maths,3.2,1571900400000,2019-10-24T00:00:00.000Z
202,Nick,Young,Male,Physics,3.6,1572418800000,2019-10-30T07:00:00.000Z
Here's the change to the schema file:
"dateTimeFieldSpecs": [{
"name": "timestampInEpoch",
"dataType": "LONG",
"format" : "1:MILLISECONDS:EPOCH",
"granularity": "1:MILLISECONDS"
},{
"name": "timestampTz",
"dataType": "TIMESTAMP_WITH_TIME_ZONE",
"format" : "1:MILLISECONDS:EPOCH",
"granularity": "1:MILLISECONDS"
}]
And get the following results (notice that Lucy was correctly converted from offset -8:00 to Z)
select * from transcript limit 10;
{"resultTable":
{"dataSchema":
{
"columnNames":["firstName","gender","lastName","score","studentID","subject","timestampInEpoch","timestampTz"],
"columnDataTypes":["STRING","STRING","STRING","FLOAT","INT","STRING","LONG","TIMESTAMP_WITH_TIME_ZONE"]},
"rows":
[["Lucy","Female","Smith",3.8,200,"Maths",1570863600000,"2019-10-12 07:00:00Z"],
["Lucy","Female","Smith",3.5,200,"English",1571036400000,"2019-10-12 07:00:00Z"],
["Bob","Male","King",3.2,201,"Maths",1571900400000,"2019-10-24 00:00:00Z"],
["Nick","Male","Young",3.6,202,"Physics",1572418800000,"2019-10-30 07:00:00Z"]]
}
}
}
Note that the following are still TODO:
- a few more pointed test cases
- conversion between
TIMESTAMP WITH TIME ZONEandTIMESTAMP WITHOUT TIME ZONEwill be done as part of #9480 - we will need a follow up to allow registering functions with the same name but different input types in the scalar function registry
Codecov Report
Merging #9546 (b07c516) into master (1c202c7) will increase coverage by
7.06%. The diff coverage is52.27%.
@@ Coverage Diff @@
## master #9546 +/- ##
============================================
+ Coverage 62.86% 69.92% +7.06%
- Complexity 5073 5214 +141
============================================
Files 1917 1933 +16
Lines 102523 103328 +805
Branches 15561 15668 +107
============================================
+ Hits 64449 72252 +7803
+ Misses 33257 25996 -7261
- Partials 4817 5080 +263
| Flag | Coverage Δ | |
|---|---|---|
| integration1 | 25.98% <6.43%> (-0.04%) |
:arrow_down: |
| integration2 | 24.53% <7.57%> (-0.24%) |
:arrow_down: |
| unittests1 | 67.24% <51.51%> (-0.07%) |
:arrow_down: |
| unittests2 | 15.62% <0.00%> (?) |
Flags with carried forward coverage won't be shown. Click here to find out more.
| Impacted Files | Coverage Δ | |
|---|---|---|
| ...ter/predicate/EqualsPredicateEvaluatorFactory.java | 77.00% <0.00%> (-1.58%) |
:arrow_down: |
| .../filter/predicate/InPredicateEvaluatorFactory.java | 74.43% <ø> (-4.52%) |
:arrow_down: |
| .../predicate/NotEqualsPredicateEvaluatorFactory.java | 72.80% <0.00%> (-0.41%) |
:arrow_down: |
| ...lter/predicate/NotInPredicateEvaluatorFactory.java | 78.16% <ø> (ø) |
|
| ...core/operator/filter/predicate/PredicateUtils.java | 64.70% <0.00%> (-3.48%) |
:arrow_down: |
| ...lter/predicate/RangePredicateEvaluatorFactory.java | 85.26% <ø> (ø) |
|
| ...m/function/JsonExtractScalarTransformFunction.java | 54.06% <ø> (ø) |
|
| ...not/core/query/reduce/GroupByDataTableReducer.java | 81.53% <0.00%> (-0.37%) |
:arrow_down: |
| ...rc/main/java/org/apache/pinot/spi/data/Schema.java | 75.36% <ø> (ø) |
|
| ...r/transform/function/LiteralTransformFunction.java | 64.60% <16.66%> (-4.63%) |
:arrow_down: |
| ... and 358 more |
:mega: We’re building smart automated test selection to slash your CI/CD build times. Learn more
possibly related: https://github.com/apache/pinot/issues/8045
This is a great feature! I'd suggest starting a high level design on all the date and time related types, and their behaviors so that we are on the same page when implementing other types.
One main thing I want to discuss is whether to carry the time zone info with the value. Seems PostgreSQL doesn't carry time zone in timestamp with time zone, but does carry time zone in time with time zone which is quite confusing. Presto does carry the time zone with the value.
One benefit of carrying the time zone with the value is that we can re-construct the same value using the data_time_format and preserve the time zone info from the original input value, so I'm leaning towards the Presto convention.
cc @kishoreg @xiangfu0 @walterddr @siddharthteotia for further discussion
This is a great feature! I'd suggest starting a high level design on all the date and time related types, and their behaviors so that we are on the same page when implementing other types.
my vote is to follow postgres (date/time)
One main thing I want to discuss is whether to carry the time zone info with the value. Seems PostgreSQL doesn't carry time zone in
timestamp with time zone, but does carry time zone intime with time zonewhich is quite confusing. Presto does carry the time zone with the value.
the above observation is incorrect
- postgres doesn't STORE timezone info, neither
timestampnortimetype. it always stored as UTC. - however postgres does carry timezone while PROCESSING. this depends on:
- with
AT TIME ZONEandWITH TIME ZONEthe result will be converted to the desired timezone - if no timezone is used in query, query engine wide configuration can be used to set for whether to convert all timestamp to local time, or without conversion and use timestamp at UTC as is.
- with
One benefit of carrying the time zone with the value is that we can re-construct the same value using the data_time_format and preserve the time zone info from the original input value, so I'm leaning towards the Presto convention.
let's clarify on what the intent of "carrying" the timezone means, especially do we wan to support at storage time or compute time.
@walterddr @Jackie-Jiang I did a survey of other systems and put the results in this doc: https://docs.google.com/document/d/1Dvz7NJ2eliIFm6O1Ra4Azs_WFWR98kHOyDCznIwaXow/edit#
My vote is also to follow the Postgres standard, though @Jackie-Jiang is right about the way that time with time zone is handled (the time zone is actually stored alongside the data).
I'll loop in others on the doc and we can continue the discussion there.
@walterddr @Jackie-Jiang I did a survey of other systems and put the results in this doc: https://docs.google.com/document/d/1Dvz7NJ2eliIFm6O1Ra4Azs_WFWR98kHOyDCznIwaXow/edit#
My vote is also to follow the Postgres standard, though @Jackie-Jiang is right about the way that
time with time zoneis handled (the time zone is actually stored alongside the data).I'll loop in others on the doc and we can continue the discussion there.
discussed offline and yet @Jackie-Jiang and @agavra is right. postgres doesn't carry timezone info during computation either, continuing discussion in https://github.com/apache/pinot/issues/9622