velox icon indicating copy to clipboard operation
velox copied to clipboard

Comparisons for values of logical types are not handled correctly throughout the library

Open mbasmanova opened this issue 7 months ago • 7 comments

Bug description

TIMESTAMP WITH TIME ZONE logical type is backed by BIGINT physical type. The timestamp values are stored in memory as 64-bit integers using an encoding that doesn't allow for direct comparisons of these integers.

https://facebookincubator.github.io/velox/develop/timestamp.html

TimestampWithTimezone physically packs two integers in a single 64 word, using 12 bits for timezone ID, and 52 bits for a millisecond-precision timestamp.

However, may places in the core engine are applying equality and comparisons to the physical value without considering its logical semantics.

One example, is aggregation with grouping keys of type TIMESTAMP WITH TIME ZONE returns incorrect result. '2024-04-10 10:00 America/New_York' and '2024-04-10 07:00 America/Los_Angeles' represent the same timestamp, but appear as different groups in aggregation results:

  auto data = makeRowVector({
      makeFlatVector<std::string>({
          "2024-04-10 10:00 America/New_York",
          "2024-04-10 07:00 America/Los_Angeles",
      }),
  });

  auto plan = PlanBuilder()
                  .values({data})
                  .project({"cast(c0 as timestamp with time zone)"})
                  .singleAggregation({"p0"}, {"count(1)"})
                  .project({"cast(p0 as varchar)", "a0"})
                  .planNode();

  auto results = AssertQueryBuilder(plan).copyResults(pool());
  LOG(ERROR) << results->toString();
  LOG(ERROR) << results->toString(0, 10);


[ROW ROW<p0:VARCHAR,a0:BIGINT>: 2 elements, no nulls]

0: {2024-04-10 10:00:00.000 America/New_York, 1}
1: {2024-04-10 07:00:00.000 America/Los_Angeles, 1}

In Presto,

presto:di> select cast(x as timestamp with time zone), count(1) from unnest(array['2024-04-10 07:00 America/Los_Angeles', '2024-04-10 10:00 America/New_York']) as t(x) group by 1;
                    _col0                    | _col1
---------------------------------------------+-------
 2024-04-10 07:00:00.000 America/Los_Angeles |     2
(1 row)

presto:di> select x, count(1) from unnest(array['2024-04-10 07:00 America/Los_Angeles', '2024-04-10 10:00 America/New_York']) as t(x) group by 1;
                  x                   | _col1
--------------------------------------+-------
 2024-04-10 07:00 America/Los_Angeles |     1
 2024-04-10 10:00 America/New_York    |     1
(2 rows)

All operators that perform comparisons are affected by this issue, e.g. Aggregation, Join, OrderBy.

CC: @kgpai @Yuhta @bikramSingh91 @kagamiori @amitkdutta

System information

n/a

Relevant logs

No response

mbasmanova avatar Jun 27 '24 13:06 mbasmanova