[SPARK-47424][SQL] Add getDatabaseCalendar method to the JdbcDialect
What changes were proposed in this pull request?
Adds a new dialect method that returns the calendar to be used by getTimestamp and setTimestamp in JdbcUtils.scala. By default, it returns a calendar in the Spark local session time zone; a config can turn this off, in which case it returns null. Also, the H2 driver's get/setTimestamp implementation is broken in that it does the calculation incorrectly when a calendar is passed, so the H2 dialect simply overrides this method and returns a null calendar.
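A minimal sketch of the proposed hook; the exact signature and the config wiring are assumptions based on the description above, not the merged code:

```scala
import java.util.{Calendar, TimeZone}

abstract class JdbcDialect {
  // Calendar handed to ResultSet.getTimestamp / PreparedStatement.setTimestamp
  // in JdbcUtils; null means "let the driver use the JVM default time zone".
  def getDatabaseCalendar(sessionLocalTimeZone: String): Calendar =
    Calendar.getInstance(TimeZone.getTimeZone(sessionLocalTimeZone))
}

// H2's driver miscomputes timestamps when an explicit Calendar is passed,
// so the H2 dialect opts out by returning null.
object H2Dialect extends JdbcDialect {
  override def getDatabaseCalendar(sessionLocalTimeZone: String): Calendar = null
}
```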
Why are the changes needed?
It turns out that users see the same 'string values' of timestamps as stored in the remote data source only if the JVM time zone and the Spark local session time zone are equal. If users change the JVM time zone (which is possible) or the Spark local session time zone, they will no longer see the same results. Therefore, when calling getTimestamp and setTimestamp, a calendar with the Spark local session time zone is passed. This is the right behavior for some data sources, and those dialects can override the getDatabaseCalendar method. If the change is risky, it is advised to guard the overridden method behind a feature flag.
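As a rough illustration of the reader/writer paths, JdbcUtils would consult the dialect and only pass a Calendar when one is returned; the helper names below are hypothetical, not the actual Spark code:

```scala
import java.sql.{PreparedStatement, ResultSet, Timestamp}
import java.util.Calendar

// Hypothetical helpers; the real logic lives in JdbcUtils.scala.
def getTs(rs: ResultSet, pos: Int, cal: Calendar): Timestamp =
  if (cal != null) rs.getTimestamp(pos, cal) else rs.getTimestamp(pos)

def setTs(stmt: PreparedStatement, pos: Int, ts: Timestamp, cal: Calendar): Unit =
  if (cal != null) stmt.setTimestamp(pos, ts, cal) else stmt.setTimestamp(pos, ts)
```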
Does this PR introduce any user-facing change?
Only if a dialect overrides the getDatabaseCalendar method. Right now, no dialect does so, so this is not a user-facing change.
How was this patch tested?
Was this patch authored or co-authored using generative AI tooling?
No
cc @yaooqinn
Can we elaborate a bit more about the issue here? I'm not sure whether we see a wrong string representation of a timestamp or we get/set a wrong timestamp
@yaooqinn When I mention a timestamp, I mean a timestamp without time zone information. Let's say both the JVM time zone and the Spark local session time zone are PST. If a timestamp literal in the remote data source is, for example, "2020-02-02 02:00:00" and you read it via the getTimestamp JDBC API call, it will use the default calendar and load some value. The read timestamp will be "2020-02-02 02:00:00-08", and when shown to the user (via df.show(), for example) it will be displayed as "2020-02-02 02:00:00", the same as in the data source. This case is fine.
Now, let's say the JVM time zone is PST but the Spark local session time zone is UTC. The same timestamp is in the data source, and when read it will be "2020-02-02 02:00:00-08"; but that is equal to "2020-02-02 10:00:00+00", so "2020-02-02 10:00:00" will be shown to the user, which is not the same as in the data source. This is problematic.
Keep in mind that we map an NTZ timestamp from the data source to TimestampType in Spark (which has time zone information), and because of this we are also reading/writing wrong values (different epochs), but that is out of scope for this PR. This PR only fixes what the user sees. So the problem is that the user sees the wrong timestamp string representation when the JVM time zone and the Spark local session time zone do not match; see the sketch below.
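To make the mismatch concrete, here is a self-contained illustration (plain JDK classes, not Spark code) of how the same NTZ wall-clock string maps to different epochs depending on the interpreting time zone:

```scala
import java.text.SimpleDateFormat
import java.util.TimeZone

val fmt = new SimpleDateFormat("yyyy-MM-dd HH:mm:ss")

fmt.setTimeZone(TimeZone.getTimeZone("America/Los_Angeles"))
val pstEpoch = fmt.parse("2020-02-02 02:00:00").getTime // 02:00 PST = 10:00 UTC

fmt.setTimeZone(TimeZone.getTimeZone("UTC"))
val utcEpoch = fmt.parse("2020-02-02 02:00:00").getTime

// The two interpretations differ by exactly the PST offset (8 hours), which
// is why rendering the PST-parsed value in UTC shows "2020-02-02 10:00:00".
assert(pstEpoch - utcEpoch == 8L * 3600 * 1000)
```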
@PetarVasiljevic-DB You mention the JVM time zone in the PR's description, but all tests run in the same default JVM time zone. Could you please try to change it via
withDefaultTimeZone()
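A sketch of the kind of test being asked for: run the JDBC round-trip under a JVM default time zone that differs from the Spark session time zone. withDefaultTimeZone lives in org.apache.spark.sql.catalyst.util.DateTimeTestUtils; withSQLConf and checkAnswer come from Spark's test mixins, and jdbcUrl / ts_table are hypothetical placeholders:

```scala
import java.time.ZoneId
import org.apache.spark.sql.Row
import org.apache.spark.sql.catalyst.util.DateTimeTestUtils.withDefaultTimeZone

// JVM default tz (PST) deliberately differs from the session tz (UTC),
// exercising the mismatch described above.
withDefaultTimeZone(ZoneId.of("America/Los_Angeles")) {
  withSQLConf("spark.sql.session.timeZone" -> "UTC") {
    checkAnswer(
      spark.read.jdbc(jdbcUrl, "ts_table", new java.util.Properties),
      Row(java.sql.Timestamp.valueOf("2020-02-02 02:00:00")))
  }
}
```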
@MaxGekk Done, please note my comment on the change.
We're closing this PR because it hasn't been updated in a while. This isn't a judgement on the merit of the PR in any way. It's just a way of keeping the PR queue manageable. If you'd like to revive this PR, please reopen it and ask a committer to remove the Stale tag!