spark-rapids icon indicating copy to clipboard operation
spark-rapids copied to clipboard

[FEA] Support China Standard Timezone(CST = UTC + 8) on GPU

Open viadea opened this issue 3 years ago • 3 comments

I wish we can support China Standard Timezone(CST = UTC + 8) on GPU.

Say server timezone is CST and user do not need to explicitly set user.timezome=UTC. As a result, the timestamp related function on CST can run on GPU and also the spark logs' timezone is shown in CST.

viadea avatar Sep 23 '22 03:09 viadea

We had an initial discussion with a developer in CUDF on how we might be able to support time zone transitions. He has a proof of concept that is not fully done up at https://github.com/rapidsai/cudf/pull/11872

Generally the patch needs a small database that holds the time zone transitions and can apply them to timestamps to convert to/from UTC.

The patch asks for the following columns.

  • zone_name name of the time zone (how it will be looked up)
  • country_code the country this time zone is a part of
  • abbreviation an short version of the name that is being transitioned to.
  • time_start seconds since the epoch when this operation takes effect
  • gmt_offset the new offset in seconds from UTC/GMT after the transition
  • dstappears to be a "1" if going to daylight savings time, else a 0

The patch has a database that they dumped themselves, and long term have plans to read the database directly from the standard Linux location. But we are concerned about the database being different compared to the one in java, both in terms of when/how it is updated and also in terms of localization of the names.

The plan right now is to dump as much of the database from java as possible using the java.time.ZoneIdclass. We can get a list of all zones from this class if neededgetAvailableZoneIdsor for a singleZoneId` as provided by Spark for many expressions or just as the current timezone.

Here we can get zone_name by calling getId. But we might need to think more about localization and what not when we parse strings timestamps, like with CSV and/or JSON. We also need to think about how the timezone is stored in ORC so we can make sure that we can support that as well.

I could not find any API to get country_code. Even though a timezone is often associated with a country java just does not expose this information. I also could not find any way to get abbreviation out. I don't think that these are strictly needed, but we will need to see. Some of it may be related to the local and once we know that we might be able to use it to get out what we need/want.

The rest of the columns come from specific rules to get to the rules for a ZoneId you call getRules. Under that there are two types of transition rules. One that is for set known transitions getTransitions and the other is for a set of rules that can be used to generate other transitions getTransitionRules. The rules can be used to generate specific transitions for a given year, so we can concentrate on just the ZoneOffsetTransition class that is returned as a part of a list from getTransitions, but we can run into some problems here, because we have to guess how many years to compute future transitions for. The example database in CUDF goes to the year 2499. For java we can go much further. Technically for TIMESTAMP_MICROSECONDS, which is what Spark uses for timestamps, we could go out to the year 586,912. It could be doable to go that far when generating the database, but I am not 100% sure we want to. But if we don't we would need to play games with looking for the maximum timestamp/etc and then add in the rules to that point.

  • time_start can come from .toEpochSecond
  • gmt_offset can come from .getOffsetAfter.getTotalSeconds
  • dst can come from .isGap

One thing to note is that we might need to also use .getOffsetBefore on the first transition to get the starting point for the first transition in the CUDF database. This is not an explicit transition, it is just what happened before we started to have rules (LMT). Not sure where the start time for LMT in their database comes from though, so not sure how to come up with something there.

This is just a first pass at looking at the APIs we will need to work with them to understand exactly what they need and why so we can work on a final solution.

revans2 avatar Oct 07 '22 15:10 revans2

Reading the prototype code it is actually very simple. A time zone table that is specific to a given time zone is sorted by the time when a transition occurs. When translating to UTC from a given timezone the occurrence time a lower bound (or upper bound I need to think about it more) is used against the occurrence time + utc_offset to get an index into the time period when this rule would apply. Then the utc_offset is subtracted from the current timestamp to put in in UTC.

The opposite is done when going from UTC to a specific time zone. We could do all of this today, or we could write out own implementation that knows how to use a transition rule, so we don't have to worry about not having generated enough lines.

revans2 avatar Oct 11 '22 19:10 revans2

Note that with timezones using daylight savings or other similar discontinuities where time can "roll back" there can be ambiguous mappings from a timestamp in those timezones to UTC during the rollback windows. For example, the timestamp 2022-11-06 01:30:00 in the US/Central timezone has an offset to UTC of either 5 or 6, as the time occurs twice in that timezone exactly an hour apart because of the daylight savings rollback of an hour that occurs at 2AM on that day.

A simple experiment with to_utc_timestamp shows that Spark is picking the earlier time in this ambiguous case, e.g.:

scala> val df = Seq("2022-11-06T00:30:00", "2022-11-06T01:00:00", "2022-11-06T01:30:00", "2022-11-06T02:00:00", "2022-11-06T02:30:00").toDF("ts")
df: org.apache.spark.sql.DataFrame = [ts: string]

scala> df.selectExpr("to_utc_timestamp(ts, 'US/Central') as tu").show
+-------------------+
|                 tu|
+-------------------+
|2022-11-06 05:30:00|
|2022-11-06 06:00:00|
|2022-11-06 06:30:00|
|2022-11-06 08:00:00|
|2022-11-06 08:30:00|
+-------------------+

jlowe avatar Oct 12 '22 18:10 jlowe