cudf icon indicating copy to clipboard operation
cudf copied to clipboard

[FEA] Timezone conversions for timestamps

Open jlowe opened this issue 5 years ago • 9 comments

Is your feature request related to a problem? Please describe. Apache Spark internally tracks timestamps relative to the epoch in the JVM's default timezone. cudf uses timestamps relative to the epoch in the UTC timezone (at least in the ORC loader). When Spark tries to interpret timestamp data created/loaded by cudf the values are incorrect (from Spark's viewpoint) unless the JVM's default timezone is UTC.

Describe the solution you'd like A timezone conversion utility function for timestamps in libcudf would help, as the timestamp column could be converted from UTC to the JVM's timezone. The utility function would take an input timestamp column, a source timezone and a destination timezone, and would return a new timestamp column with the timestamp values adjusted by the time delta difference between the two timezones at the source timestamp's time. Note that due to effects like daylight savings, the delta between timezones is not necessarily constant and therefore this utility method can't be obviated by a simple scalar addition on the timestamp column data.

Describe alternatives you've considered Adding an optional, target timezone parameter to the various cudf data loaders (like the ORC reader) could work as well and potentially be more efficient if it avoids the separate CUDA kernel to do the conversion from UTC. However the utility method approach seems like a more flexible approach to cover cases where the timestamp data source didn't originate from a cudf data loader.

jlowe avatar Aug 06 '19 20:08 jlowe

Is this specific to ORC ? ORC has the concept of a writer timezone, and we actually convert from the writer timezone to UTC when loading, so if the intent is to keep the native timezone of the ORC file, it could be as simple as a flag to bypass the existing timezone conversion. If it's a generic target timezone conversion, it should be relatively easy to do it in the ORC loader (converting to a particular timezone isn't much different than the existing conversion to UTC). For other loaders that are natively UTC, it would be preferable as a standalone utility method.

OlivierNV avatar Aug 07 '19 03:08 OlivierNV

It's not specific to ORC. The intent is not to preserve the timezone of the file but rather to target a timezone that's specified by the caller. ORC is already doing timezone conversions, and that's why I mention it here. It would be slightly more efficient to have ORC directly convert to a specified timezone rather than convert to UTC and then the caller converts to another timezone.

I agree that a standalone utility method would be best if we're only going to have one way to rebase timestamps. That can apply to the CSV and Parquet loaders which don't do any timezone conversions today.

The utility would also provide a way to convert Spark timestamps back to UTC. This would probably be needed when cudf supports writing CSV, Parquet, and ORC, as I would expect those writers to assume all timestamps are UTC. Spark would need to convert timestamps relative to the JVM timezone to UTC for cudf writing unless the writers supported a parameter specifying the timestamp timezone reference.

jlowe avatar Aug 07 '19 13:08 jlowe

Timezone conversions are super-messy, so I'd rather limit the impact of timezones (it also tends to be platform-specific since we need to parse timezone files from the OS to build conversion tables from/to UTC). Compared to the column loader, a standalone timezone conversion should be very fast since it's fully data-parallel (internally, timezoneA->timezoneB would most likely be implemented as tsB = fromUTC(toUTC(tsA, A), B). If the goal is an arbitrary target timezone, I think an in-place standalone conversion would be the way to go.

OlivierNV avatar Aug 08 '19 03:08 OlivierNV

Since we are using libcu++ for our timestamp types, we should be using the C++20 timezone APIs that libcu++ will provide. CC @trxcllnt.

jrhemstad avatar Dec 04 '19 22:12 jrhemstad

@brycelelbach does libcu++ have plans to load the system timezone database? I recall @ogiroux saying that might not be supported yet.

trxcllnt avatar Dec 05 '19 01:12 trxcllnt

@brycelelbach promised me they would add it for us :)

jrhemstad avatar Dec 05 '19 01:12 jrhemstad

Any update on whether libcu++ will load the timezone database?

sameerz avatar Nov 12 '20 22:11 sameerz

Please file an RFE on github.com/NVIDIA/libcudacxx. It's not on our immediate roadmap, and slightly tricky, but we'll get to it eventually, depending on how urgently you need it.

brycelelbach avatar Nov 12 '20 23:11 brycelelbach

https://github.com/NVIDIA/cccl/issues/954

jrhemstad avatar Nov 13 '20 19:11 jrhemstad