spark-rapids
[FEA] Implement kernel to support non-UTC time zone for LEGACY mode.
Describe the bug
This is a failing non-UTC test in 24.10, first seen in rapids_it-non-utc-pre_release, run:123:
```
FAILED ../../src/main/python/date_time_test.py::test_yyyyMMdd_format_for_legacy_mode[DATAGEN_SEED=1728130368, TZ=Africa/Casablanca] - AssertionError: GPU and CPU int values are different at [14, 'unix_timestamp(a, yyyyMMdd)']
cpu = 231388876800, gpu = 231388873200
```

The failing test:

```python
@pytest.mark.skipif(not is_supported_time_zone(), reason="not all time zones are supported now, refer to https://github.com/NVIDIA/spark-rapids/issues/6839, please update after all time zones are supported")
# Test years after 1900, refer to issues: https://github.com/NVIDIA/spark-rapids/issues/11543, https://github.com/NVIDIA/spark-rapids/issues/11539
def test_yyyyMMdd_format_for_legacy_mode():
    gen = StringGen('(19[0-9]{2}|[2-9][0-9]{3})([0-9]{4})')
    assert_gpu_and_cpu_are_equal_sql(
        lambda spark: unary_op_df(spark, gen),
        "tab",
        '''select unix_timestamp(a, 'yyyyMMdd'),
                  from_unixtime(unix_timestamp(a, 'yyyyMMdd'), 'yyyyMMdd'),
                  date_format(to_timestamp(a, 'yyyyMMdd'), 'yyyyMMdd')
           from tab
        ''',
        {'spark.sql.legacy.timeParserPolicy': 'LEGACY',
         'spark.rapids.sql.incompatibleDateFormats.enabled': True})
```

The assertion fails at `../../src/main/python/date_time_test.py:466`:

```
../../src/main/python/asserts.py:641: in assert_gpu_and_cpu_are_equal_sql
    assert_gpu_and_cpu_are_equal_collect(do_it_all, conf, is_cpu_first=is_cpu_first)
../../src/main/python/asserts.py:599: in assert_gpu_and_cpu_are_equal_collect
    _assert_gpu_and_cpu_are_equal(func, 'COLLECT', conf=conf, is_cpu_first=is_cpu_first, result_canonicalize_func_before_compare=result_canonicalize_func_before_compare)
../../src/main/python/asserts.py:521: in _assert_gpu_and_cpu_are_equal
    assert_equal(from_cpu, from_gpu)
../../src/main/python/asserts.py:111: in assert_equal
    _assert_equal(cpu, gpu, float_check=get_float_check(), path=[])
../../src/main/python/asserts.py:43: in _assert_equal
    _assert_equal(cpu[index], gpu[index], float_check, path + [index])
../../src/main/python/asserts.py:36: in _assert_equal
    _assert_equal(cpu[field], gpu[field], float_check, path + [field])

cpu = 231388876800, gpu = 231388873200
float_check = <function get_float_check.<locals>.<lambda> at 0x7f7b98cebbe0>
path = [14, 'unix_timestamp(a, yyyyMMdd)']
```
Steps/Code to reproduce bug
Run the failing test above with TZ=Africa/Casablanca and DATAGEN_SEED=1728130368.
Expected behavior
The test should pass, or the case should be skipped.
Environment details (please complete the following information)
- Environment location: rapids_it-non-utc-pre_release CI (pre-release integration tests)
- Spark configuration settings related to the issue: spark.sql.legacy.timeParserPolicy=LEGACY, spark.rapids.sql.incompatibleDateFormats.enabled=true, session time zone Africa/Casablanca
Additional context
Also cc @res-life to help; this case is still unstable in non-UTC environments.
Spark behavior:
```scala
scala> spark.conf.set("spark.sql.session.timeZone", "Africa/Casablanca")

scala> spark.conf.set("spark.sql.legacy.timeParserPolicy", "CORRECTED")

scala> spark.sql("select unix_timestamp('42481005', 'yyyyMMdd')").show()
+----------------------------------+
|unix_timestamp(42481005, yyyyMMdd)|
+----------------------------------+
|                       71910716400|
+----------------------------------+

scala> spark.conf.set("spark.sql.legacy.timeParserPolicy", "LEGACY")

scala> spark.sql("select unix_timestamp('42481005', 'yyyyMMdd')").show()
+----------------------------------+
|unix_timestamp(42481005, yyyyMMdd)|
+----------------------------------+
|                       71910720000|
+----------------------------------+
```
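The two results differ by exactly 3600 seconds (one hour), the same gap as the CPU/GPU mismatch in the failing test above. Below is a minimal standalone sketch of the two parsing paths, assuming LEGACY behaves like java.text.SimpleDateFormat on java.util's hybrid Julian/Gregorian calendar (as the quoted Spark code further down shows) while CORRECTED uses java.time on the proleptic Gregorian calendar; the exact printed values depend on the JVM's tzdata version:

```scala
import java.text.SimpleDateFormat
import java.time.{LocalDate, ZoneId}
import java.util.{Locale, TimeZone}

object LegacyVsCorrected {
  def main(args: Array[String]): Unit = {
    val tz = "Africa/Casablanca"

    // LEGACY-style parsing: lenient SimpleDateFormat on the hybrid calendar.
    val sdf = new SimpleDateFormat("yyyyMMdd", Locale.US)
    sdf.setTimeZone(TimeZone.getTimeZone(tz))
    sdf.setLenient(true)
    val legacySeconds = sdf.parse("42481005").getTime / 1000L

    // CORRECTED-style parsing: java.time on the proleptic Gregorian calendar.
    val correctedSeconds =
      LocalDate.of(4248, 10, 5).atStartOfDay(ZoneId.of(tz)).toEpochSecond

    // The transcript above shows 71910720000 (LEGACY) vs 71910716400
    // (CORRECTED): a one-hour gap from how the two APIs resolve this
    // zone's offset rules for a far-future date.
    println(s"LEGACY-style:    $legacySeconds")
    println(s"CORRECTED-style: $correctedSeconds")
  }
}
```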
GPU limitation
The GPU kernel is consistent with CORRECTED mode; it does not fully support LEGACY mode.
Follow-up
For LEGACY mode, we need to implement the parsing path Spark uses (Spark code for LEGACY mode: Link). It uses SimpleDateFormat:
```scala
import java.sql.Timestamp
import java.text.SimpleDateFormat
import java.time.ZoneId
import java.util.{Locale, TimeZone}

import org.apache.spark.sql.catalyst.util.DateTimeUtils.fromJavaTimestamp

class LegacySimpleTimestampFormatter(
    pattern: String,
    zoneId: ZoneId,
    locale: Locale,
    lenient: Boolean = true) extends TimestampFormatter {

  @transient private lazy val sdf = {
    val formatter = new SimpleDateFormat(pattern, locale)
    formatter.setTimeZone(TimeZone.getTimeZone(zoneId))
    formatter.setLenient(lenient)
    formatter
  }

  // Parse with the lenient legacy formatter, then convert the resulting
  // java.sql.Timestamp to microseconds since the epoch.
  override def parse(s: String): Long = {
    fromJavaTimestamp(new Timestamp(sdf.parse(s).getTime))
  }
}
```
Workaround
- Disable this case on branch 24.10 when TZ is not UTC or Asia/Shanghai.
- Update the documentation to clarify that not all non-DST (daylight saving time) time zones are supported; only the Asia/Shanghai time zone has been tested.

Let's use this issue to track support for the Africa/Casablanca time zone in LEGACY mode.
Thanks! Retargeted the issue to 24.12.
We do not want to implement a kernel just for Africa/Casablanca unless we implement it for all time zones!!!
https://github.com/NVIDIA/spark-rapids/issues/6839
We picked Africa/Casablanca as an arbitrary time zone. It is one of the more complicated ones in terms of rules, but it is not there because a customer needs it. If we are going to develop a solution to a problem like this we want to develop a general purpose solution.
The issue https://github.com/NVIDIA/spark-rapids/issues/6839 is for non-LEGACY mode. This requirement is for LEGACY mode. So this is another requirement.
For LEGACY mode, from the code: Spark invokes LegacySimpleTimestampFormatter to get a java.sql.Timestamp, then invokes fromJavaTimestamp and rebaseJulianToGregorianMicros. So this will require another kernel; a sketch of the chain follows below.
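A minimal, self-contained sketch of that chain, assuming Spark 3.x catalyst utilities on the classpath; per the Spark sources, DateTimeUtils.fromJavaTimestamp performs the rebaseJulianToGregorianMicros step internally:

```scala
import java.sql.Timestamp
import java.text.SimpleDateFormat
import java.util.{Locale, TimeZone}

import org.apache.spark.sql.catalyst.util.DateTimeUtils

object LegacyParseChain {
  def main(args: Array[String]): Unit = {
    // Step 1: lenient SimpleDateFormat parse, as LegacySimpleTimestampFormatter does.
    val sdf = new SimpleDateFormat("yyyyMMdd", Locale.US)
    sdf.setTimeZone(TimeZone.getTimeZone("Africa/Casablanca"))
    sdf.setLenient(true)
    val ts = new Timestamp(sdf.parse("42481005").getTime)

    // Step 2: convert to microseconds, rebasing the hybrid Julian/Gregorian
    // value onto the proleptic Gregorian calendar.
    val micros = DateTimeUtils.fromJavaTimestamp(ts)
    println(micros / 1000000L) // seconds since epoch, LEGACY semantics
  }
}
```

Any GPU kernel for LEGACY mode would have to reproduce both the lenient SimpleDateFormat parsing rules and this rebase step.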
Spark has different behaviors between LEGACY mode and non-LEGACY mode.
I think the priority for this issue is low.