[FEA] Avoid CPU fallback due to date_format:Failed to convert Unsupported word: SSS null.

Open viadea opened this issue 3 years ago • 1 comments

I wish we could avoid the CPU fallback caused by date_format: Failed to convert Unsupported word: SSS null.

Reproduce:

import org.apache.spark.sql.functions._
import spark.implicits._
import org.apache.spark.sql.types._
 
// One-row DataFrame with decimal columns of various precisions plus a timestamp string.
var df = spark.sparkContext.parallelize(Seq(1)).toDF()
df = df.withColumn("value82", lit("123456.78").cast(DecimalType(8, 2))).
  withColumn("value63", lit("123.456").cast(DecimalType(6, 3))).
  withColumn("value1510", lit("12345.0123456789").cast(DecimalType(15, 10))).
  withColumn("value2510", lit("123456789012345.0123456789").cast(DecimalType(25, 10))).
  withColumn("value2901", lit("1234567890123456789012345678.1").cast(DecimalType(29, 1))).
  withColumn("value3802", lit("123456789012345678901234567890123456.01").cast(DecimalType(38, 2))).
  withColumn("timestring", lit("1997-02-28 10:30:00.012"))

df.write.format("parquet").mode("overwrite").save("/tmp/df.parquet")
df=spark.read.parquet("/tmp/df.parquet")
df.createOrReplaceTempView("df")

spark.sql("SELECT date_format(timestring,'yyyy-MM-dd HH:mm:ss.SSS') FROM df").collect

Not-supported-messages:

!Expression <DateFormatClass> date_format(cast(timestring#494 as timestamp), yyyy-MM-dd HH:mm:ss.SSS, Some(Etc/UTC)) cannot run on GPU because Failed to convert Unsupported word: SSS null

viadea avatar Jul 29 '22 00:07 viadea

Looking at the new docs for parsing strings to timestamps

https://github.com/rapidsai/cudf/blob/e099e01c9b6ab8a2db5d5ee446b8843ee6199acc/cpp/include/cudf/strings/convert/convert_datetime.hpp#L62-L66

It looks like we might be able to convert SSS to %3f, because each S corresponds to one more fractional digit of a second, just as the number between the % and the f does in the cuDF format. This would still need a bunch of testing to be sure that, if there is any rounding, we match it, etc. But at least for parsing numbers we might be able to do this.
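To make the proposed mapping concrete, here is a minimal sketch of translating a Spark pattern into a strftime-style format in which a run of n `S` characters becomes `%nf`. `PatternSketch` and `toStrf` are hypothetical names for illustration only, not the spark-rapids implementation. The field table is an assumption covering only the fields used in this issue, and quoted literals and rounding behavior are not handled.

```scala
// Hypothetical helper for illustration; not the spark-rapids implementation.
object PatternSketch {
  // Assumed one-to-one mappings for the fixed-width fields used in this issue.
  private val fixed = Map(
    "yyyy" -> "%Y", "MM" -> "%m", "dd" -> "%d",
    "HH" -> "%H", "mm" -> "%M", "ss" -> "%S")

  def toStrf(pattern: String): String = {
    val out = new StringBuilder
    var i = 0
    while (i < pattern.length) {
      val c = pattern.charAt(i)
      if (c.isLetter) {
        // A "word" is a run of the same pattern letter, e.g. "SSS" or "MMMM".
        val word = pattern.substring(i).takeWhile(_ == c)
        val spec: String =
          if (c == 'S') s"%${word.length}f" // one fractional digit per S
          else fixed.getOrElse(word,
            throw new IllegalArgumentException(s"Unsupported word: $word"))
        out.append(spec)
        i += word.length
      } else {
        // Separators such as '-', ':', '.' and ' ' are copied through unchanged.
        out.append(c)
        i += 1
      }
    }
    out.toString
  }
}
```

With this sketch, `PatternSketch.toStrf("yyyy-MM-dd HH:mm:ss.SSS")` yields `%Y-%m-%d %H:%M:%S.%3f`, while a word that is not in the table, such as `MMMM`, is rejected with an "Unsupported word" error.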

revans2 avatar Aug 02 '22 21:08 revans2

@sameerz @viadea Please pay attention to https://github.com/NVIDIA/spark-rapids/issues/6375. Although we can now support SSS, the example at the top of this issue still can't run directly as date_format(timestring, ...). Instead, the input timestring needs to be wrapped in to_timestamp(timestring, 'yyyy-MM-dd HH:mm:ss.SSS'). So the complete query will be a bit lengthy, like:

date_format(to_timestamp(timestring, 'yyyy-MM-dd HH:mm:ss.SSS'), 'yyyy-MM-dd HH:mm:ss.SSS')
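For reference, a minimal sketch of the wrapped query applied to the timestring value from the reproduction. `WorkaroundSketch`, the `df_sketch` view name, and the `formatted` alias are hypothetical names introduced here. Whether the expression actually stays on the GPU depends on the plugin version in use, so treat this only as an illustration of the query shape.

```scala
import org.apache.spark.sql.SparkSession

// Hypothetical wrapper for illustration; names are not from the issue.
object WorkaroundSketch {
  val pattern = "yyyy-MM-dd HH:mm:ss.SSS"

  def run(session: SparkSession): String = {
    import session.implicits._
    // Same timestring literal as in the reproduction above.
    Seq("1997-02-28 10:30:00.012").toDF("timestring")
      .createOrReplaceTempView("df_sketch")
    // Parse the string explicitly with to_timestamp, then format the result.
    session.sql(
      s"SELECT date_format(to_timestamp(timestring, '$pattern'), '$pattern') AS formatted " +
        "FROM df_sketch"
    ).head.getString(0)
  }
}
```

In spark-shell this can be called as `WorkaroundSketch.run(spark)`.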

ttnghia avatar Aug 22 '22 20:08 ttnghia

Are we going to support "MMMM" along with "SSS"?

!Expression <DateFormatClass> date_format(timestampF#857, MMMM, Some(UTC)) cannot run on GPU because Failed to convert Unsupported word: MMMM null

johnnyzhon avatar Sep 19 '22 01:09 johnnyzhon

@johnnyzhon

Are we going to support "MMMM" along with "SSS"?

!Expression <DateFormatClass> date_format(timestampF#857, MMMM, Some(UTC)) cannot run on GPU because Failed to convert Unsupported word: MMMM null

"MMMM" is very different from "SSS". If you need/want support for it, please file a separate feature request. "SSS" works because CUDF already has support for some sub-second formatting. MMMM is for the month of the year as a String, not a number. CUDF does not support this directly, and there is also the potential for the need to localize the output. All of this makes it much more complicated to do.

revans2 avatar Sep 19 '22 14:09 revans2