spark-rapids
[FEA] Avoid CPU fallback due to date_format:Failed to convert Unsupported word: SSS null.
I wish we could avoid the CPU fallback caused by date_format: Failed to convert Unsupported word: SSS null.
Reproduce:
import org.apache.spark.sql.functions._
import spark.implicits._
import org.apache.spark.sql.types._
var df = spark.sparkContext.parallelize(Seq(1)).toDF()
df=df.withColumn("value82", (lit("123456.78").cast(DecimalType(8,2)))).
withColumn("value63", (lit("123.456").cast(DecimalType(6,3)))).
withColumn("value1510", (lit("12345.0123456789").cast(DecimalType(15,10)))).
withColumn("value2510", (lit("123456789012345.0123456789").cast(DecimalType(25,10)))).
withColumn("value2901", (lit("1234567890123456789012345678.1").cast(DecimalType(29,1)))).
withColumn("value3802", (lit("123456789012345678901234567890123456.01").cast(DecimalType(38,2)))).
withColumn("timestring", (lit("1997-02-28 10:30:00.012")))
df.write.format("parquet").mode("overwrite").save("/tmp/df.parquet")
df=spark.read.parquet("/tmp/df.parquet")
df.createOrReplaceTempView("df")
spark.sql("SELECT date_format(timestring,'yyyy-MM-dd HH:mm:ss.SSS') FROM df").collect
Not-supported-messages:
!Expression <DateFormatClass> date_format(cast(timestring#494 as timestamp), yyyy-MM-dd HH:mm:ss.SSS, Some(Etc/UTC)) cannot run on GPU because Failed to convert Unsupported word: SSS null
Looking at the new docs for parsing strings to timestamps:
https://github.com/rapidsai/cudf/blob/e099e01c9b6ab8a2db5d5ee446b8843ee6199acc/cpp/include/cudf/strings/convert/convert_datetime.hpp#L62-L66
It looks like we might be able to convert SSS to %3f, because each S corresponds to one more fractional digit of a second, just as the number between the % and the f does in the cudf format. This would still need a fair amount of testing to be sure that, if there is any rounding, we match it. But at least for parsing numbers we might be able to do this.
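To make the idea concrete, here is a minimal sketch of the S-run to %Nf rewrite. It is not the plugin's actual conversion code: `FractionPattern`, `toCudfFraction`, and `cpuFormat` are hypothetical names. It only touches runs of S and ignores quoted literals in the pattern. The second helper uses the JVM's java.time formatter as a CPU-side reference, on the assumption that this reflects what Spark does on the CPU; it shows that SSS truncates extra digits rather than rounding them.

```scala
import java.time.LocalDateTime
import java.time.format.DateTimeFormatter

// Hypothetical helper names; a sketch, not the spark-rapids implementation.
object FractionPattern {
  private val fractionRun = "S+".r

  // Rewrite each run of n 'S' characters as the strftime-style "%nf".
  // Other pattern letters are left untouched, and quoted literals are not handled.
  def toCudfFraction(javaPattern: String): String =
    fractionRun.replaceAllIn(javaPattern, m => "%" + m.matched.length + "f")

  // CPU-side reference: format a timestamp with java.time for comparison.
  def cpuFormat(ts: LocalDateTime, javaPattern: String): String =
    DateTimeFormatter.ofPattern(javaPattern).format(ts)

  def main(args: Array[String]): Unit = {
    val pattern = "yyyy-MM-dd HH:mm:ss.SSS"
    println(toCudfFraction(pattern)) // yyyy-MM-dd HH:mm:ss.%3f

    // 10:30:00.012999: the fraction is cut to three digits, not rounded up to .013.
    val ts = LocalDateTime.of(1997, 2, 28, 10, 30, 0, 12999000)
    println(cpuFormat(ts, pattern)) // 1997-02-28 10:30:00.012
  }
}
```

Any GPU-side formatting would need to produce the same truncated digits to match, which is the kind of case the testing mentioned above would have to cover.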
@sameerz @viadea Please pay attention to https://github.com/NVIDIA/spark-rapids/issues/6375. Although we can now support SSS, the example at the top of this issue still can't run directly as date_format(timestring, ...). Instead, the input timestring needs to be wrapped in to_timestamp(timestring, 'yyyy-MM-dd HH:mm:ss.SSS'), so the complete query becomes a bit lengthy:
date_format(to_timestamp(timestring, 'yyyy-MM-dd HH:mm:ss.SSS'), 'yyyy-MM-dd HH:mm:ss.SSS')
Are we going to support "MMMM" along with "SSS"?
!Expression <DateFormatClass> date_format(timestampF#857, MMMM, Some(UTC)) cannot run on GPU because Failed to convert Unsupported word: MMMM null
@johnnyzhon
Are we going to support "MMMM" along with "SSS"?
!Expression <DateFormatClass> date_format(timestampF#857, MMMM, Some(UTC)) cannot run on GPU because Failed to convert Unsupported word: MMMM null
"MMMM" is very different from "SSS". If you need or want support for it, please file a separate feature request. "SSS" works because CUDF already has support for some sub-second formatting. "MMMM" is the month of the year as a string, not a number. CUDF does not support this directly, and the output may also need to be localized. All of this makes it much more complicated to do.
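As a small illustration of the localization point, the sketch below formats the same date with "MMMM" under two locales using the JVM's java.time formatter. `MonthNameDemo` and `monthName` are hypothetical names, and this only shows JVM behavior; it says nothing about how Spark or CUDF would pick a locale.

```scala
import java.time.LocalDate
import java.time.format.DateTimeFormatter
import java.util.Locale

// Hypothetical demo object: the same "MMMM" pattern yields locale-dependent text.
object MonthNameDemo {
  private val date = LocalDate.of(1997, 2, 28)

  // Format the fixed date's month name under the given locale.
  def monthName(locale: Locale): String =
    DateTimeFormatter.ofPattern("MMMM", locale).format(date)

  def main(args: Array[String]): Unit = {
    println(monthName(Locale.ENGLISH)) // February
    println(monthName(Locale.FRENCH))  // the French month name, which differs from the English one
  }
}
```

A numeric field like "SSS" has no such dependency, which is part of why it maps cleanly onto an existing format specifier while "MMMM" does not.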