Don't fall back to the CPU when referencing the hidden "_metadata" column.
We currently do not support the hidden "_metadata" column and fall back to the CPU when we see it referenced (this is for Spark 3.3.0). The repro below shows the resulting plan; a small plan-inspection sketch follows the log output.
scala> spark.read.parquet("./target/DF").selectExpr("*", "_metadata").show(truncate=false)
23/01/04 16:31:07 WARN GpuOverrides:
!Exec <CollectLimitExec> cannot run on GPU because the Exec CollectLimitExec has been disabled, and is disabled by default because Collect Limit replacement can be slower on the GPU, if huge number of rows in a batch it could help by limiting the number of rows transferred from GPU to CPU. Set spark.rapids.sql.exec.CollectLimitExec to true if you wish to enable it
@Partitioning <SinglePartition$> could run on GPU
*Exec <ProjectExec> will run on GPU
*Expression <Alias> cast(a#64 as string) AS a#80 will run on GPU
*Expression <Cast> cast(a#64 as string) will run on GPU
*Expression <Alias> cast(_metadata#70 as string) AS _metadata#83 will run on GPU
*Expression <Cast> cast(_metadata#70 as string) will run on GPU
*Exec <ProjectExec> will run on GPU
*Expression <Alias> named_struct(file_path, file_path#88, file_name, file_name#89, file_size, file_size#90L, file_modification_time, file_modification_time#91) AS _metadata#70 will run on GPU
*Expression <CreateNamedStruct> named_struct(file_path, file_path#88, file_name, file_name#89, file_size, file_size#90L, file_modification_time, file_modification_time#91) will run on GPU
!Exec <FileSourceScanExec> cannot run on GPU because hidden metadata columns are not supported on GPU
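For reference, here is a minimal way to check the replacement programmatically instead of scraping the GpuOverrides WARN output. This is a sketch only: it assumes a spark-shell session with the RAPIDS plugin loaded, spark.rapids.sql.enabled=true, and the same ./target/DF path as above; GpuFileSourceScanExec is the node name we would expect to see once the scan no longer falls back.

```scala
// Build the same query as the repro above.
val df = spark.read.parquet("./target/DF").selectExpr("*", "_metadata")

// Render the physical plan and look for the GPU scan node. With the current
// behavior the scan stays as FileSourceScanExec (CPU) even though the
// projections above it are replaced with Gpu* nodes.
val plan = df.queryExecution.executedPlan.toString
val scanOnGpu = plan.contains("GpuFileSourceScanExec")
println(s"scan replaced on GPU: $scanOnGpu")
```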
I am not sure about row_index. I don't see that in any of the PRs for the spark issue. It looks like we actually do all of the work for this on the GPU already, so it might be worth not falling back to the CPU and adding in some tests to cover it, but that is a separate issue.
Originally posted by @revans2 in https://github.com/NVIDIA/spark-rapids/issues/7452#issuecomment-1371160023
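On the testing side, a rough sketch of the coverage the comment above suggests: run the same _metadata query with the plugin disabled and then enabled, and compare the results. This just toggles spark.rapids.sql.enabled at runtime and compares row sets (collect order is not guaranteed); real coverage would more likely go into the integration test suite.

```scala
// Sketch only: assumes a session with the RAPIDS plugin on the classpath and
// the ./target/DF data from the repro above.
def collectMetadata(onGpu: Boolean): Set[org.apache.spark.sql.Row] = {
  spark.conf.set("spark.rapids.sql.enabled", onGpu.toString)
  spark.read.parquet("./target/DF")
    .selectExpr("*", "_metadata.file_path", "_metadata.file_name",
      "_metadata.file_size", "_metadata.file_modification_time")
    .collect()
    .toSet
}

val cpuRows = collectMetadata(onGpu = false)
val gpuRows = collectMetadata(onGpu = true)
assert(cpuRows == gpuRows, "GPU results differ from CPU results")
```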
When we first wrote the FileSourceScanExec code, we fell back to the CPU if we saw a "_metadata" column in the output. Looking at how that "_metadata" column is actually produced, though, it looks like we should be able to support it; we just need to understand a bit better exactly what is happening here.
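The plan above already hints at how the column is produced: the scan only has to emit constant-per-file values (file_path, file_name, file_size, file_modification_time), and a ProjectExec above it assembles them into the struct with CreateNamedStruct. Selecting the subfields directly makes that easier to see; this is just a sketch against the same ./target/DF path.

```scala
import org.apache.spark.sql.functions.col

// Each of these is a constant per input file, which is what the scan has to
// provide; the struct itself is built by the projection on top of the scan.
val fields = spark.read.parquet("./target/DF")
  .select(
    col("_metadata.file_path"),
    col("_metadata.file_name"),
    col("_metadata.file_size"),
    col("_metadata.file_modification_time"))

fields.explain()                 // check which nodes were replaced
fields.show(truncate = false)
```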
In the context of Delta, the most important file metadata column is row_index. When row_index is generated by the underlying Parquet format reader, a Delta scan with deletion vectors (DVs) becomes splittable; see https://github.com/delta-io/delta/pull/2933.
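For row_index specifically, a hedged sketch: row_index is not part of the 3.3.0 metadata struct shown in the plan above, so this assumes a newer Spark (3.5+) whose Parquet source exposes _metadata.row_index. The point is just to have a query we can use to see whether the scan keeps running on the GPU once that field is requested.

```scala
// Sketch only; requires a Spark version whose Parquet source exposes
// _metadata.row_index (not available in the 3.3.0 plan shown above).
val withRowIndex = spark.read.parquet("./target/DF")
  .selectExpr("*", "_metadata.row_index")

withRowIndex.explain()                 // does the scan still run on the GPU?
withRowIndex.show(truncate = false)
```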
Lowered the priority and moved to 26.02 because we are changing the physical plan in #13843 to drop the file metadata on the GPU call path for now.