[BUG] Can't access Unity Catalog data on Databricks AWS cluster
**Describe the bug**
I have set up RAPIDS on a Databricks AWS cluster (runtime 12.2.x-gpu-ml-scala2.12) as described in https://docs.nvidia.com/spark-rapids/user-guide/23.12.2/getting-started/databricks.html. I am then trying to read a Delta Lake table in Unity Catalog (we have it enabled, as it's Databricks' main data catalog offering):
```python
data = spark.read.format("delta").table("table name").toPandas()
```
and get an access-denied error:
```
: org.apache.spark.SparkException: Exception thrown in awaitResult: Job aborted due to stage failure: Task 0 in stage 5.0 failed 4 times, most recent failure: Lost task 0.3 in stage 5.0 (TID 4264) (172.17.190.255 executor 2): java.nio.file.AccessDeniedException: s3://categories/8i/part-00073-d55722a3-743e-4b23-93eb-d264d9f0b897.c000.snappy.parquet: getFileStatus on s3://categories/8i/part-00073-d55722a3-743e-4b23-93eb-d264d9f0b897.c000.snappy.parquet: com.amazonaws.services.s3.model.AmazonS3Exception: Forbidden
```
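A quick way to check whether the failure is specific to the GPU read path, assuming `spark.rapids.sql.enabled` can be toggled at runtime (spark-rapids supports this; the table name below is the same placeholder as above):

```python
# Disable the RAPIDS SQL plugin for this session and retry the same read.
# If the CPU path succeeds, the Forbidden error is specific to the GPU
# Parquet read path rather than to cluster-level S3 permissions.
spark.conf.set("spark.rapids.sql.enabled", "false")
spark.read.format("delta").table("table name").limit(10).show()  # placeholder name
spark.conf.set("spark.rapids.sql.enabled", "true")  # re-enable afterwards
```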
**Steps/Code to reproduce bug**
Set up RAPIDS on the 12.2.x-gpu-ml-scala2.12 Databricks AWS cluster runtime with the following Spark config:
```
spark.rapids.sql.concurrentGpuTasks 2
spark.executorEnv.PYTHONPATH /databricks/jars/rapids-4-spark_2.12-23.12.2.jar:/databricks/spark/python
spark.rapids.sql.python.gpu.enabled true
spark.rapids.memory.pinnedPool.size 2G
spark.rapids.sql.format.parquet.reader.type PERFILE
spark.task.resource.gpu.amount 0.1
spark.plugins com.nvidia.spark.SQLPlugin
spark.python.daemon.module rapids.daemon_databricks
```
Then install cuDF:

```
pip install cudf-cu11==23.12.1
```
Read any Delta Lake table in Unity Catalog.
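As a sanity check before the read (a sketch, assuming the session picks up the cluster config above), the plugin settings and the cuDF install can be verified with:

```python
# Verify the RAPIDS plugin settings are visible to the running session
# (keys match the cluster Spark config listed above).
for key in [
    "spark.plugins",
    "spark.rapids.sql.concurrentGpuTasks",
    "spark.rapids.sql.format.parquet.reader.type",
]:
    print(key, "=", spark.conf.get(key, "<not set>"))

import cudf  # from the pip install step above
print("cudf", cudf.__version__)
```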
**Expected behavior**
As Unity Catalog is Databricks' main data catalog offering and is enabled by default, I would expect RAPIDS to support Unity Catalog data access. Without the RAPIDS config, the cluster can access Unity Catalog normally.
**Additional context**
While this error looks different, it might be related to https://github.com/NVIDIA/spark-rapids/issues/10318.
@captify-sivakhno Could you please provide the details of the table? You can retrieve this information by executing either `spark.sql("DESCRIBE FORMATTED unity_catalog_name.schema_name.table_name").show(truncate=False)` in Spark or `DESCRIBE FORMATTED unity_catalog_name.schema_name.table_name;` in the SQL Editor. Could you also share the complete stack trace of the error?
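For an S3 access issue, the rows of most interest are presumably the table's Location, Provider, and Type; a sketch that pulls just those (same placeholder three-level name as above):

```python
details = spark.sql("DESCRIBE FORMATTED unity_catalog_name.schema_name.table_name")
# DESCRIBE FORMATTED returns (col_name, data_type, comment) rows; the storage
# metadata appears as rows named Location, Provider, Type, etc.
details.filter("col_name IN ('Location', 'Provider', 'Type')").show(truncate=False)
```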
@captify-sivakhno What is the logic of this use case? I see that you first read a Spark DataFrame and then convert it to a pandas DataFrame. Does that mean you plan to do some transformations on the pandas DataFrame?
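If the end goal is pandas-style transformation, one common pattern (a sketch with placeholder table and column names) is to keep the heavy work in Spark, where the RAPIDS plugin can accelerate it, and convert only the reduced result:

```python
from pyspark.sql import functions as F

df = spark.read.table("unity_catalog_name.schema_name.table_name")  # placeholder
summary = df.groupBy("category").agg(F.count("*").alias("n"))  # hypothetical column
pdf = summary.toPandas()  # only the aggregated rows are collected to the driver
```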