dbt-databricks
Cache error for snapshots on top of parquet files
Describe the bug
When running `dbt snapshot` on top of an underlying parquet data source, a cache error occurs, in particular when columns are added or removed at the source. Note that this environment is NOT running Unity Catalog yet. I'm not sure whether that has an impact, but it feels relevant to mention.
Steps To Reproduce
1. Create a `source` on a parquet table in cloud storage / the lake
2. Run `dbt snapshot`
3. Update the columns in the source table
4. Run `dbt snapshot` again and observe the error message below
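For context, a minimal snapshot over such a source might look like the sketch below. The source name (`lake`), table name, and `unique_key` are illustrative assumptions, not taken from this report.

```sql
-- Minimal sketch of a snapshot over a parquet-backed source.
-- The source name ('lake'), table name, and unique_key are assumptions
-- for illustration; they are not from the original report.
{% snapshot parquet_table_snapshot %}

{{
    config(
        target_schema='snapshots',
        unique_key='id',
        strategy='check',
        check_cols='all'
    )
}}

select * from {{ source('lake', 'parquet_table') }}

{% endsnapshot %}
```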
NOTE: If `dbt snapshot` is run a THIRD time, it works. This makes the error message's reference to "restarting the cluster" hard to understand, because a restart doesn't seem to be strictly necessary.
Expected behavior
Snapshot works on the second run.
Screenshots and log output
Error while reading file s3://udemy-sd-classification/sd_classification/course_subcategory/part-00000-850de77d-0923-422b-8607-ae6f83e9a29e-c000.snappy.parquet. [DEFAULT_FILE_NOT_FOUND] It is possible the underlying files have been updated. You can explicitly invalidate the cache in Spark by running 'REFRESH TABLE tableName' command in SQL or by recreating the Dataset/DataFrame involved. If Delta cache is stale or the underlying files have been removed, you can invalidate Delta cache manually by restarting the cluster.
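The mitigation the error text itself suggests is Spark SQL's `REFRESH TABLE`, which invalidates the cached metadata and file listing for a table. A sketch, with an assumed fully qualified table name:

```sql
-- Invalidate Spark's cached metadata / file listing for the table,
-- as the error message suggests (the table name here is illustrative).
REFRESH TABLE hive_metastore.default.parquet_table;
```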
System information
The output of dbt --version:
1.5.5
The operating system you're using: dbt Cloud
Have you tried with Delta (just to try to scope the bug)? I'm unable to get the error you're seeing, but then again, I'm trying to recreate it in the dbt test harness. I get a different error instead, related to a missing column, and I get it regardless of file format.
Hey @benc-db, we don't see the error on Delta sources, and it's not a column-modification issue: the table metadata stays the same. Here is the whole story.
Steps To Reproduce
1. Create a `source` on a parquet table in cloud storage / the lake (`ParquetTable1`)
2. Run `dbt snapshot` (just to initially create the snapshot table)
3. Run `dbt snapshot`
   a. The dbt-snapshot materialization creates the view `ParquetTable1__dbt_tmp`
   b. The dbt-snapshot materialization runs a `MERGE` statement (target: `ParquetTable1_snapshot`, source: the `ParquetTable1__dbt_tmp` view)
Between 3.a. and 3.b., another process (not in Databricks) re-populates the table ParquetTable1 (with Spark SQL, insert-overwrite). The underlying s3 parquet files change, as expected. It will be a bit hard to reproduce the issue because the s3 files need to be modified/removed exactly between 3.a. and 3.b., and that window is only a few seconds :) (The `MERGE` from 3.b. is sketched below.)
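For reference, the `MERGE` that the snapshot materialization runs in 3.b. has roughly the shape below. This is a simplified sketch of what dbt generates, not the verbatim statement; the exact SQL depends on the adapter and dbt version.

```sql
-- Rough shape of the snapshot MERGE from step 3.b (simplified sketch,
-- not dbt's verbatim output). Because the source is a view over
-- ParquetTable1, Spark re-reads the underlying parquet files here; if
-- they were replaced after 3.a., the cached file listing is stale.
merge into ParquetTable1_snapshot as DBT_INTERNAL_DEST
using ParquetTable1__dbt_tmp as DBT_INTERNAL_SOURCE
on DBT_INTERNAL_SOURCE.dbt_scd_id = DBT_INTERNAL_DEST.dbt_scd_id

when matched
  and DBT_INTERNAL_DEST.dbt_valid_to is null
  and DBT_INTERNAL_SOURCE.dbt_change_type in ('update', 'delete')
  then update set dbt_valid_to = DBT_INTERNAL_SOURCE.dbt_valid_to

when not matched
  and DBT_INTERNAL_SOURCE.dbt_change_type = 'insert'
  then insert *
```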
More context:
- `ParquetTable1` is not a partitioned table; it is refreshed hourly by a non-Databricks Spark job.
- We are using hive_metastore.
- dbt-databricks is running against a DB SQL Warehouse.
- We tried adding `refresh table ParquetTable1` as a pre-hook command on the snapshot, but it didn't help because it ran before 3.a. (see the sketch below).
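A sketch of that pre-hook attempt, reusing the illustrative snapshot from above (names are assumptions): dbt executes `pre_hook` before the materialization starts, so the refresh lands before 3.a. and cannot cover the 3.a.-3.b. window.

```sql
-- Sketch of the pre-hook attempt described above (names illustrative).
-- dbt runs pre_hook before the snapshot materialization begins, so the
-- refresh happens before 3.a. and does not cover the 3.a.-3.b. window.
{% snapshot parquettable1_snapshot %}

{{
    config(
        target_schema='snapshots',
        unique_key='id',
        strategy='check',
        check_cols='all',
        pre_hook='refresh table ParquetTable1'
    )
}}

select * from {{ source('lake', 'ParquetTable1') }}

{% endsnapshot %}
```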
cc/ @matt-winkler