
[BUG] VACUUM on a change data feed table

Open · pavs23 opened this issue 2 years ago

Bug

A VACUUM command in one of our jobs is causing this issue.

Getting the error message:

Caused by: com.databricks.sql.io.FileReadException: Error while reading file s3a://REDACTED_BUCKET_NAME/REDACTED_TABLE_NAME/data/_change_data/cdc-00000-bd542b2a-7c0c-487e-a594-8b3fceff97cp.c000.snappy.parquet. A file referenced in the transaction log cannot be found. This occurs when data has been manually deleted from the file system rather than using the table `DELETE` statement. For more information, see https://docs.databricks.com/delta/delta-intro.html#frequently-asked-questions

Describe the problem

I've got a suspicion that VACUUM cleaned up the files no longer referenced by the latest table version, and also removed the corresponding files under the _change_data folder, but the transaction log wasn't updated to match.

I know that the transaction log keeps history for 30 days.

When we set spark.sql.files.ignoreMissingFiles to true we don't get the error anymore. Is this the solution here?
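For reference, this is the setting in question (a minimal Spark SQL sketch). Note that it makes readers skip files the transaction log still references but that are gone from storage, so it hides the symptom rather than repairing the log/storage mismatch:

    -- Skip files referenced by the transaction log but missing from storage.
    SET spark.sql.files.ignoreMissingFiles = true;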

Environment information

  • DBR runtime: 10.4 LTS
  • Delta Lake version: ?
  • Spark version: Apache Spark 3.2.1
  • Scala version: Scala 2.12

Willingness to contribute

The Delta Lake Community encourages bug fix contributions. Would you or another member of your organization be willing to contribute a fix for this bug to the Delta Lake code base?

  • [ ] Yes. I can contribute a fix for this bug independently.
  • [X] Yes. I would be willing to contribute a fix for this bug with guidance from the Delta Lake community.
  • [ ] No. I cannot contribute a bug fix at this time.

pavs23 avatar Sep 01 '22 04:09 pavs23

I've got a suspicion that VACUUM cleaned up the files no longer referenced by the latest table version, and also removed the corresponding files under the _change_data folder, but the transaction log wasn't updated to match.

This is what VACUUM does. From the docs:

vacuum deletes only data files, not log files. Log files are deleted automatically and asynchronously after checkpoint operations.

What operation did you run after VACUUM that caused the FileReadException? You shouldn't be trying to read data after you have VACUUM'd it.
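For reference, VACUUM's retention window is adjustable per run; a minimal sketch (168 hours is the default retention, shown here only for illustration):

    -- Dry run: lists the files that would be deleted, without deleting them.
    VACUUM table_name RETAIN 168 HOURS DRY RUN;

    -- Actual cleanup of files older than the retention window.
    VACUUM table_name RETAIN 168 HOURS;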

scottsand-db avatar Sep 08 '22 19:09 scottsand-db

Thanks for getting back to me @scottsand-db. The reason I ran VACUUM was because querying the table changes was taking a long time.

Running select count(*) from table_changes('table_name', 0) or similar was taking minutes to respond. There were roughly 420 million records in the _change_data folder across the table's entire history.

I actually wanted to sort a subset of the data, but the size was unmanageable and I realised the outdated rows could go, which is why I ran VACUUM.

After running the VACUUM, running queries like: select * from table_changes('table_name', 15, 17) started to throw that exception.

Simplified scenario

Suppose you have a table whose latest commit version is 30, and you previously inserted a record at commit version 15 that has remained unmodified ever since.

After running VACUUM, would you expect select * from table_changes('table_name', 15, 17) to return that INSERT operation?
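For concreteness, a sketch of that scenario in Spark SQL (the table name is hypothetical; the retention check is disabled only to permit the aggressive zero-hour retention in the experiment):

    -- Hypothetical table with the change data feed enabled.
    CREATE TABLE demo (id INT, val STRING) USING delta
    TBLPROPERTIES ('delta.enableChangeDataFeed' = 'true');

    -- ... commits 1..30 happen here; the INSERT of interest lands at version 15 ...

    -- Allow a zero-hour retention purely for the experiment.
    SET spark.databricks.delta.retentionDurationCheck.enabled = false;
    VACUUM demo RETAIN 0 HOURS;

    -- Does this still return the version-15 INSERT?
    SELECT * FROM table_changes('demo', 15, 17);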

Thanks

pavs23 avatar Sep 14 '22 11:09 pavs23

The reason I ran VACUUM was because querying the table changes was taking a long time.

Just to set the right expectation: VACUUM will not speed up any queries. It's just an operation to save storage cost.

I actually wanted to sort a subset of the data, but the size was unmanageable and I realised the outdated rows could go.

It sounds to me like you just need to DELETE the data, e.g. run SQL DELETE queries.
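For example (the predicate is purely illustrative):

    -- Remove outdated rows through the transaction log, so readers stay consistent.
    DELETE FROM table_name WHERE event_date < '2022-01-01';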

After running VACUUM, would you expect select * from table_changes('table_name', 15, 17) to return that INSERT operation?

It depends on whether VACUUM deleted the files created between versions 15 and 17. Your query asks for all changes between versions 15 and 17, so the files created in that range must still exist.
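One way to check is to map versions to timestamps and compare them against the retention window used for VACUUM; a sketch:

    -- Shows the timestamp and operation for every commit, including versions 15-17.
    DESCRIBE HISTORY table_name;

    -- Files written by commits older than the retention threshold (7 days by default)
    -- are candidates for removal on the next VACUUM.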

zsxwing avatar Sep 14 '22 16:09 zsxwing

It depends on whether VACUUM deleted the files created between versions 15 and 17. Your query asks for all changes between versions 15 and 17, so the files created in that range must still exist.

Yeah okay fair enough, that makes sense.

I guess if we're using table_changes we can't just run VACUUM without considering time travel (as the docs suggest) and the change data feed.
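For anyone hitting this later, a possible approach is to align the file retention with the change-feed window before vacuuming; a minimal sketch in Spark SQL (the 30-day interval is illustrative, not a recommendation from this thread):

    -- Keep removed data files long enough to cover the change-feed range you still read.
    ALTER TABLE table_name SET TBLPROPERTIES (
      'delta.deletedFileRetentionDuration' = 'interval 30 days'
    );

    -- VACUUM now respects the longer window by default.
    VACUUM table_name;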

Thanks for the responses, happy to close this issue.

pavs23 avatar Sep 16 '22 01:09 pavs23