[BUG] VACUUM on a change data feed table
Bug
My `VACUUM` command on a job is causing this issue.
Getting the error message:
Caused by: com.databricks.sql.io.FileReadException: Error while reading file s3a://REDACTED_BUCKET_NAME/REDACTED_TABLE_NAME/data/_change_data/cdc-00000-bd542b2a-7c0c-487e-a594-8b3fceff97cp.c000.snappy.parquet. A file referenced in the transaction log cannot be found. This occurs when data has been manually deleted from the file system rather than using the table `DELETE` statement. For more information, see https://docs.databricks.com/delta/delta-intro.html#frequently-asked-questions
Describe the problem
I've got a suspicion that the `VACUUM` cleaned up the files no longer referenced in the latest table version and also removed the relevant data in the `_change_data` folder, but the transaction log wasn't changed.
I know that the transaction log keeps history for 30 days.
When we set `spark.sql.files.ignoreMissingFiles` to `true`, we don't get the error anymore.
Is this the solution here?
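For context, this is roughly how we're applying the workaround at the moment (a minimal sketch; `table_name` is a placeholder, not our real table):

```sql
-- Workaround sketch: ignore data files that are listed in the transaction
-- log but no longer exist on storage.
SET spark.sql.files.ignoreMissingFiles = true;

SELECT count(*)
FROM table_changes('table_name', 0);
```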
Environment information
- DBR runtime: 10.4 LTS
- Delta Lake version: ?
- Spark version: Apache Spark 3.2.1
- Scala version: Scala 2.12
Willingness to contribute
The Delta Lake Community encourages bug fix contributions. Would you or another member of your organization be willing to contribute a fix for this bug to the Delta Lake code base?
- [ ] Yes. I can contribute a fix for this bug independently.
- [X] Yes. I would be willing to contribute a fix for this bug with guidance from the Delta Lake community.
- [ ] No. I cannot contribute a bug fix at this time.
> I've got a suspicion that the `VACUUM` cleaned up the files no longer referenced in the latest table version and also removed the relevant data in the `_change_data` folder, but the transaction log wasn't changed.
This is what `VACUUM` does. See the docs:

> vacuum deletes only data files, not log files. Log files are deleted automatically and asynchronously after checkpoint operations.
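If it helps, a `DRY RUN` shows which data files a `VACUUM` would remove before you actually run it (a sketch; the retention and table name are illustrative):

```sql
-- List the data files (including files under _change_data) that fall
-- outside the retention window; DRY RUN does not delete anything.
VACUUM table_name RETAIN 168 HOURS DRY RUN;

-- The real VACUUM removes those files but never rewrites the transaction log.
VACUUM table_name RETAIN 168 HOURS;
```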
What operation did you run after `VACUUM` that caused the `FileReadException`? You shouldn't be trying to read data after you have VACUUM'd it.
Thanks for getting back to me @scottsand-db.
The reason I ran `VACUUM` was that querying the table changes was taking a long time. Running `select count(*) from table_changes('table_name', 0)` or similar was taking minutes to respond.
There were roughly 420 million records in the `_change_data` folder covering the table's entire history. I actually wanted to sort a subset of the data, but the size was unmanageable and I realised the outdated rows could go, which is why I ran `VACUUM`.
After running the `VACUUM`, queries like `select * from table_changes('table_name', 15, 17)` started to throw that exception.
Simplified Scenario
Suppose you have a table whose maximum commit version is 30, a record was inserted at commit version 15, and that record has remained unmodified ever since. After running `VACUUM`, would you expect `select * from table_changes('table_name', 15, 17)` to return that INSERT operation?
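As a sketch of what I mean (the table name and retention are placeholders, not our actual job):

```sql
-- The record was inserted long ago at commit version 15 and never modified,
-- so its files (including the _change_data file) are well past the
-- retention window by the time the VACUUM runs.
VACUUM table_name;  -- default retention

-- After the VACUUM: does this still return the INSERT from version 15,
-- or does it fail because the _change_data file for that commit is gone?
SELECT *
FROM table_changes('table_name', 15, 17);
```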
Thanks
> The reason I ran `VACUUM` was that querying the table changes was taking a long time.
Just to set the right expectation: `VACUUM` will not speed up any queries. It's just an operation to save on storage cost.
> I actually wanted to sort a subset of the data, but the size was unmanageable and I realised the outdated rows could go.
It sounds to me like you just need to DELETE the data, e.g. just run SQL `DELETE` queries.
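Something like this, for example (a sketch; the `last_updated` column and the 90-day cutoff are purely illustrative, not something from your table):

```sql
-- Delete the outdated rows through the Delta log instead of cleaning up
-- files directly; the predicate is purely illustrative.
DELETE FROM table_name
WHERE last_updated < date_sub(current_date(), 90);

-- A later VACUUM then reclaims the storage for the rewritten files once
-- they are older than the retention window.
VACUUM table_name;
```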
> After running `VACUUM`, would you expect `select * from table_changes('table_name', 15, 17)` to return that INSERT operation?
It depends on whether `VACUUM` deleted the files created between versions 15 and 17. Your query asks for all changes between versions 15 and 17, so the files created between those versions must still exist.
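If you need `table_changes` over those older versions to keep working, one option (a sketch; the interval is just an example) is to raise the retention that `VACUUM` honours via the table property:

```sql
-- Keep removed data files (including _change_data files) for 30 days so
-- change feed queries over older versions still find their files.
ALTER TABLE table_name
SET TBLPROPERTIES ('delta.deletedFileRetentionDuration' = 'interval 30 days');
```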
> It depends on whether `VACUUM` deleted the files created between versions 15 and 17. Your query asks for all changes between versions 15 and 17, so the files created between those versions must still exist.
Yeah okay fair enough, that makes sense.
I guess if we're using table changes we can't just run `VACUUM` without any consideration for time travel (as the docs suggest) and for the change data feed.
Thanks for the responses, happy to close this issue.