Flink: rewriteDataFiles rewrites two small files into the same two small files as before
Here is an unpartitioned table with two data files in Parquet format:
- a.parquet 200M
- b.parquet 250M
I ran Flink's rewriteDataFiles action with targetSize set to 300M. After the rewrite completed, two new files had been generated, and they are exactly the same as the two old files. In this situation no new data files should be written at all; it is a waste of resources.
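To make the arithmetic concrete: 200M + 250M = 450M, which exceeds the 300M target, so the two files cannot share one output file and each ends up in a group of its own. The sketch below is a simplified first-fit packing written only for illustration. It is not Iceberg's actual planning code, and the class and method names (`PackingSketch`, `pack`) are hypothetical.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Simplified first-fit packing, for illustration only.
// This is NOT Iceberg's planning code; all names here are hypothetical.
final class PackingSketch {

    // Places each file size (in MB) into the first bin that still has room
    // under targetMb, opening a new bin when none fits.
    static List<List<Long>> pack(List<Long> fileSizesMb, long targetMb) {
        List<List<Long>> bins = new ArrayList<>();
        List<Long> binTotals = new ArrayList<>();
        for (long size : fileSizesMb) {
            boolean placed = false;
            for (int i = 0; i < bins.size(); i++) {
                if (binTotals.get(i) + size <= targetMb) {
                    bins.get(i).add(size);
                    binTotals.set(i, binTotals.get(i) + size);
                    placed = true;
                    break;
                }
            }
            if (!placed) {
                List<Long> bin = new ArrayList<>();
                bin.add(size);
                bins.add(bin);
                binTotals.add(size);
            }
        }
        return bins;
    }

    public static void main(String[] args) {
        // a.parquet (200M) and b.parquet (250M) with a 300M target:
        // each file lands in its own single-file group.
        System.out.println(pack(Arrays.asList(200L, 250L), 300L)); // prints [[200], [250]]
    }
}
```

A group that contains a single, complete file produces an identical output file when rewritten, which matches what I observed.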
This should be partially fixed in the latest release. One of our issues was that the algorithm basically could not handle compacting files that were each below the target size but would be too large when combined. Files are now combined at the offset level, so more efficient compaction should be possible. Additionally, I believe there is other work that checks whether any generated rewrite tasks would end up being a no-op and skips them.
#3292
Ah sorry, I missed that this is in the Flink implementation, but I believe the fix may apply to that engine as well, since it's in the split planning code.
@RussellSpitzer Thank you for your answer. I incorporated the PR you mentioned, but the problem persists, so I think this may not be the same issue as the one you describe. I traced through the process, and the problem seems to be in the isPartialFileScan function. The function works correctly for data files in Avro format. With data files in Parquet format, however, the file itself begins with a 4-byte offset, so the check is wrong: it returns true even when the scan covers a complete Parquet file.
Should the check take the file format into account here? For Parquet, we would need to add the initial 4-byte offset to fileScanTask.length().
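A minimal sketch of the proposal follows. It assumes the check compares the file size with the task length, and that a task covering a whole Parquet file starts after the 4-byte "PAR1" magic, so its length is 4 bytes short of the file size. The class and method names are hypothetical stand-ins, not Iceberg's real classes or signatures.

```java
// Hypothetical stand-ins for illustration; these are not Iceberg classes.
final class PartialScanSketch {

    // A Parquet file begins with the 4-byte magic "PAR1".
    static final long PARQUET_MAGIC_BYTES = 4;

    // Assumed shape of the current check: any length mismatch counts as partial.
    static boolean naiveIsPartial(long fileSizeInBytes, long taskLength) {
        return fileSizeInBytes != taskLength;
    }

    // Proposed variant: add the leading magic bytes back for Parquet
    // before comparing the task length with the file size.
    static boolean formatAwareIsPartial(String format, long fileSizeInBytes, long taskLength) {
        long headerBytes = "parquet".equalsIgnoreCase(format) ? PARQUET_MAGIC_BYTES : 0;
        return fileSizeInBytes != taskLength + headerBytes;
    }

    public static void main(String[] args) {
        long fileSize = 1000;
        long fullParquetTaskLength = fileSize - PARQUET_MAGIC_BYTES; // assumed: task starts at offset 4

        // The naive check flags the complete file as partial; the format-aware one does not.
        System.out.println(naiveIsPartial(fileSize, fullParquetTaskLength));
        System.out.println(formatAwareIsPartial("parquet", fileSize, fullParquetTaskLength));
    }
}
```

Comparing the task's start offset plus its length against the file size might be a more robust alternative than hard-coding a per-format constant, but I have not tested either variant against the real planning code.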
This issue has been automatically marked as stale because it has been open for 180 days with no activity. It will be closed in next 14 days if no further activity occurs. To permanently prevent this issue from being considered stale, add the label 'not-stale', but commenting on the issue is preferred when possible.
This issue has been closed because it has not received any activity in the last 14 days since being marked as 'stale'