Core: Use min sequence number on each partition to remove old delete files
In https://github.com/apache/iceberg/blob/master/core/src/main/java/org/apache/iceberg/ManifestFilterManager.java#L137, we use the min sequence number to filter the delete manifests, which lets us find older delete files and drop them.
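Roughly, the current behavior looks like the sketch below. This is a simplified, self-contained illustration with hypothetical types (`FileEntry`, `filterDeletes`), not the real `ManifestFilterManager` code:

```java
import java.util.List;
import java.util.stream.Collectors;

// A minimal sketch of the current global rule: a delete file can only apply to
// data files written before it, so once every live data file has a sequence
// number >= the delete file's, the delete file is dead weight and can be dropped.
class GlobalDeleteFilterSketch {
  record FileEntry(long sequenceNumber, boolean isDelete) {}

  static List<FileEntry> filterDeletes(List<FileEntry> entries) {
    // Global minimum data sequence number across the whole table.
    long minDataSeq = entries.stream()
        .filter(e -> !e.isDelete())
        .mapToLong(FileEntry::sequenceNumber)
        .min()
        .orElse(Long.MAX_VALUE);

    // Keep all data files, and keep only delete files that might still apply,
    // i.e. whose sequence number is >= the global minimum data sequence number.
    return entries.stream()
        .filter(e -> !e.isDelete() || e.sequenceNumber() >= minDataSeq)
        .collect(Collectors.toList());
  }
}
```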
But if we have a cold partition that is never compacted again, the min sequence number never advances. That means older delete files in other partitions will never be dropped, and the problem gets worse and worse over time: eventually the Spark driver OOMs and the table becomes unavailable.
In this PR, I use a map that holds a min sequence number per partition instead of the global min sequence number. But the change touches a lot of code, and I am not sure it is a good solution.
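The idea is roughly the sketch below, again with hypothetical types, and assuming delete files are scoped to a single partition (which is not true for deletes written against an unpartitioned spec); the real change lives in `ManifestFilterManager` and is more involved:

```java
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.stream.Collectors;

// A rough sketch of the per-partition idea: track the minimum data sequence
// number per partition, so a cold partition that is never compacted no longer
// pins the minimum for the whole table.
class PerPartitionDeleteFilterSketch {
  record FileEntry(String partition, long sequenceNumber, boolean isDelete) {}

  static List<FileEntry> filterDeletes(List<FileEntry> entries) {
    // Min data sequence number per partition instead of a single global value.
    Map<String, Long> minDataSeqByPartition = new HashMap<>();
    for (FileEntry e : entries) {
      if (!e.isDelete()) {
        minDataSeqByPartition.merge(e.partition(), e.sequenceNumber(), Math::min);
      }
    }

    // A delete file is kept only if it might still apply to a live data file
    // in its own partition.
    return entries.stream()
        .filter(e -> !e.isDelete()
            || e.sequenceNumber()
                >= minDataSeqByPartition.getOrDefault(e.partition(), Long.MAX_VALUE))
        .collect(Collectors.toList());
  }
}
```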
This PR can also solve another problem. Currently, when partial compaction is enabled, we do not drop the delete files either, because we cannot be sure the delete files are not referenced by other data files when committing the replace. That leaves a lot of useless delete files behind.
@rdblue Could you please review this? Are there any better solutions?
@rdblue @jackye1995 Could you take a look at this, please? I'm not sure if this is the right direction. Thanks.
Thanks for the pointer. This looks like it reads each manifest to build a cache of partitions in memory, but the main concern is that commit() should stay fast and not OOM when there are a lot of partitions. Probably the best way is to do this in a separate Spark job?
Yes, I couldn't agree more. Maybe rewriting the delete manifests should be considered.
@coolderli I modified, compiled, and tested based on the 0.13.1 source code, and it basically works. However, in testing I found that in some partitions the delete files were not fully cleaned up. Looking into it, the cause is that some data files have a smaller sequence number than the delete files. Compaction runs every 2 hours; were those files simply not picked up by it? Or is it that those data files never need compaction because their primary keys do not appear in the delete files? Flink writes with upsert mode enabled; in that case, will purely inserted data also generate delete files?
In addition, I also tried version 0.14.0 and found that this part of the code has changed quite a bit, so the original patch no longer seems to apply. Is there a PR that addresses this problem?
@peterxiong13 We have no PR for 0.14.0 yet. Those delete files are not removed because some equality delete files have a sequence number equal to the min sequence number, and only files with a sequence number strictly smaller than the min sequence number are dropped.
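In other words, the boundary case looks like this (a toy illustration with made-up numbers, not Iceberg code):

```java
// Only delete files whose sequence number is strictly smaller than the minimum
// data sequence number are dropped, so an equality delete with an equal
// sequence number is kept even though it may no longer match any live row.
class MinSequenceBoundarySketch {
  static boolean shouldDrop(long deleteFileSeq, long minDataSeq) {
    return deleteFileSeq < minDataSeq;
  }

  public static void main(String[] args) {
    long minDataSeq = 42L;
    System.out.println(shouldDrop(41L, minDataSeq)); // true  -> removed
    System.out.println(shouldDrop(42L, minDataSeq)); // false -> kept, the case described above
  }
}
```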
Thank you very much for your reply. Is the comparison done against the global sequence number in the release because it needs to support partition evolution? But the longer the table relies on the global sequence number comparison, the more delete files are left behind, and they can only be cleaned up by rewriting the whole table. We plan to use Iceberg to keep historical data for a long time, so this is not really workable for us. I hope the community can fix this problem as soon as possible and clear this obstacle for Iceberg's adoption.