iceberg-rust icon indicating copy to clipboard operation
iceberg-rust copied to clipboard

Unexpected FileIO `remove_all` behavior for S3

Open c-thiel opened this issue 9 months ago • 14 comments

Apache Iceberg Rust version

main branch

Describe the bug

Currently when using the S3 FileIO's remove_all method we noticed the following behavior: Running remove_all('s3://foo/bar') removes s3://foo/bar* so also siblings such as s3://foo/bar-baz.

I would expect fileio to understand that I provide a folder, and only this folder and subfolders are removed, not siblings that have the same prefix.

@Xuanwo, @liurenjie1024, @Fokko do you agree? If so should we fix it in iceberg-rust or opendal?

To Reproduce

No response

Expected behavior

s3://foo/bar/* is deleted, not s3://foo/bar*

Willingness to contribute

I can contribute a fix for this bug independently

c-thiel avatar Mar 20 '25 17:03 c-thiel

ADLS Gen2 behaves differently. Even without a trailing slash siblings are not deleted. I think we should at least have consistent behavior, I would opt for the one described above.

The question remains what behavior is desired in opendal

c-thiel avatar Mar 20 '25 17:03 c-thiel

I guess this is opendal's behavior, and it seems reasonable to append "/" at end if not present. cc @Xuanwo

liurenjie1024 avatar Mar 21 '25 02:03 liurenjie1024

OpenDAL treats input path as a prefix so remove_all("abc") will remove all keys that starts "abc" so it's expected that "abcd" will also be removed.

The ADLS Gen2 behavior is not expected to me, it might be a bug that need to address.

Back to iceberg, if we want to ensure that remove_all is just dir, we can always append a / to the path.

Xuanwo avatar Mar 21 '25 02:03 Xuanwo

Hi, @c-thiel, what’s your expected behavior? I’m happy to help resolve this.

Xuanwo avatar Mar 24 '25 07:03 Xuanwo

@Xuanwo I would slightly tend to only remove "abc" in your example, because its the safe default. My main point is however that behavior should be consistent across storages.

GCS works like Azure. So out of those three, S3 is the odd one out.

c-thiel avatar Mar 27 '25 15:03 c-thiel

I think if a client calls remove_all(table.location), this should work and be safe. So I would argue we should treat the input always as a directory / append a path.

c-thiel avatar Mar 27 '25 15:03 c-thiel

Thank you @c-thiel, I will get this fixed.

And also, I confirmed that the opendal behavior in ADLS Gen2 is wrong. I will take another look.

GCS works like Azure. So out of those three, S3 is the odd one out.

This is a surprise. Have you enabled gcs's Hierarchical Namespace for the bucket?

Xuanwo avatar Mar 28 '25 09:03 Xuanwo

I think if a client calls remove_all(table.location), this should work and be safe. So I would argue we should treat the input always as a directory / append a path.

I have a suggestion that rename this API to remove_dir_all which indicate that it only removes a dir recursively. And in this API, we make sure that both abc and abc/ works correctly. What do you think?

This also aligns with std's remove_dir_all.

Or are you expecting us to first check if abc/ exists and then decide whether to remove a prefix or a directory?

I'm fine with both approaches at the iceberg FileIO side.

Xuanwo avatar Mar 28 '25 09:03 Xuanwo

I think if a client calls remove_all(table.location), this should work and be safe. So I would argue we should treat the input always as a directory / append a path.

I have a suggestion that rename this API to remove_dir_all which indicate that it only removes a dir recursively. And in this API, we make sure that both abc and abc/ works correctly. What do you think?

This also aligns with std's remove_dir_all.

Or are you expecting us to first check if abc/ exists and then decide whether to remove a prefix or a directory?

I'm fine with both approaches at the iceberg FileIO side.

+1 for following std's naming convention to have different methods for remove_dir_all and remove_dir(for empty dir)

liurenjie1024 avatar Mar 31 '25 13:03 liurenjie1024

Nice, let me handle this.

Xuanwo avatar Apr 07 '25 09:04 Xuanwo

cc @Xuanwo Are we going to fix this ? It's the last issue in 0.5.0 release

liurenjie1024 avatar Apr 21 '25 07:04 liurenjie1024

cc @Xuanwo Are we going to fix this ? It's the last issue in 0.5.0 release

Yep, let me do this in this week.

Xuanwo avatar Apr 21 '25 08:04 Xuanwo

cc @Xuanwo Do you have time to work on this? I can handle it if you are busy.

liurenjie1024 avatar Apr 29 '25 06:04 liurenjie1024

cc @Xuanwo Do you have time to work on this? I can handle it if you are busy.

Will finish this today 😚

Xuanwo avatar Apr 29 '25 06:04 Xuanwo