Unexpected FileIO `remove_all` behavior for S3
Apache Iceberg Rust version
main branch
Describe the bug
Currently when using the S3 FileIO's remove_all method we noticed the following behavior:
Running remove_all('s3://foo/bar') removes s3://foo/bar* so also siblings such as s3://foo/bar-baz.
I would expect fileio to understand that I provide a folder, and only this folder and subfolders are removed, not siblings that have the same prefix.
@Xuanwo, @liurenjie1024, @Fokko do you agree? If so should we fix it in iceberg-rust or opendal?
To Reproduce
No response
Expected behavior
s3://foo/bar/* is deleted, not s3://foo/bar*
Willingness to contribute
I can contribute a fix for this bug independently
ADLS Gen2 behaves differently. Even without a trailing slash siblings are not deleted. I think we should at least have consistent behavior, I would opt for the one described above.
The question remains what behavior is desired in opendal
I guess this is opendal's behavior, and it seems reasonable to append "/" at end if not present. cc @Xuanwo
OpenDAL treats input path as a prefix so remove_all("abc") will remove all keys that starts "abc" so it's expected that "abcd" will also be removed.
The ADLS Gen2 behavior is not expected to me, it might be a bug that need to address.
Back to iceberg, if we want to ensure that remove_all is just dir, we can always append a / to the path.
Hi, @c-thiel, what’s your expected behavior? I’m happy to help resolve this.
@Xuanwo I would slightly tend to only remove "abc" in your example, because its the safe default. My main point is however that behavior should be consistent across storages.
GCS works like Azure. So out of those three, S3 is the odd one out.
I think if a client calls remove_all(table.location), this should work and be safe. So I would argue we should treat the input always as a directory / append a path.
Thank you @c-thiel, I will get this fixed.
And also, I confirmed that the opendal behavior in ADLS Gen2 is wrong. I will take another look.
GCS works like Azure. So out of those three, S3 is the odd one out.
This is a surprise. Have you enabled gcs's Hierarchical Namespace for the bucket?
I think if a client calls
remove_all(table.location), this should work and be safe. So I would argue we should treat the input always as a directory / append a path.
I have a suggestion that rename this API to remove_dir_all which indicate that it only removes a dir recursively. And in this API, we make sure that both abc and abc/ works correctly. What do you think?
This also aligns with std's remove_dir_all.
Or are you expecting us to first check if abc/ exists and then decide whether to remove a prefix or a directory?
I'm fine with both approaches at the iceberg FileIO side.
I think if a client calls
remove_all(table.location), this should work and be safe. So I would argue we should treat the input always as a directory / append a path.I have a suggestion that rename this API to
remove_dir_allwhich indicate that it only removes a dir recursively. And in this API, we make sure that bothabcandabc/works correctly. What do you think?This also aligns with std's
remove_dir_all.Or are you expecting us to first check if
abc/exists and then decide whether to remove a prefix or a directory?I'm fine with both approaches at the iceberg FileIO side.
+1 for following std's naming convention to have different methods for remove_dir_all and remove_dir(for empty dir)
Nice, let me handle this.
cc @Xuanwo Are we going to fix this ? It's the last issue in 0.5.0 release
cc @Xuanwo Are we going to fix this ? It's the last issue in 0.5.0 release
Yep, let me do this in this week.
cc @Xuanwo Do you have time to work on this? I can handle it if you are busy.
cc @Xuanwo Do you have time to work on this? I can handle it if you are busy.
Will finish this today 😚