delta-rs
delta-rs copied to clipboard
'list_objs` behaviour is not uniform among backends
Bug
The file system and Azure list_objs
implementations return only objects within the directory specified by path
which differs with the S3 implementation which returns all objects that have a prefix of path
. These issues were uncovered when implementing additional tests for vacuum in #669.
Looking at the doc string for the storage backend it states: /// Return a list of objects by path prefix in an async stream.
so it seems like the correct behavior is implemented by S3.
Looking for confirmation on what is expected so we can resolve the outliers.
@Blajda - thats for the report - seems I missed that in the last refactor ... added a PR for a fix #673
Hey @roeap I think this issue goes a bit deeper than that PR. I created a test to demonstrate the differences between s3 and the other backends in this branch. https://github.com/delta-io/delta-rs/compare/main...Blajda:delta-rs:backend_issue_tests
list_objs
only returns two items for the localfs and adls which are the single file and the directory but S3 will return 3 files which belong in the directory created by the test.
This is the output for each run:
Local FS
[ObjectMeta { path: "/tmp/delta-rs_tests.TSJeiiWZyEKi/dir", modified: 2022-07-06T02:35:03.208982172Z, size: Some(80) }, ObjectMeta { path: "/tmp/delta-rs_tests.TSJeiiWZyEKi/first_level", modified: 2022-07-06T02:35:03.208982172Z, size: Some(4) }]
S3
[ObjectMeta { path: "s3://deltars//dir/second_level", modified: 2022-07-06T02:20:04Z, size: Some(5) }, ObjectMeta { path: "s3://deltars//dir/second_level2", modified: 2022-07-06T02:20:04Z, size: Some(5) }, ObjectMeta { path: "s3://deltars//first_level", modified: 2022-07-06T02:20:03Z, size: Some(4) }]
ADLS (Note path is broken for adls I think your PR fixes this particular issue) but the key point here is only 2 items are in the list.
[ObjectMeta { path: "adls2://deltarstest/delta-rs-test-1657074779-18132", modified: 2022-07-06T02:33:00Z, size: Some(0) }, ObjectMeta { path: "adls2://deltarstest/delta-rs-test-1657074779-18132", modified: 2022-07-06T02:33:00Z, size: Some(4) }]
@Blajda - sorry it was a bit too late yesterday, and I had not worked carefully enough :). I updated the PR to recursively traverse the "directories" and also removed the claim that it would fix this issue completely :).
I do remember now the source of this confusion ... When implementing the list_dir
we were thinking about the delta log mostly. There we do not need recursive traversals - so both local and azure are not traversing into the directories. Not entirely sure if vacuum is the only one where we actually list the data files - but i "think" it is.
sorry it was a bit too late yesterday, and I had not worked carefully enough :)
No worries @roeap. Let me know if I should make a PR for the Integration framework outlined here: https://github.com/delta-io/delta-rs/compare/main...Blajda:delta-rs:backend_issue_tests I think it would be helpful to have write a single test which can be executed by multiple backends to ensure consistent interfaces. Would be useful for future backends.
That integration framework looks really cool! I think this could be very useful to us!
One ting to consider is, that we are planning to move to a different object_strore implementation all toghether. Datafusion already model to using the object_strore_rs crate (https://github.com/influxdata/object_store_rs), and once we en-par with the autorization options, I think we will start intherating this over here as well. This should give us very consistent behaviour across all object stores as well a an abstration over object store paths.
A framework to set up different environemts for the specific backends is helpful as well .. the more operations we do the better to integration test them all against the various backends ...
@Blajda @wjones127 - after merging the vacuum test suite and the preceding PRs, this is resolved, right?