delta-rs icon indicating copy to clipboard operation
delta-rs copied to clipboard

'list_objs` behaviour is not uniform among backends

Open Blajda opened this issue 2 years ago • 7 comments

Bug

The file system and Azure list_objs implementations return only objects within the directory specified by path which differs with the S3 implementation which returns all objects that have a prefix of path. These issues were uncovered when implementing additional tests for vacuum in #669.

Looking at the doc string for the storage backend it states: /// Return a list of objects by path prefix in an async stream. so it seems like the correct behavior is implemented by S3.

Looking for confirmation on what is expected so we can resolve the outliers.

Blajda avatar Jul 04 '22 22:07 Blajda

@Blajda - thats for the report - seems I missed that in the last refactor ... added a PR for a fix #673

roeap avatar Jul 05 '22 20:07 roeap

Hey @roeap I think this issue goes a bit deeper than that PR. I created a test to demonstrate the differences between s3 and the other backends in this branch. https://github.com/delta-io/delta-rs/compare/main...Blajda:delta-rs:backend_issue_tests list_objs only returns two items for the localfs and adls which are the single file and the directory but S3 will return 3 files which belong in the directory created by the test.

This is the output for each run: Local FS [ObjectMeta { path: "/tmp/delta-rs_tests.TSJeiiWZyEKi/dir", modified: 2022-07-06T02:35:03.208982172Z, size: Some(80) }, ObjectMeta { path: "/tmp/delta-rs_tests.TSJeiiWZyEKi/first_level", modified: 2022-07-06T02:35:03.208982172Z, size: Some(4) }]

S3 [ObjectMeta { path: "s3://deltars//dir/second_level", modified: 2022-07-06T02:20:04Z, size: Some(5) }, ObjectMeta { path: "s3://deltars//dir/second_level2", modified: 2022-07-06T02:20:04Z, size: Some(5) }, ObjectMeta { path: "s3://deltars//first_level", modified: 2022-07-06T02:20:03Z, size: Some(4) }]

ADLS (Note path is broken for adls I think your PR fixes this particular issue) but the key point here is only 2 items are in the list. [ObjectMeta { path: "adls2://deltarstest/delta-rs-test-1657074779-18132", modified: 2022-07-06T02:33:00Z, size: Some(0) }, ObjectMeta { path: "adls2://deltarstest/delta-rs-test-1657074779-18132", modified: 2022-07-06T02:33:00Z, size: Some(4) }]

Blajda avatar Jul 06 '22 02:07 Blajda

@Blajda - sorry it was a bit too late yesterday, and I had not worked carefully enough :). I updated the PR to recursively traverse the "directories" and also removed the claim that it would fix this issue completely :).

roeap avatar Jul 06 '22 06:07 roeap

I do remember now the source of this confusion ... When implementing the list_dirwe were thinking about the delta log mostly. There we do not need recursive traversals - so both local and azure are not traversing into the directories. Not entirely sure if vacuum is the only one where we actually list the data files - but i "think" it is.

roeap avatar Jul 06 '22 11:07 roeap

sorry it was a bit too late yesterday, and I had not worked carefully enough :)

No worries @roeap. Let me know if I should make a PR for the Integration framework outlined here: https://github.com/delta-io/delta-rs/compare/main...Blajda:delta-rs:backend_issue_tests I think it would be helpful to have write a single test which can be executed by multiple backends to ensure consistent interfaces. Would be useful for future backends.

Blajda avatar Jul 07 '22 03:07 Blajda

That integration framework looks really cool! I think this could be very useful to us!

One ting to consider is, that we are planning to move to a different object_strore implementation all toghether. Datafusion already model to using the object_strore_rs crate (https://github.com/influxdata/object_store_rs), and once we en-par with the autorization options, I think we will start intherating this over here as well. This should give us very consistent behaviour across all object stores as well a an abstration over object store paths.

A framework to set up different environemts for the specific backends is helpful as well .. the more operations we do the better to integration test them all against the various backends ...

roeap avatar Jul 08 '22 12:07 roeap

@Blajda @wjones127 - after merging the vacuum test suite and the preceding PRs, this is resolved, right?

roeap avatar Jul 18 '22 05:07 roeap