Replace deprecated listing / indexing with new `DataChain.from_storage()`

Open ilongin opened this issue 1 year ago • 2 comments

In the related PR https://github.com/iterative/datachain/pull/294, the listing logic will live inside DataChain.from_storage() itself.

We should replace the old listing that is called from the CLI, and possibly other places, with DataChain.from_storage(). There are a lot of tests around listing / indexing, and we should refactor them as well if needed.

As a follow-up, we should remove the old legacy listing codebase (maybe it's better to do this in a separate issue / PR). We should also remove buckets and partials.

Note that we also need to change Catalog.ls_storages to use the new listing datasets, as the bucket table will be removed, as well as partials.

  • [ ] Refactor datachain.Dataset.is_bucket_listing() to not use the old listing check, and maybe remove this method altogether, as the low-level Dataset class should not know about LISTING_PREFIX and similar higher-level abstractions
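A rough sketch of what the checkbox above could look like, not the actual datachain code. The `Dataset` class here is a simplified stand-in, `is_listing_dataset()` and `listing_datasets()` are hypothetical helper names, and the prefix value is assumed for illustration. The idea is that the prefix check lives next to the listing logic, so the low-level class only carries its name:

```python
from dataclasses import dataclass

# Assumption: listing datasets are recognizable by a name prefix.
# The value below is illustrative; the real LISTING_PREFIX may differ.
LISTING_PREFIX = "lst__"


@dataclass
class Dataset:
    """Simplified stand-in for the low-level dataset record.

    It knows its own name and nothing about listings.
    """

    name: str


def is_listing_dataset(dataset: Dataset) -> bool:
    """Hypothetical higher-level check, kept alongside the listing logic."""
    return dataset.name.startswith(LISTING_PREFIX)


def listing_datasets(datasets: list[Dataset]) -> list[Dataset]:
    """Hypothetical helper: keep only the datasets that are listings."""
    return [ds for ds in datasets if is_listing_dataset(ds)]
```

Something like `listing_datasets()` could also be a starting point for Catalog.ls_storages once the bucket table is gone, since it only needs the dataset names.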

ilongin, Aug 19 '24 16:08

I think this should be prioritized, as it's sometimes hard to refactor / change the codebase because this old indexing part needs to be adapted as well. It would have been much easier if it were just refactored to the "new" indexing. Not to mention that the old indexing can only be used for CLI operations: e.g. when we call Catalog.index(...), the listing that's been created cannot be used in DataChain methods.

ilongin, Oct 04 '24 08:10

Agreed. Moved it to the ready column - let's keep simplifying things there.

shcheklein, Oct 04 '24 13:10