Epic: move indexing to an application layer

Open skshetry opened this issue 1 year ago • 6 comments

Description

I.e., make it based on a feature schema and, if possible, implement it with UDFs.

### Subtasks
- [ ] https://github.com/iterative/datachain/issues/246
- [ ] https://github.com/iterative/datachain/issues/244
- [ ] https://github.com/iterative/datachain/issues/266
- [ ] https://github.com/iterative/datachain/issues/317
- [ ] https://github.com/iterative/datachain/issues/318
- [ ] https://github.com/iterative/datachain/issues/329
- [ ] https://github.com/iterative/datachain/issues/340
- [ ] https://github.com/iterative/datachain/issues/447

skshetry Jun 18 '24 05:06

We need to think about how to deal with the additional tables that are created during indexing, like buckets or partials. So this is not just a normal UDF that outputs some rows into a dataset table; it also needs to insert into the buckets and partials tables. It's easy for us to implement this, but if we want users to implement their own indexing, maybe we need to provide a framework that does this implicitly (users should not have to care about those tables explicitly). WDYT?
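To make the "implicit framework" idea concrete, here is a minimal sketch in plain Python. Everything in it is hypothetical (`FileRow`, `IndexStore`, `run_indexing`, `fake_lister`) and none of it is datachain's actual API or schema; in-memory collections stand in for the dataset, buckets, and partials tables. The point is only that the user-supplied function yields rows while the framework does the bookkeeping.

```python
from dataclasses import dataclass, field
from typing import Callable, Iterable, Iterator


@dataclass
class FileRow:
    # Hypothetical row shape, not datachain's real feature schema.
    source: str
    path: str
    size: int


@dataclass
class IndexStore:
    # In-memory stand-ins for the dataset, buckets, and partials tables.
    rows: list[FileRow] = field(default_factory=list)
    buckets: set[str] = field(default_factory=set)
    partials: set[tuple[str, str]] = field(default_factory=set)


def run_indexing(
    store: IndexStore,
    source: str,
    prefix: str,
    lister: Callable[[str, str], Iterable[FileRow]],
) -> int:
    """Run a user-supplied lister and record the bookkeeping for it."""
    count = 0
    for row in lister(source, prefix):
        store.rows.append(row)
        count += 1
    # Bookkeeping that the user-defined lister never has to touch.
    store.buckets.add(source)
    store.partials.add((source, prefix))
    return count


def fake_lister(source: str, prefix: str) -> Iterator[FileRow]:
    # A user-defined "indexing UDF": it only yields rows.
    listing = {"cats/1.jpg": 10, "cats/2.jpg": 20, "dogs/1.jpg": 30}
    for path, size in listing.items():
        if path.startswith(prefix):
            yield FileRow(source, path, size)
```

With this shape, a custom indexer is just another function with the `lister` signature, and the extra tables stay an implementation detail of `run_indexing`.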

ilongin Jul 22 '24 15:07

I think we should start getting rid of partials. They are too complicated for the value they provide. The same goes for buckets / sources: I would reconsider them and possibly drop them too.

Each path that we pass to from_storage can create a versioned dataset. We can decide to reuse those (as a way to cache things) with some expiration date, etc.
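A rough sketch of what that reuse could look like, in plain Python. The names (`ListingVersion`, `ListingCache`, `get_or_index`) and the TTL-based expiration rule are assumptions for illustration only, not existing datachain code: each URI keeps a history of listing versions, and the latest one is reused until it is older than the TTL.

```python
import time
from dataclasses import dataclass
from typing import Callable, Iterable


@dataclass
class ListingVersion:
    # Hypothetical record for one versioned listing of a storage path.
    version: int
    created_at: float
    paths: list[str]


class ListingCache:
    """Reuse the latest listing of a URI until it is older than the TTL."""

    def __init__(self, ttl_seconds: float, clock: Callable[[], float] = time.time):
        self.ttl = ttl_seconds
        self.clock = clock
        self.versions: dict[str, list[ListingVersion]] = {}

    def get_or_index(
        self, uri: str, lister: Callable[[str], Iterable[str]]
    ) -> ListingVersion:
        history = self.versions.setdefault(uri, [])
        now = self.clock()
        if history and now - history[-1].created_at < self.ttl:
            # Fresh enough: reuse the cached version, no re-listing.
            return history[-1]
        # Missing or expired: list again and store a new version.
        fresh = ListingVersion(len(history) + 1, now, list(lister(uri)))
        history.append(fresh)
        return fresh
```

Old versions are kept rather than overwritten, so an expired listing is still available as an earlier version of the same dataset.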

What are the major things we are losing by getting rid of buckets, sources, and partials?

shcheklein Jul 22 '24 15:07

Partials are needed to be able to index part of a bucket and to avoid re-indexing subdirectories. I have a feeling, though, that this can all be done even without the partials table, just on the fly, but this needs to be investigated.
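As a sketch of the "on the fly" variant: given the prefixes that already have a listing, a requested path needs no re-indexing if some indexed prefix is an ancestor of it. The helper names below (`_normalize`, `covering_prefix`) are made up for illustration, and this is untested against real storage layouts.

```python
from typing import Iterable, Optional


def _normalize(prefix: str) -> str:
    # Force a trailing slash so "data/cats" does not match "data/catsup".
    prefix = prefix.strip("/")
    return prefix + "/" if prefix else ""


def covering_prefix(requested: str, indexed: Iterable[str]) -> Optional[str]:
    """Return the broadest indexed prefix that contains `requested`, if any."""
    want = _normalize(requested)
    best: Optional[str] = None
    for candidate in indexed:
        have = _normalize(candidate)
        if want.startswith(have) and (best is None or len(have) < len(best)):
            best = have
    return best
```

A result of `None` means the subdirectory has to be listed; any other result (including the empty string, which stands for the whole bucket) means an existing listing can be filtered instead.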

ilongin Jul 22 '24 16:07

> I think we should start getting rid of partials. T

and

> that this can all be done even without that partials table, just on the fly but this needs to be investigated.

Both are good ideas! Let's try to simplify this as much as we can.

> We need to think how to deal with additional tables that are created during indexing, like buckets or partials. So this is not just normal UDF that has an output of some rows in a dataset table

Right. We need to find a way to fit the buckets (as well as partials, if needed) into "just normal UDF" and normal datasets. I hope these datasets won't be visible to users (by default).

dmpetrov Jul 30 '24 18:07

Prioritizing this. It's an epic. Need to add first steps.

shcheklein Jul 31 '24 16:07

I can take over this one and make a plan / subtasks

ilongin Jul 31 '24 18:07