datachain Epic: File IO in application level

### Must have for the release
- [ ] iterative/dvcx#1635
- [ ] iterative/datachain#32
- [ ] iterative/datachain#34
- [ ] iterative/datachain#35
- [ ] iterative/datachain#36
- [ ] iterative/datachain#37
- [ ] https://github.com/iterative/datachain/issues/212

### Nice-to-have for the release
- [ ] iterative/datachain#38
- [ ] iterative/datachain#39
- [ ] iterative/datachain#40
- [ ] File diff

### Next steps
- [ ] Data upload
- [ ] iterative/dvcx#1358 (re-implemented async)

Jun 27 '24 00:06 dmpetrov

Do we have an idea of how we are going to do indexing?

I'd really love to get rid of the file_columns().

https://github.com/iterative/dvcx/blob/4661a0b42b518478f0c03b4728a41652b3ea967f/src/datachain/data_storage/schema.py#L241-L257

Jul 08 '24 12:07 skshetry

> Do we have an idea of how we are going to do indexing?

The general idea is that it'll use File objects and behave like existing generators, but the details are still TBD.

Jul 08 '24 13:07 rlamy

I am sorry. By indexing, I meant database indexing (index=True) options, not storage indexing.

Jul 08 '24 17:07 skshetry

> I'd really love to get rid of the file_columns().

💯 we need to do that!

Re the indexing... are you asking if we need indexes in the DB? No, we don't need indexes. I'd only suggest creating an index for the mandatory column sys_id - it will speed up the joins after map().

In general (in analytical use cases), users don't know what they will be querying next, so indexes are not helpful. This is also a reason why data warehouses are a better fit for crunching data, and for us in particular (DWs do not have indexes).
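To illustrate the `sys_id` point, here is a minimal sketch using the standard-library `sqlite3` module. The table names (`files`, `map_output`), the `size_kb` column, and the index name are hypothetical and are not taken from the datachain schema. It only shows a single index on the join key and prints the query plan so you can see whether the planner picks it up.

```python
import sqlite3

# Hypothetical tables: `files` stands in for a dataset, `map_output`
# for the rows produced by a map() step. Both share the sys_id key.
conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    CREATE TABLE files (sys_id INTEGER, path TEXT);
    CREATE TABLE map_output (sys_id INTEGER, size_kb INTEGER);
    -- The only index: on the mandatory join key.
    CREATE INDEX idx_map_output_sys_id ON map_output (sys_id);
    """
)
conn.executemany(
    "INSERT INTO files VALUES (?, ?)",
    [(1, "a.jpg"), (2, "b.jpg"), (3, "c.jpg")],
)
conn.executemany(
    "INSERT INTO map_output VALUES (?, ?)",
    [(1, 10), (2, 20), (3, 30)],
)

JOIN_SQL = """
    SELECT f.path, m.size_kb
    FROM files AS f
    JOIN map_output AS m ON m.sys_id = f.sys_id
    ORDER BY f.sys_id
"""

# Join the map() results back to the original rows by sys_id.
rows = conn.execute(JOIN_SQL).fetchall()

# Names of the indexes that exist on map_output.
index_names = [
    name
    for (name,) in conn.execute(
        "SELECT name FROM sqlite_master "
        "WHERE type = 'index' AND tbl_name = 'map_output'"
    )
]

# Print the plan for inspection; the exact wording depends on the SQLite version.
for step in conn.execute("EXPLAIN QUERY PLAN " + JOIN_SQL):
    print(step)
```

No other column is indexed here, which matches the idea that ad-hoc analytical queries rarely benefit from indexes chosen in advance.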

Jul 08 '24 17:07 dmpetrov

@rlamy please note, I transferred all the file issues to this repository.

Jul 13 '24 05:07 dmpetrov

@dreadatour @dmpetrov can you please add a bit more detail on diff and on recent examples from customers? It is the last item left here.

Nov 26 '24 05:11 shcheklein

@shcheklein I spoke with Dmitry and we decided that diff should be like this: https://github.com/iterative/datachain/issues/636 ... I will start working on it soon.

Nov 26 '24 10:11 ilongin

This epic seems done 🎉 Closing

Dec 24 '24 03:12 dmpetrov

Let's work on diff() and data upload separately

Dec 24 '24 03:12 dmpetrov
