Epic: File IO in application level

Open dmpetrov opened this issue 1 year ago • 5 comments

### Must have for the release
- [ ] iterative/dvcx#1635
- [ ] iterative/datachain#32
- [ ] iterative/datachain#34
- [ ] iterative/datachain#35
- [ ] iterative/datachain#36
- [ ] iterative/datachain#37
- [ ] https://github.com/iterative/datachain/issues/212
### Nice-to-have for the release
- [ ] iterative/datachain#38
- [ ] iterative/datachain#39
- [ ] iterative/datachain#40
- [ ] File diff
### Next steps
- [ ] Data upload
- [ ] iterative/dvcx#1358 (re-implemented async)

dmpetrov avatar Jun 27 '24 00:06 dmpetrov

Do we have an idea of how we are going to do indexing?

I'd really love to get rid of `file_columns()`.

https://github.com/iterative/dvcx/blob/4661a0b42b518478f0c03b4728a41652b3ea967f/src/datachain/data_storage/schema.py#L241-L257

skshetry avatar Jul 08 '24 12:07 skshetry

> Do we have an idea of how we are going to do indexing?

The general idea is that it'll use File objects and behave like existing generators, but the details are still TBD.

rlamy avatar Jul 08 '24 13:07 rlamy

Sorry, by indexing I meant the database indexing options (`index=True`), not storage indexing.

skshetry avatar Jul 08 '24 17:07 skshetry

> I'd really love to get rid of `file_columns()`.

💯 we need to do that!

Re indexing: are you asking whether we need indexes in the DB? No, we don't. I'd only suggest creating an index on the mandatory `sys_id` column, since it will speed up the joins after `map()`.

In general (in analytical use cases), users don't know what they will be querying next, so indexes are not helpful. This is also one reason why data warehouses are a better fit for crunching data, and for us in particular (DWs do not have indexes).
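For illustration only, here is a minimal sketch of the "index just `sys_id`" idea. It uses the standard-library `sqlite3` module and is not datachain's storage layer. The table names `files` and `map_output`, and every column except `sys_id`, are hypothetical stand-ins for a dataset table and the result of a `map()` step.

```python
import sqlite3


def build_demo(conn: sqlite3.Connection, n_rows: int = 1000) -> None:
    """Create two hypothetical tables that share a sys_id column."""
    conn.execute("CREATE TABLE files (sys_id INTEGER, path TEXT, size INTEGER)")
    conn.execute("CREATE TABLE map_output (sys_id INTEGER, label TEXT)")
    conn.executemany(
        "INSERT INTO files VALUES (?, ?, ?)",
        ((i, f"dir/file_{i}.bin", i * 10) for i in range(n_rows)),
    )
    conn.executemany(
        "INSERT INTO map_output VALUES (?, ?)",
        ((i, "even" if i % 2 == 0 else "odd") for i in range(n_rows)),
    )
    # The only index: the join key. No indexes on path, size or label,
    # since we cannot predict which columns will be filtered on next.
    conn.execute("CREATE INDEX idx_files_sys_id ON files (sys_id)")


JOIN_SQL = (
    "SELECT f.path, m.label "
    "FROM map_output AS m JOIN files AS f ON f.sys_id = m.sys_id "
    "ORDER BY m.sys_id"
)


def join_after_map(conn: sqlite3.Connection) -> list:
    """Join the mapped values back to their source rows via sys_id."""
    return conn.execute(JOIN_SQL).fetchall()


def query_plan(conn: sqlite3.Connection) -> list:
    """Return the planner's description of the join, one string per step."""
    rows = conn.execute("EXPLAIN QUERY PLAN " + JOIN_SQL).fetchall()
    return [row[-1] for row in rows]


if __name__ == "__main__":
    connection = sqlite3.connect(":memory:")
    build_demo(connection)
    print(len(join_after_map(connection)), "joined rows")
    for step in query_plan(connection):
        print(step)
```

Printing the query plan is a quick way to check whether the planner actually picks up the `sys_id` index for the join. The exact plan text depends on the SQLite version.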

dmpetrov avatar Jul 08 '24 17:07 dmpetrov

@rlamy please note, I transferred all the file issues to this repository.

dmpetrov avatar Jul 13 '24 05:07 dmpetrov

@dreadatour @dmpetrov could you please add a bit more detail on diff and on recent examples from customers? It is the last item left here.

shcheklein avatar Nov 26 '24 05:11 shcheklein

@shcheklein I spoke with Dmitry and we decided that diff should be like this: https://github.com/iterative/datachain/issues/636 ... I will start working on it soon.

ilongin avatar Nov 26 '24 10:11 ilongin

This epic seems done 🎉 Closing

dmpetrov avatar Dec 24 '24 03:12 dmpetrov

Let's work on diff() and data upload separately

dmpetrov avatar Dec 24 '24 03:12 dmpetrov