datachain
Epic: File IO at the application level
### Must have for the release
- [ ] iterative/dvcx#1635
- [ ] iterative/datachain#32
- [ ] iterative/datachain#34
- [ ] iterative/datachain#35
- [ ] iterative/datachain#36
- [ ] iterative/datachain#37
- [ ] https://github.com/iterative/datachain/issues/212
### Nice-to-have for the release
- [ ] iterative/datachain#38
- [ ] iterative/datachain#39
- [ ] iterative/datachain#40
- [ ] File diff
### Next steps
- [ ] Data upload
- [ ] iterative/dvcx#1358 (re-implemented async)
Do we have an idea of how we are going to do indexing?
I'd really love to get rid of file_columns().
https://github.com/iterative/dvcx/blob/4661a0b42b518478f0c03b4728a41652b3ea967f/src/datachain/data_storage/schema.py#L241-L257
> Do we have an idea of how we are going to do indexing?
The general idea is that it'll use File objects and behave like existing generators, but the details are still TBD.
Sorry, by indexing I meant database indexes (the index=True option), not storage indexing.
> I'd really love to get rid of file_columns().
💯 we need to do that!
Re indexing: are you asking whether we need indexes in the DB? In general, no. I'd only suggest creating an index on the mandatory sys_id column, since it will speed up the joins after map().
In general, in analytical use cases users don't know what they will be querying next, so indexes are not helpful. This is also one reason why data warehouses are a better fit for crunching data, and for us in particular (DWs do not have indexes).
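To make the sys_id suggestion concrete, here is a minimal sketch using Python's built-in sqlite3 module. It is not datachain's actual schema code: the table names `dataset_rows` and `map_output`, the columns `path` and `size_kb`, and the index name are hypothetical. It only illustrates indexing the join key and leaving the user-defined columns unindexed.

```python
import sqlite3

# Hypothetical tables: `dataset_rows` stands in for a dataset and
# `map_output` for the result of a map() step. Both carry sys_id.
conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    CREATE TABLE dataset_rows (sys_id INTEGER NOT NULL, path TEXT);
    CREATE TABLE map_output (sys_id INTEGER NOT NULL, size_kb INTEGER);
    -- The only index: on sys_id, the join key. User columns stay unindexed.
    CREATE INDEX idx_map_output_sys_id ON map_output (sys_id);
    """
)
conn.executemany(
    "INSERT INTO dataset_rows VALUES (?, ?)",
    [(1, "a.jpg"), (2, "b.jpg"), (3, "c.jpg")],
)
conn.executemany(
    "INSERT INTO map_output VALUES (?, ?)",
    [(1, 10), (2, 20), (3, 30)],
)

# Join the map() output back to the original rows on sys_id.
JOIN_SQL = """
    SELECT r.sys_id, r.path, m.size_kb
    FROM dataset_rows AS r
    JOIN map_output AS m ON m.sys_id = r.sys_id
    ORDER BY r.sys_id
"""
joined = conn.execute(JOIN_SQL).fetchall()

# List the indexes that exist on map_output (column 1 is the index name).
index_names = [row[1] for row in conn.execute("PRAGMA index_list('map_output')")]

# Print the query plan to inspect whether the planner uses the index.
for step in conn.execute("EXPLAIN QUERY PLAN " + JOIN_SQL):
    print(step)
```

The plan output depends on the SQLite version and planner, so it is printed here for inspection rather than assumed. The same idea would apply to any backend that supports indexes.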
@rlamy please note, I transferred all the file issues to this repository.
@dreadatour @dmpetrov could you please add a few more details on diff and on recent examples from customers? It is the last item left here.
@shcheklein I spoke with Dmitry and we decided that diff should be like this: https://github.com/iterative/datachain/issues/636 ... I will start working on it soon.
This epic seems done 🎉 Closing
Let's work on diff() and data upload separately