Daft supports the distributed merge_columns/create_index of Lance
Is your feature request related to a problem?
Currently, Daft only supports distributed read and write of Lance. Lance's own merge_columns feature is very useful for column-addition scenarios. It should be emphasized that the add_column discussed here is not Daft's with_columns: with_columns is an in-memory operation, while add_column is persisted into the Lance dataset itself.
I would like to propose the addition of this feature here. Preliminarily, there are several ideas as follows:
- Implement merge_columns as a mode of write_lance, e.g. write_lance(operation=append/create/merge). If the operation is merge, compare the schema of the dataframe against the schema of the Lance dataset; the newly added columns are then written distributedly via merge_columns. This is semantically a bit odd, but Lance itself can uniformly commit operations such as append/create/merge/update.
- Add a task framework, i.e. support a fixed workflow that encapsulates several execution paradigms. The prerequisite here is support for operators like map/map_batches.
merge_columns template (pseudocode)
ds = daft.from_list([lance.fragment_ids])
ds = ds.map(lambda fid: process(fid, merge_columns))
ds.collect()
create_index template (pseudocode)
ds = daft.from_list([lance.fragment_ids])
ds = ds.map(lambda fid: process(fid, create_index))
ds.collect()
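To make the second idea concrete, here is a minimal self-contained Python sketch of the pattern the templates describe: diff the dataframe schema against the dataset schema to find new columns, fan a per-fragment merge out as one task per fragment, and leave a single final commit to the driver. All names here (new_columns, merge_fragment, the plain schemas) are hypothetical stand-ins, not real Daft or Lance APIs; the real per-fragment call would be Lance's fragment-level merge, and the map would run distributed in Daft.

```python
def new_columns(df_schema: dict, dataset_schema: dict) -> list:
    """Columns present in the dataframe but not yet in the dataset."""
    return [c for c in df_schema if c not in dataset_schema]

def merge_fragment(fragment_id: int, columns: list) -> dict:
    """Hypothetical stand-in for a per-fragment merge_columns call;
    returns the fragment metadata a final commit would need."""
    return {"fragment_id": fragment_id, "added": columns}

# Stand-ins for the Lance dataset schema and the Daft dataframe schema.
dataset_schema = {"id": "int64", "text": "string"}
df_schema = {"id": "int64", "text": "string", "embedding": "vector"}

cols = new_columns(df_schema, dataset_schema)  # only "embedding" is new

# In Daft this map would run distributed, one task per fragment;
# here it is simulated with a plain list comprehension over fragment ids.
results = [merge_fragment(fid, cols) for fid in [0, 1, 2]]

# A single commit on the driver (e.g. a Merge operation) would then
# apply `results` to the dataset atomically.
```

The key property this pattern relies on is that the per-fragment merges are independent, so only the final commit needs coordination.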
Describe the solution you'd like
As described above.
Describe alternatives you've considered
No response
Additional Context
No response
Would you like to implement a fix?
No
Sounds interesting, would def be something we would be looking help for!
@srilman I would really like to hear your suggestions here. This is also what I have been working on recently.