lance feat: add distribute add columns by ray

https://github.com/lancedb/lance/issues/3228

Jan 11 '25 10:01 Jay-ju

Hi @Jay-ju can you add some doc about this pr?

Jan 20 '25 03:01 SaintBacchus

error_linux_39_x86.log

Jan 24 '25 00:01 LuQQiu

Is there a reason why this can't be as simple as https://github.com/dream3d-ai/lance/pull/1/files

Feb 12 '25 17:02 tonyf

Is there a reason why this can't be as simple as https://github.com/dream3d-ai/lance/pull/1/files

@tonyf If there is only one add_column, your implementation seems to be fine. However, for other tasks such as deleting data that require distribution, implementing a class like LanceMergeColumns for each of them. Moreover, this implementation actually cannot reuse the datasource and datasink interfaces of Ray. The reason for not being able to reuse them is that add_column does not need the process of reading data from Lance in the datasource. Therefore, here we want to implement a specialized custom task that can achieve this type of task.

Feb 24 '25 06:02 Jay-ju

Is there a reason why this can't be as simple as https://github.com/dream3d-ai/lance/pull/1/files

@tonyf If there is only one add_column, your implementation seems to be fine. However, for other tasks such as deleting data that require distribution, implementing a class like LanceMergeColumns for each of them. Moreover, this implementation actually cannot reuse the datasource and datasink interfaces of Ray. The reason for not being able to reuse them is that add_column does not need the process of reading data from Lance in the datasource. Therefore, here we want to implement a specialized custom task that can achieve this type of task.

Well, it does re-use the datasink but I wonder if it actually makes sense to re-use the datasource. The existing datasource interface yields rows for processing. This is makes sense in ray world because datasets are never modified in-place. Writing requires you to always write to a new dataset. However, in lance, operations like merge & delete happen on a fragment level so breaking things out into rows, modifying them and then merging them back is non-sensical. But the api would make you believe its possible. Instead, by introducing a datasource that yields fragments this (1) makes it clear that modifications must occur on fragments and (2) has the added benefit of being able to cache progress in the case of task failures. Multiple add_columns are possible as the modified fragment & new schema are yielded to whoever consumes it. And at the end, a single commit is created for the entire modification of the ds.

Just adding my two cents as we've found ray support really crucial for working with lance at scale & think its worth digging into what abstractions make sense here.

Mar 04 '25 02:03 tonyf

Is there a reason why this can't be as simple as https://github.com/dream3d-ai/lance/pull/1/files

@tonyf If there is only one add_column, your implementation seems to be fine. However, for other tasks such as deleting data that require distribution, implementing a class like LanceMergeColumns for each of them. Moreover, this implementation actually cannot reuse the datasource and datasink interfaces of Ray. The reason for not being able to reuse them is that add_column does not need the process of reading data from Lance in the datasource. Therefore, here we want to implement a specialized custom task that can achieve this type of task.

Well, it does re-use the datasink but I wonder if it actually makes sense to re-use the datasource. The existing datasource interface yields rows for processing. This is makes sense in ray world because datasets are never modified in-place. Writing requires you to always write to a new dataset. However, in lance, operations like merge & delete happen on a fragment level so breaking things out into rows, modifying them and then merging them back is non-sensical. But the api would make you believe its possible. Instead, by introducing a datasource that yields fragments this (1) makes it clear that modifications must occur on fragments and (2) has the added benefit of being able to cache progress in the case of task failures. Multiple add_columns are possible as the modified fragment & new schema are yielded to whoever consumes it. And at the end, a single commit is created for the entire modification of the ds.

Just adding my two cents as we've found ray support really crucial for working with lance at scale & think its worth digging into what abstractions make sense here.

Yes, my purpose is actually to add a fragment's task interface in ray, not all at the row level. Directly reusing the data source is not very possible. Therefore, an in-place interface is added here, and there is no need to yield row tasks.

Mar 11 '25 01:03 Jay-ju

@Jay-ju what's the status of this PR, is it ready to be merged? Any other change do you plan to make? Community user is looking for this feature: https://discord.com/channels/1030247538198061086/1197630499926057021/1354385990738776175

Mar 26 '25 16:03 LuQQiu

@westonpace The comment has been fixed. Please merge it.

Apr 01 '25 01:04 Jay-ju

Done. Thanks again!

Apr 01 '25 11:04 westonpace