lance icon indicating copy to clipboard operation
lance copied to clipboard

feat: add distribute add columns by ray

Open Jay-ju opened this issue 11 months ago • 4 comments

https://github.com/lancedb/lance/issues/3228

Jay-ju avatar Jan 11 '25 10:01 Jay-ju

Hi @Jay-ju can you add some doc about this pr?

SaintBacchus avatar Jan 20 '25 03:01 SaintBacchus

Is there a reason why this can't be as simple as https://github.com/dream3d-ai/lance/pull/1/files

tonyf avatar Feb 12 '25 17:02 tonyf

Is there a reason why this can't be as simple as https://github.com/dream3d-ai/lance/pull/1/files

@tonyf If there is only one add_column, your implementation seems to be fine. However, for other tasks such as deleting data that require distribution, implementing a class like LanceMergeColumns for each of them. Moreover, this implementation actually cannot reuse the datasource and datasink interfaces of Ray. The reason for not being able to reuse them is that add_column does not need the process of reading data from Lance in the datasource. Therefore, here we want to implement a specialized custom task that can achieve this type of task.

Jay-ju avatar Feb 24 '25 06:02 Jay-ju

Is there a reason why this can't be as simple as https://github.com/dream3d-ai/lance/pull/1/files

@tonyf If there is only one add_column, your implementation seems to be fine. However, for other tasks such as deleting data that require distribution, implementing a class like LanceMergeColumns for each of them. Moreover, this implementation actually cannot reuse the datasource and datasink interfaces of Ray. The reason for not being able to reuse them is that add_column does not need the process of reading data from Lance in the datasource. Therefore, here we want to implement a specialized custom task that can achieve this type of task.

Well, it does re-use the datasink but I wonder if it actually makes sense to re-use the datasource. The existing datasource interface yields rows for processing. This is makes sense in ray world because datasets are never modified in-place. Writing requires you to always write to a new dataset. However, in lance, operations like merge & delete happen on a fragment level so breaking things out into rows, modifying them and then merging them back is non-sensical. But the api would make you believe its possible. Instead, by introducing a datasource that yields fragments this (1) makes it clear that modifications must occur on fragments and (2) has the added benefit of being able to cache progress in the case of task failures. Multiple add_columns are possible as the modified fragment & new schema are yielded to whoever consumes it. And at the end, a single commit is created for the entire modification of the ds.

Just adding my two cents as we've found ray support really crucial for working with lance at scale & think its worth digging into what abstractions make sense here.

tonyf avatar Mar 04 '25 02:03 tonyf

Is there a reason why this can't be as simple as https://github.com/dream3d-ai/lance/pull/1/files

@tonyf If there is only one add_column, your implementation seems to be fine. However, for other tasks such as deleting data that require distribution, implementing a class like LanceMergeColumns for each of them. Moreover, this implementation actually cannot reuse the datasource and datasink interfaces of Ray. The reason for not being able to reuse them is that add_column does not need the process of reading data from Lance in the datasource. Therefore, here we want to implement a specialized custom task that can achieve this type of task.

Well, it does re-use the datasink but I wonder if it actually makes sense to re-use the datasource. The existing datasource interface yields rows for processing. This is makes sense in ray world because datasets are never modified in-place. Writing requires you to always write to a new dataset. However, in lance, operations like merge & delete happen on a fragment level so breaking things out into rows, modifying them and then merging them back is non-sensical. But the api would make you believe its possible. Instead, by introducing a datasource that yields fragments this (1) makes it clear that modifications must occur on fragments and (2) has the added benefit of being able to cache progress in the case of task failures. Multiple add_columns are possible as the modified fragment & new schema are yielded to whoever consumes it. And at the end, a single commit is created for the entire modification of the ds.

Just adding my two cents as we've found ray support really crucial for working with lance at scale & think its worth digging into what abstractions make sense here.

Yes, my purpose is actually to add a fragment's task interface in ray, not all at the row level. Directly reusing the data source is not very possible. Therefore, an in-place interface is added here, and there is no need to yield row tasks.

Jay-ju avatar Mar 11 '25 01:03 Jay-ju

@Jay-ju what's the status of this PR, is it ready to be merged? Any other change do you plan to make? Community user is looking for this feature: https://discord.com/channels/1030247538198061086/1197630499926057021/1354385990738776175

LuQQiu avatar Mar 26 '25 16:03 LuQQiu

@westonpace The comment has been fixed. Please merge it.

Jay-ju avatar Apr 01 '25 01:04 Jay-ju

Done. Thanks again!

westonpace avatar Apr 01 '25 11:04 westonpace