datachain Pre-UDF table logic creates unnecessary copies

The pre-UDF logic in https://github.com/iterative/datachain/blob/ee43fd16b751db751a3b70e7833483aea3591232/src/datachain/query/dataset.py#L589-L598 unconditionally copies the input query into a new table, which is expensive and useless in most cases.

For context, this was introduced in https://github.com/iterative/dvcx/pull/1068

Sep 18 '24 18:09 rlamy

How will we avoid the issue that was fixed in https://github.com/iterative/dvcx/pull/1068 if we no longer copy the query to a temp table?

Sep 18 '24 23:09 ilongin

IIUC, that issue only happens when the result-set of the query is non-deterministic, i.e. only when there's a non-deterministic filter on the rows (such as LIMIT). So I can see 2 solutions:

1. Detect deterministic cases and avoid copying in those cases.
2. Rely on row ids to avoid running the filter twice: create the UDF table based on the original query, and then inner-join on row ids with an unfiltered version of the query.
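
To make the second option concrete, here is a rough sketch using the stdlib `sqlite3` module rather than datachain's actual query machinery. The table and column names (`files`, `row_id`, `udf_out`) and both helper functions are hypothetical and only for illustration. The non-deterministic query (`ORDER BY random() LIMIT ?`) is evaluated exactly once, only `(row_id, result)` pairs are stored, and the full rows are recovered by an inner join against the unfiltered table.

```python
import sqlite3


def make_demo_db():
    """Build an in-memory table standing in for the input dataset (hypothetical schema)."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE files (row_id INTEGER PRIMARY KEY, path TEXT)")
    conn.executemany(
        "INSERT INTO files (path) VALUES (?)",
        [(f"file_{i}.txt",) for i in range(10)],
    )
    return conn


def run_udf_by_row_id(conn, udf, limit):
    """Apply `udf` to a non-deterministic subset of rows without copying the input."""
    # Evaluate the filtered (non-deterministic) query exactly once.
    selected = conn.execute(
        "SELECT row_id, path FROM files ORDER BY random() LIMIT ?", (limit,)
    ).fetchall()

    # The UDF table holds only row ids and UDF outputs, not a copy of the input columns.
    conn.execute("CREATE TEMP TABLE udf_out (row_id INTEGER PRIMARY KEY, result TEXT)")
    conn.executemany(
        "INSERT INTO udf_out VALUES (?, ?)",
        [(row_id, udf(path)) for row_id, path in selected],
    )

    # Inner-join with the unfiltered table: the row ids pin down the same rows
    # that the filter picked, so the filter is never re-run.
    return conn.execute(
        "SELECT f.row_id, f.path, u.result "
        "FROM files AS f JOIN udf_out AS u ON u.row_id = f.row_id "
        "ORDER BY f.row_id"
    ).fetchall()


if __name__ == "__main__":
    for row in run_udf_by_row_id(make_demo_db(), str.upper, limit=3):
        print(row)
```

The trade-off in this sketch is that the UDF table still has to be materialized, but it only carries ids and outputs, so the cost no longer scales with the width of the input rows.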

Sep 19 '24 10:09 rlamy