Pre-UDF table logic creates unnecessary copies
The pre-UDF logic in https://github.com/iterative/datachain/blob/ee43fd16b751db751a3b70e7833483aea3591232/src/datachain/query/dataset.py#L589-L598 unconditionally copies the input query into a new table, which is expensive and unnecessary in most cases.
For context, this was introduced in https://github.com/iterative/dvcx/pull/1068
If we stop copying the query into a temp table, how will we avoid the issue that https://github.com/iterative/dvcx/pull/1068 fixed?
IIUC, that issue only happens when the result set of the query is non-deterministic, i.e. only when there's a row filter (such as LIMIT) that can select different rows each time it runs. So I can see 2 solutions:
- Detect deterministic cases and skip the copy for them.
- Rely on row ids to avoid running the filter twice: create the UDF table based on the original query, and then inner-join on row ids with an unfiltered version of the query.
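
To make the second option concrete, here is a minimal sketch using plain `sqlite3` rather than datachain's own query layer. Everything in it is an assumption for illustration: the `src` and `udf_out` tables, the `apply_udf_via_row_ids` helper, and `value * 2` as a stand-in for the real UDF. The point is only that the non-deterministic filter runs once, when the UDF table is built, and the final result comes from joining on row ids.

```python
import sqlite3


def apply_udf_via_row_ids(conn: sqlite3.Connection, limit: int) -> list:
    """Run a stand-in UDF over a randomly limited query, evaluating the filter once."""
    # Non-deterministic query: a different subset may be picked on every execution.
    filtered = f"SELECT id, value FROM src ORDER BY random() LIMIT {int(limit)}"

    # Build the UDF table directly from the original query, without first
    # copying the input rows into a separate table.
    # `value * 2` is a hypothetical placeholder for the real UDF output.
    conn.execute("DROP TABLE IF EXISTS udf_out")
    conn.execute(
        f"CREATE TEMP TABLE udf_out AS SELECT id, value * 2 AS doubled FROM ({filtered})"
    )

    # Inner-join the UDF table with the *unfiltered* source on row id,
    # so the filter is never executed a second time.
    return conn.execute(
        "SELECT s.id, s.value, u.doubled FROM src AS s "
        "INNER JOIN udf_out AS u ON u.id = s.id ORDER BY s.id"
    ).fetchall()


conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE src (id INTEGER PRIMARY KEY, value INTEGER)")
conn.executemany("INSERT INTO src (value) VALUES (?)", [(v,) for v in range(10, 20)])
rows = apply_udf_via_row_ids(conn, limit=3)
```

Which three rows come back varies between runs, but every returned row carries the UDF output computed for that same row, since the join key is the row id and not a re-evaluation of the filter.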