
Pre-UDF table logic creates unnecessary copies

Open rlamy opened this issue 1 year ago • 2 comments

The pre-UDF logic in https://github.com/iterative/datachain/blob/ee43fd16b751db751a3b70e7833483aea3591232/src/datachain/query/dataset.py#L589-L598 unconditionally copies the input query into a new table, which is expensive and useless in most cases.

For context, this was introduced in https://github.com/iterative/dvcx/pull/1068

rlamy, Sep 18 '24 18:09

If we stop copying the query into a temp table, how do we avoid the issue that was fixed in https://github.com/iterative/dvcx/pull/1068 (the PR you mentioned)?

ilongin, Sep 18 '24 23:09

IIUC, that issue only happens when the result-set of the query is non-deterministic, i.e. only when there's a non-deterministic filter on the rows (such as LIMIT). So I can see 2 solutions:

  • Detect deterministic cases and avoid copying in that case.
  • Rely on row ids to avoid running the filter twice: create the UDF table based on the original query, and then inner-join on row ids with an unfiltered version of the query.
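To make the second option concrete, here is a minimal sketch of the row-id idea using the standard-library `sqlite3` module rather than datachain's actual query layer. The table names `src` and `udf_out` and the helper `run_udf_with_row_id_join` are hypothetical and exist only for this illustration. The filtered, non-deterministic query is evaluated exactly once to build the UDF table, and the final result comes from inner-joining that table with the unfiltered source on row id, so the filter never runs a second time.

```python
import sqlite3


def run_udf_with_row_id_join(conn, filtered_sql, udf):
    """Hypothetical helper: apply `udf` to the rows picked by `filtered_sql`,
    then join the results back onto the unfiltered `src` table by row id."""
    # UDF output table, keyed by the row id of the source table.
    conn.execute(
        "CREATE TEMP TABLE udf_out (id INTEGER PRIMARY KEY, result TEXT)"
    )

    # Evaluate the (possibly non-deterministic) filtered query exactly once.
    rows = conn.execute(filtered_sql).fetchall()
    conn.executemany(
        "INSERT INTO udf_out (id, result) VALUES (?, ?)",
        [(row_id, udf(name)) for row_id, name in rows],
    )

    # Inner-join with the *unfiltered* source: the set of ids stored in
    # udf_out selects the rows, so the filter is not evaluated again.
    return conn.execute(
        "SELECT s.id, s.name, u.result "
        "FROM src AS s JOIN udf_out AS u ON u.id = s.id "
        "ORDER BY s.id"
    ).fetchall()


conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE src (id INTEGER PRIMARY KEY, name TEXT)")
conn.executemany(
    "INSERT INTO src (name) VALUES (?)",
    [(f"file{i}",) for i in range(10)],
)

# ORDER BY random() LIMIT 3 stands in for a non-deterministic filter.
result = run_udf_with_row_id_join(
    conn,
    "SELECT id, name FROM src ORDER BY random() LIMIT 3",
    str.upper,
)
```

In this sketch no copy of the input rows is made; only the ids and UDF results are materialized, which is the saving the second option is after. Whether the same shape maps cleanly onto datachain's query builder is an open question here.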

rlamy, Sep 19 '24 10:09