ibis icon indicating copy to clipboard operation
ibis copied to clipboard

perf(dask): avoid triggering compute for dynamic limit/offset

Open kszucs opened this issue 1 year ago • 0 comments

You should be able to avoid the extra compute here by doing a map_partitions call of a function that applies the between operation below. This will avoid excess compute if the values of n/offset are dependent on anything in df.

Something like (untested):

# define this at the top-level for cheaper pickling
def limit_chunk(df, col, n, offset):
    n = n.iat[0, 0]
    offset = offset.iat[0, 0]
    if n is None:
        return df[df[col] >= offset]
    else:
        return df[df[col].between(offset, offset + n - 1)]

# then in the visit function, something like:
return df.map_partitions(limit, col=name, n=n, offset=offset, align_partitions=False, meta=df._meta).drop(columns=[name])

Originally posted by @jcrist in https://github.com/ibis-project/ibis/pull/8005#discussion_r1473081193

kszucs avatar Feb 01 '24 19:02 kszucs