dask-sql
dask-sql copied to clipboard
Optimize fast codepath for applying limits
Often, we will specify a LIMIT
on the result of a complex query, which can potentially leave us with a Dask dataframe with several empty partitions (in my case, I encountered this with cross-join operations). In these cases, checking the length of the dataframe's first partition doesn't really give us a good idea of if we can apply a LIMIT
on the dataframe with minimal persists using head()
, as the first partition actually containing data could be the Nth partition of the dataframe.
This PR resolves this by expanding the partition length check to grab the length of the first nonempty partition; this new check ideally shouldn't result in any significant extra persists, as the only additional dataframes we are loading into memory are empty.
cc @ayushdg @randerzander
Also, I'm kinda scratching my head as to why we're doing this check in the first place - perhaps I need to look into Dask's implementation of head()
, but I would assume that if we don't have to apply an OFFSET
that head()
would always be the optimal way to grab the requested data in terms of persists?
EDIT:
Just looked into Dask's head()
function now and have a better understanding of what's going on here - think there is probably a relatively simple way to get LIMITS
without OFFSETS
always working using head()
Codecov Report
Merging #419 (f7a8a08) into main (0372ebc) will increase coverage by
0.14%
. The diff coverage is100.00%
.
@@ Coverage Diff @@
## main #419 +/- ##
==========================================
+ Coverage 89.02% 89.17% +0.14%
==========================================
Files 69 69
Lines 3327 3326 -1
Branches 654 653 -1
==========================================
+ Hits 2962 2966 +4
+ Misses 296 287 -9
- Partials 69 73 +4
Impacted Files | Coverage Δ | |
---|---|---|
dask_sql/physical/rel/logical/limit.py | 92.50% <100.00%> (-0.36%) |
:arrow_down: |
dask_sql/physical/utils/map.py | 100.00% <0.00%> (ø) |
|
dask_sql/_version.py | 34.00% <0.00%> (+1.44%) |
:arrow_up: |
Continue to review full report at Codecov.
Legend - Click here to learn more
Δ = absolute <relative> (impact)
,ø = not affected
,? = missing data
Powered by Codecov. Last update 0372ebc...f7a8a08. Read the comment docs.
Superceded by #696