dask-sql icon indicating copy to clipboard operation
dask-sql copied to clipboard

Optimize fast codepath for applying limits

Open charlesbluca opened this issue 2 years ago • 2 comments

Often, we will specify a LIMIT on the result of a complex query, which can potentially leave us with a Dask dataframe with several empty partitions (in my case, I encountered this with cross-join operations). In these cases, checking the length of the dataframe's first partition doesn't really give us a good idea of if we can apply a LIMIT on the dataframe with minimal persists using head(), as the first partition actually containing data could be the Nth partition of the dataframe.

This PR resolves this by expanding the partition length check to grab the length of the first nonempty partition; this new check ideally shouldn't result in any significant extra persists, as the only additional dataframes we are loading into memory are empty.

cc @ayushdg @randerzander

charlesbluca avatar Mar 08 '22 18:03 charlesbluca

Also, I'm kinda scratching my head as to why we're doing this check in the first place - perhaps I need to look into Dask's implementation of head(), but I would assume that if we don't have to apply an OFFSET that head() would always be the optimal way to grab the requested data in terms of persists?

EDIT:

Just looked into Dask's head() function now and have a better understanding of what's going on here - think there is probably a relatively simple way to get LIMITS without OFFSETS always working using head()

charlesbluca avatar Mar 08 '22 18:03 charlesbluca

Codecov Report

Merging #419 (f7a8a08) into main (0372ebc) will increase coverage by 0.14%. The diff coverage is 100.00%.

Impacted file tree graph

@@            Coverage Diff             @@
##             main     #419      +/-   ##
==========================================
+ Coverage   89.02%   89.17%   +0.14%     
==========================================
  Files          69       69              
  Lines        3327     3326       -1     
  Branches      654      653       -1     
==========================================
+ Hits         2962     2966       +4     
+ Misses        296      287       -9     
- Partials       69       73       +4     
Impacted Files Coverage Δ
dask_sql/physical/rel/logical/limit.py 92.50% <100.00%> (-0.36%) :arrow_down:
dask_sql/physical/utils/map.py 100.00% <0.00%> (ø)
dask_sql/_version.py 34.00% <0.00%> (+1.44%) :arrow_up:

Continue to review full report at Codecov.

Legend - Click here to learn more Δ = absolute <relative> (impact), ø = not affected, ? = missing data Powered by Codecov. Last update 0372ebc...f7a8a08. Read the comment docs.

codecov-commenter avatar Mar 10 '22 15:03 codecov-commenter

Superceded by #696

charlesbluca avatar Aug 31 '22 13:08 charlesbluca