Large dataset Plot loads all partitions
Issue:
By default, hvPlot loads all partitions into memory; see process_intake().
Proposed solution:
Pass two extra parameters (in addition to use_dask) to hvPlot.process_intake(): read_partition and dask_sample.
- dask_sample: fraction of the data to sample, a value between 0 and 1.
- read_partition: index of the partition to read.
The contents of process_intake can be updated to:
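    # todo: argument validation
    if dask_sample:
        # Sample a fraction of the rows lazily through dask.
        data = data.to_dask().sample(frac=float(dask_sample), replace=True)
    elif use_dask:
        data = data.to_dask()
    elif read_partition is not None:  # partition 0 is a valid index
        data = data.read_partition(read_partition)
    else:
        # Default: load every partition into memory.
        data = data.read()
    return data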
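For the argument validation left as a todo, here is a minimal sketch. It assumes that read_partition should not be combined with the dask options and that dask_sample must lie in (0, 1]; both rules, and the helper name validate_intake_args, are assumptions for illustration and not existing hvPlot behavior.

```python
def validate_intake_args(use_dask=False, read_partition=None, dask_sample=None):
    """Hypothetical helper: check the proposed process_intake() options.

    Returns the loading mode that the branch order in the proposal
    would pick: "sample", "dask", "partition", or "read".
    """
    if dask_sample is not None:
        frac = float(dask_sample)
        # The proposal says the value lies between 0 and 1.
        if not 0 < frac <= 1:
            raise ValueError(f"dask_sample must be in (0, 1], got {frac}")
    if read_partition is not None and (use_dask or dask_sample is not None):
        # Assumption: reading a single partition excludes the dask paths.
        raise ValueError(
            "read_partition cannot be combined with use_dask or dask_sample"
        )
    if dask_sample is not None:
        return "sample"
    if use_dask:
        return "dask"
    if read_partition is not None:
        return "partition"
    return "read"
```

Calling this at the top of process_intake() would reject contradictory combinations before any data is read.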
Related:
- Related hvPlot proposal: https://github.com/pyviz/hvplot/issues/290
- Add "sample" and "plot" to GUI: https://github.com/intake/intake/issues/182
- intake select a subset: https://github.com/pyviz/hvplot/issues/72
> dask_sample: value would be between 0 and 1.
Could we generalize this? AFAIK pandas also supports .sample(frac=...), so could we just call it sample_frac and have it apply to both?
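Since pandas and dask dataframes both expose .sample(frac=...), a single sample_frac option could be applied without caring which one was loaded. A rough sketch, where apply_sample_frac is a hypothetical helper and not part of hvPlot or Intake:

```python
def apply_sample_frac(data, sample_frac=None):
    """Hypothetical helper: sample any frame-like object by fraction.

    Relies only on data.sample(frac=...), which pandas and dask
    dataframes both provide.
    """
    if sample_frac is None:
        # No sampling requested: hand the data back untouched.
        return data
    return data.sample(frac=float(sample_frac))
```

With a dask dataframe the sample stays lazy until the plot computes; with pandas it is taken immediately.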
I have exactly this in my code, but it is in dfviz. It could indeed be something done by Intake before passing to hvPlot, or within hvPlot - or it may be good enough where it is (because the user is always free to get the dataframe and sample however they like). I'm not sure that it matters.
The benefit of putting it in hvPlot from an intake user's perspective would be that you could declare it in the catalog yaml.
Another benefit is that hvPlot also gets the feature. After reading dfviz, how about having it accept one of four new parameters:
    def _process_data(
        ...,
        sample_head: bool,
        sample_tail: bool,
        sample_partition: int,
        sample_frac: float,  # between 0 and 1
    ):
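The "one of four" rule could be enforced along these lines. This is only a sketch: select_sampling is a hypothetical helper, and the defaults (False or None meaning "not set") are assumptions that the signature above does not state.

```python
def select_sampling(sample_head=False, sample_tail=False,
                    sample_partition=None, sample_frac=None):
    """Hypothetical helper: return (mode, value) for the single option set.

    Returns None when no sampling option is given and raises
    ValueError when more than one is given.
    """
    chosen = []
    if sample_head:
        chosen.append(("head", True))
    if sample_tail:
        chosen.append(("tail", True))
    if sample_partition is not None:  # partition 0 is a valid choice
        chosen.append(("partition", int(sample_partition)))
    if sample_frac is not None:
        chosen.append(("frac", float(sample_frac)))
    if len(chosen) > 1:
        names = ", ".join(name for name, _ in chosen)
        raise ValueError(f"only one sampling option may be set, got: {names}")
    return chosen[0] if chosen else None
```

The returned mode could then be mapped to head(), tail(), a partition read, or sample(frac=...) on the loaded data.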
Catalog yaml:

    metadata:
      plots:
        bedrooms:
          kind: hist
          y: bedrooms
          use_dask: True
          sample_head: True
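To show how such a catalog entry might be consumed, here is a sketch that separates the loading and sampling keys from the options passed on to the plot call. The key set LOADING_KEYS and the helper split_plot_entry are assumptions for illustration; nothing in this thread says hvPlot or Intake works this way.

```python
# Assumed set of keys that control loading rather than plotting.
LOADING_KEYS = {"use_dask", "sample_head", "sample_tail",
                "sample_partition", "sample_frac"}


def split_plot_entry(entry):
    """Hypothetical helper: split a plots entry into (loading, plot) options."""
    loading = {k: v for k, v in entry.items() if k in LOADING_KEYS}
    plot = {k: v for k, v in entry.items() if k not in LOADING_KEYS}
    return loading, plot


# The "bedrooms" entry from the catalog above, as a parsed dict.
entry = {"kind": "hist", "y": "bedrooms", "use_dask": True, "sample_head": True}
loading, plot = split_plot_entry(entry)
```

Here loading would carry use_dask and sample_head, while plot keeps kind and y for the plotting call.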