intake

Large dataset Plot loads all partitions

Open talebzeghmi opened this issue 6 years ago • 4 comments

Issue:

By default, hvPlot loads all partitions of an intake source into memory; see process_intake().

Proposed solution:

Pass two extra parameters (in addition to use_dask) to hvPlot.process_intake(): read_partition and dask_sample.

dask_sample: value would be between 0 and 1.
read_partition: would indicate which partition to read.

The contents of process_intake can be updated to:

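    # todo: argument validation
    if dask_sample:
        # Lazily sample a fraction of the rows from the dask dataframe.
        data = data.to_dask().sample(frac=float(dask_sample), replace=True)
    elif use_dask:
        data = data.to_dask()
    elif read_partition is not None:
        # Compare against None (the assumed default) so partition 0 is not skipped as falsy.
        data = data.read_partition(read_partition)
    else:
        data = data.read()
    return data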
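
The `# todo: argument validation` step could look something like the minimal sketch below. `choose_load_mode` is a hypothetical helper, not part of hvPlot or intake. It assumes both new parameters default to `None`, treats the fraction as valid in the range (0, 1], and keeps the same precedence as the branch order above.

```python
def choose_load_mode(use_dask=False, dask_sample=None, read_partition=None):
    """Validate the proposed arguments and name the branch that would run.

    Hypothetical helper for illustration; not part of hvPlot or intake.
    """
    if dask_sample is not None and read_partition is not None:
        raise ValueError("pass dask_sample or read_partition, not both")
    if dask_sample is not None:
        frac = float(dask_sample)
        # Assumption: 0 is excluded (an empty sample) and 1 is allowed.
        if not 0.0 < frac <= 1.0:
            raise ValueError(f"dask_sample must be in (0, 1], got {frac}")
        return "sample"
    if use_dask:
        return "dask"
    if read_partition is not None:
        return "partition"
    return "read"
```

Returning the branch name rather than the data keeps the check separate from any I/O, so it can run before a large source is touched.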

Related

Related hvPlot proposal https://github.com/pyviz/hvplot/issues/290
Add "sample" and "plot" to GUI https://github.com/intake/intake/issues/182
intake select a subset https://github.com/pyviz/hvplot/issues/72

Aug 21 '19 17:08 talebzeghmi

> dask_sample: value would be between 0 and 1.

Could we generalize this? AFAIK pandas also supports .sample(frac=...), so could we call it sample_frac and have it apply to both pandas and dask?
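
Since pandas and dask DataFrames both expose `.sample(frac=...)`, a single code path seems plausible. A minimal sketch, assuming a hypothetical helper named `sample_fraction` (neither the helper nor a `sample_frac` keyword is confirmed to exist in hvPlot):

```python
def sample_fraction(df, sample_frac=None, random_state=None):
    """Return a random fraction of df, or df unchanged when sample_frac is None.

    df is anything with a pandas-style .sample(frac=..., random_state=...)
    method; pandas and dask DataFrames both qualify. Hypothetical helper.
    """
    if sample_frac is None:
        return df
    # On a dask DataFrame this stays lazy until the result is computed.
    return df.sample(frac=float(sample_frac), random_state=random_state)
```

Passing `random_state` through makes a sampled plot reproducible between runs, which may matter when the plot is declared once in a catalog and rendered many times.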

Aug 22 '19 11:08 philippjfr

I have exactly this in my code, but it is in dfviz. It could indeed be something done by Intake before passing to hvPlot, or within hvPlot - or it may be good enough where it is (because the user is always free to get the dataframe and sample however they like). I'm not sure that it matters.

Aug 22 '19 12:08 martindurant

The benefit of putting it in hvPlot from an intake user's perspective would be that you could declare it in the catalog yaml.

Aug 22 '19 12:08 philippjfr

> The benefit of putting it in hvPlot from an intake user's perspective would be that you could declare it in the catalog yaml.

Another benefit is that hvPlot also gets the feature. Having read dfviz, how about hvPlot accepting one of four new parameters:

    def _process_data(
        ...,
        sample_head: bool,
        sample_tail: bool,
        sample_partition: int,
        sample_frac: float,  # between 0 and 1
    ):

The option could then be declared in the catalog YAML:

    metadata:
        plots:
            bedrooms:
                kind: hist
                y: bedrooms
                use_dask: True
                sample_head: True
Aug 22 '19 17:08 talebzeghmi