
Large dataset Plot loads all partitions

Open talebzeghmi opened this issue 6 years ago • 4 comments

Issue:

By default, hvPlot loads all partitions of an intake source into memory; see process_intake().

Proposed solution:

Pass two extra parameters (in addition to use_dask) to hvPlot.process_intake(): read_partition and dask_sample.

  • dask_sample: value would be between 0 and 1.
  • read_partition: would indicate which partition to read.

The contents of process_intake can be updated to:

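    # todo: argument validation
    if dask_sample:
        # Lazily sample a fraction of the rows (with replacement) via dask.
        data = data.to_dask().sample(frac=float(dask_sample), replace=True)
    elif use_dask:
        # Keep the data lazy as a dask dataframe.
        data = data.to_dask()
    elif read_partition is not None:
        # Load a single partition; "is not None" lets partition 0 be selected.
        data = data.read_partition(read_partition)
    else:
        # Default: load every partition into memory.
        data = data.read()
    return data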
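The `# todo: argument validation` step might look something like the sketch below. It only decides which branch of the logic above applies and loads no data. `resolve_read_mode` and the labels it returns are hypothetical names for illustration, not part of hvPlot or intake, and rejecting the combination of dask_sample and read_partition is an assumption about the intended behaviour.

```python
def resolve_read_mode(use_dask=False, read_partition=None, dask_sample=None):
    """Return which branch of the proposed process_intake logic applies.

    Hypothetical helper: the name and the returned labels are illustrative.
    """
    if dask_sample is not None and read_partition is not None:
        # Assumption: the two sampling options are mutually exclusive.
        raise ValueError("pass either dask_sample or read_partition, not both")
    if dask_sample is not None:
        frac = float(dask_sample)
        if not 0 < frac <= 1:
            raise ValueError("dask_sample must be in the interval (0, 1]")
        return "sample"
    if use_dask:
        return "dask"
    if read_partition is not None:
        # Compare against None so that partition 0 is not treated as "unset".
        return "partition"
    return "read"
```

Checking the arguments up front keeps the loading branches themselves as short as in the snippet above.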

Related

  • Related hvPlot proposal https://github.com/pyviz/hvplot/issues/290
  • Add "sample" and "plot" to GUI https://github.com/intake/intake/issues/182.
  • intake select a subset https://github.com/pyviz/hvplot/issues/72

talebzeghmi, Aug 21 '19 17:08

> dask_sample: value would be between 0 and 1.

Could we generalize this? AFAIK pandas also supports .sample(frac=...), so could we call it sample_frac and have it apply to both?
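A minimal sketch of what a backend-agnostic sample_frac could look like, assuming the data object exposes `.sample(frac=...)` as both pandas and dask dataframes do. `apply_sample_frac` is a hypothetical helper name, not an existing hvPlot function.

```python
def apply_sample_frac(data, sample_frac=None):
    """Return a random fraction of `data`, or `data` unchanged if unset.

    Hypothetical helper: it relies only on `data.sample(frac=...)`, so the
    same call works for a pandas or a dask dataframe.
    """
    if sample_frac is None:
        return data
    frac = float(sample_frac)
    if not 0 < frac <= 1:
        raise ValueError("sample_frac must be in the interval (0, 1]")
    # Both backends sample without replacement unless told otherwise.
    return data.sample(frac=frac)
```

With pandas installed, `apply_sample_frac(df, 0.1)` would return roughly a tenth of the rows of `df`; with a dask dataframe the result stays lazy until it is computed.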

philippjfr, Aug 22 '19 11:08

I have exactly this in my code, but it is in dfviz. It could indeed be something done by Intake before passing to hvPlot, or within hvPlot - or it may be good enough where it is (because the user is always free to get the dataframe and sample however they like). I'm not sure that it matters.

martindurant, Aug 22 '19 12:08

The benefit of putting it in hvPlot from an intake user's perspective would be that you could declare it in the catalog yaml.

philippjfr, Aug 22 '19 12:08

> The benefit of putting it in hvPlot from an intake user's perspective would be that you could declare it in the catalog yaml.

Another benefit is that hvPlot also gets the feature. After reading dfviz, how about having it accept one of four new parameters:

    def _process_data(
        ...,
        sample_head: bool,
        sample_tail: bool,
        sample_partition: int,
        sample_frac: float,  # between 0 and 1
    ): ...

    metadata:
        plots:
            bedrooms:
                kind: hist
                y: bedrooms
                use_dask: True
                sample_head: True
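To illustrate the "one of four" idea, here is a rough sketch of how a plot declaration like the one above could be split into sampling options and ordinary plot options once the YAML has been loaded into a dict. `SAMPLE_KEYS` and `split_plot_spec` are hypothetical names, and raising an error when more than one sampling option is set is an assumption rather than agreed behaviour.

```python
# The four parameter names proposed above.
SAMPLE_KEYS = ("sample_head", "sample_tail", "sample_partition", "sample_frac")


def split_plot_spec(spec):
    """Split a plot declaration into (sampling options, other plot options)."""
    sampling = {}
    plot_opts = {}
    for key, value in spec.items():
        if key in SAMPLE_KEYS:
            # Skip options that are switched off (None or False);
            # sample_partition=0 is kept because 0 is a valid index.
            if value is None or value is False:
                continue
            sampling[key] = value
        else:
            plot_opts[key] = value
    if len(sampling) > 1:
        # Assumption: at most one sampling strategy may be active.
        raise ValueError("use at most one of: " + ", ".join(SAMPLE_KEYS))
    return sampling, plot_opts


# The "bedrooms" plot from the catalog snippet, as a plain dict.
spec = {"kind": "hist", "y": "bedrooms", "use_dask": True, "sample_head": True}
sampling, plot_opts = split_plot_spec(spec)
```

Here `sampling` holds only `sample_head`, while `kind`, `y` and `use_dask` stay in `plot_opts` to be passed on to the plotting call.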

talebzeghmi, Aug 22 '19 17:08