Support dask dataframes as input to lens.summarise

Open zblz opened this issue 8 years ago • 0 comments

Currently, lens requires a pandas dataframe as input to the lens.summarise method. This places an upper limit on the size of the dataset analysed, which must be smaller than the available memory in the machine. Even with efficient optimisation of memory usage during the execution of the dask graph, the initial requirement prevents lens from scaling.

Ideally, lens.summarise should accept dask dataframes as input, and build the execution graph based on this delayed dataframe. This will require a rework of the functions in lens.metrics, given that all of them currently take either pd.Series or pd.Dataframe as arguments. In most cases we should be able to use the dask dataframe API, but for other metrics it will be necessary to access the individual chunks and reduce the result appropriately.

Adding this support, along with the distributed scheduler #11, will allow lens to analyse datasets significantly larger than the memory of the machine.

Aug 15 '17 14:08 zblz