dask-examples
Dask Bag examples
We currently lack dask bag examples in this repository. Two come to mind:
- Read JSON data, and do some groupby aggregation with both Bag.groupby and Bag.foldby
- Read text data and do some basic wordcount
For the JSON data it might make sense to add a dataset generation tool for nested records data, similar to dask.datasets.timeseries, and then use that to write JSON data to disk, similar to how we generate CSV data in http://examples.dask.org/dataframes/01-data-access.html#Create-artificial-dataset.
We would then read the JSON data, and do some minimal processing.
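A rough sketch of what the JSON example could look like. The make_records helper, the record layout (name, account.balance), and the file naming are all made up here for illustration; no such generator exists in dask yet as far as this issue is concerned. The synchronous scheduler is used only to keep the sketch easy to run anywhere.

```python
import json
import os
import random
import tempfile

import dask
import dask.bag as db


def make_records(n, seed=0):
    """Hypothetical generator of nested records (not a dask API)."""
    rng = random.Random(seed)
    names = ["Alice", "Bob", "Charlie"]
    return [
        {"name": rng.choice(names), "account": {"balance": rng.randint(0, 100)}}
        for _ in range(n)
    ]


records = make_records(100)
tmpdir = tempfile.mkdtemp()
pattern = os.path.join(tmpdir, "*.json")

with dask.config.set(scheduler="synchronous"):
    # Write one JSON document per line, one file per partition.
    db.from_sequence(records, npartitions=4).map(json.dumps).to_textfiles(pattern)

    # Read the files back and parse each line into a dict.
    bag = db.read_text(pattern).map(json.loads)

    # Bag.groupby: collects all records per key (requires a full shuffle).
    totals_groupby = dict(
        bag.groupby(lambda r: r["name"])
        .map(lambda kv: (kv[0], sum(r["account"]["balance"] for r in kv[1])))
        .compute()
    )

    # Bag.foldby: reduces within each partition, then combines the partials.
    totals_foldby = dict(
        bag.foldby(
            key=lambda r: r["name"],
            binop=lambda total, r: total + r["account"]["balance"],
            initial=0,
            combine=lambda a, b: a + b,
            combine_initial=0,
        ).compute()
    )
```

Both approaches should produce the same per-name totals. Showing them side by side would let the example explain why foldby is usually preferable: it avoids moving every record between partitions.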
For the text data I wonder if there is an online dataset we can download. I suspect that the complete works of Shakespeare are available somewhere. We might do a simple thing like read, split, frequencies. Or we might do more complex work afterwards by bringing in NLTK, stemming words, removing stopwords, etc.
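The simple read, split, frequencies version might look roughly like this. Since no dataset has been chosen yet, the sketch writes three lines from Hamlet to a temporary file as a stand-in for a downloaded corpus; the tokenize helper and its regular expression are assumptions, not a recommendation for real tokenization.

```python
import os
import re
import tempfile

import dask
import dask.bag as db

# Tiny stand-in corpus; a real example would download a full text instead.
lines = [
    "To be, or not to be, that is the question:",
    "Whether 'tis nobler in the mind to suffer",
    "The slings and arrows of outrageous fortune,",
]
path = os.path.join(tempfile.mkdtemp(), "sample.txt")
with open(path, "w", encoding="utf-8") as f:
    f.write("\n".join(lines))


def tokenize(line):
    """Hypothetical tokenizer: lowercase, keep letters and apostrophes."""
    return re.findall(r"[a-z']+", line.lower())


with dask.config.set(scheduler="synchronous"):
    words = db.read_text(path).map(tokenize).flatten()
    # frequencies() yields (word, count) pairs.
    word_counts = dict(words.frequencies().compute())
    # topk picks the most common words by count.
    top_words = words.frequencies().topk(3, key=lambda kv: kv[1]).compute()
```

The NLTK steps (stemming, stopword removal) would slot in as additional map and filter calls between flatten and frequencies.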