big_data_benchmarks
Comparisons of big data technologies for cleaning, manipulating, and generally wrangling data for analysis and machine learning.
Wouldn't the code below in spark-koalas.ipynb cause self-recursion?

```
def filter_data(df):
    filtered = df[expr_filter]
    p = filter_data(data).to_pandas()
    del p
    return filtered
```
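The concern can be checked without Spark. Below is a minimal sketch, assuming a plain Python list stands in for the Koalas DataFrame and a hypothetical `is_even` predicate stands in for `expr_filter`; neither name comes from the notebook. It shows that a function with this shape hits Python's recursion limit. It also shows one possible restructuring, where the function only builds the filtered result and the caller materializes it. This is not necessarily what the notebook author intended.

```python
# Hypothetical stand-ins for the notebook's objects: a plain list instead of a
# Koalas DataFrame, and a simple predicate instead of expr_filter.
data = list(range(10))


def is_even(x):
    return x % 2 == 0


def filter_data_recursive(rows):
    """Same shape as the quoted snippet: the function calls itself unconditionally."""
    filtered = [r for r in rows if is_even(r)]
    p = filter_data_recursive(data)  # no base case, so this call never returns
    del p
    return filtered


def filter_data(rows):
    """Possible fix (an assumption): only build and return the filtered result."""
    return [r for r in rows if is_even(r)]


def hits_recursion_limit(func, arg):
    """Return True if calling func(arg) raises RecursionError."""
    try:
        func(arg)
    except RecursionError:
        return True
    return False


recursive_fails = hits_recursion_limit(filter_data_recursive, data)

# Materialize outside the function, for example inside the timing harness.
result = filter_data(data)
print(recursive_fails, result)
```

In the notebook's terms, this would mean calling `filter_data(data).to_pandas()` from the benchmarking cell rather than from inside `filter_data` itself.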
Example notebook demonstrating how to write a Parquet file with small partitions. It will require some minor changes to work in an AWS environment.
This is a somewhat better comparison, and it makes Dask run. Instead of materializing a column, we now aggregate (take the mean). And we don't ask Dask to materialize...