Daft Write a guide on partitioning

Open jaychia opened this issue 1 year ago • 4 comments

Write a guide to enumerate key concepts around partitioning:

Increasing the number of partitions in your DataFrame has the following effects:

1. Increases the parallelism available to your workload, since more partitions can be processed at the same time.
2. Decreases the efficiency of "global operations" such as sorts, aggregations and joins, since the data is split at a finer granularity and more of it has to be shuffled between partitions.
3. Decreases the peak memory needed to process each partition, since each partition is smaller.
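
To make the trade-off above concrete for the guide, here is a rough pure-Python sketch. It is not Daft code and does not reflect how Daft implements partitioning; `split_evenly` and `grouped_count` are hypothetical helpers made up for illustration. It splits 1000 rows into more and more partitions and reports the largest partition (a stand-in for peak per-partition memory) and the number of partial results a global aggregation has to merge (a stand-in for shuffle cost).

```python
from collections import Counter


def split_evenly(rows, num_partitions):
    """Split rows into num_partitions contiguous chunks of near-equal size."""
    base, extra = divmod(len(rows), num_partitions)
    partitions, start = [], 0
    for i in range(num_partitions):
        # The first `extra` partitions take one additional row each.
        size = base + (1 if i < extra else 0)
        partitions.append(rows[start:start + size])
        start += size
    return partitions


def grouped_count(partitions):
    """Count rows per key: one partial result per partition, then a global merge."""
    # This step is independent per partition, so it could run in parallel.
    partials = [Counter(key for key, _ in part) for part in partitions]
    # This step is the "global" part: every partial result has to be combined.
    merged = Counter()
    for partial in partials:
        merged.update(partial)
    return merged, len(partials)


# Toy data: (key, value) pairs with four distinct keys.
rows = [(i % 4, i) for i in range(1000)]

for n in (2, 8, 32):
    parts = split_evenly(rows, n)
    peak_rows = max(len(p) for p in parts)
    totals, partials_merged = grouped_count(parts)
    print(f"partitions={n} peak_rows={peak_rows} partials_merged={partials_merged}")
```

The totals are identical for every partition count; only the peak partition size (which goes down) and the number of partial results to merge (which goes up) change.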

The guide should also cover how to think about partitioning schemes and partition columns, and the ways to repartition a DataFrame.
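
As a possible starting point for the partition-column part of the guide, here is a minimal pure-Python sketch of hash partitioning on a column. It is an illustration only, not Daft's implementation: `hash_partition` and `partitions_holding` are hypothetical helpers, and the `country`/`amount` rows are made-up sample data. The point is that hashing on a column places all rows with the same value in the same partition, which is what lets a join or group-by on that column avoid a further shuffle.

```python
import zlib


def hash_partition(rows, key, num_partitions):
    """Assign each row to a partition based on a hash of row[key]."""
    partitions = [[] for _ in range(num_partitions)]
    for row in rows:
        # crc32 is stable across runs, unlike Python's built-in str hash.
        digest = zlib.crc32(str(row[key]).encode("utf-8"))
        partitions[digest % num_partitions].append(row)
    return partitions


def partitions_holding(partitions, key, value):
    """Return the indices of the partitions that contain rows with row[key] == value."""
    return [
        i for i, part in enumerate(partitions)
        if any(row[key] == value for row in part)
    ]


# Made-up sample rows for illustration.
rows = [
    {"country": "SG", "amount": 1},
    {"country": "US", "amount": 2},
    {"country": "SG", "amount": 3},
    {"country": "FR", "amount": 4},
    {"country": "US", "amount": 5},
    {"country": "SG", "amount": 6},
]

parts = hash_partition(rows, "country", 3)
for country in ("SG", "US", "FR"):
    # Each country's rows end up in exactly one partition.
    print(country, partitions_holding(parts, "country", country))
```

If I recall the API correctly, the corresponding Daft calls are `df.repartition(n, col)` for hash partitioning on a column and `df.into_partitions(n)` for splitting or merging partitions without regard to column values; the guide should confirm both against the current API reference.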

When this is done, we should also link the document in our Key Concepts quickstart doc.