Daft
Write a guide on partitioning
Write a guide that enumerates the key concepts around partitioning.
Increasing the number of partitions in your DataFrame has the following effects:
1. Increase the amount of parallelism available to your workload (since more partitions can be processed in parallel)
2. Decrease the efficiency of "global operations" such as a sort, aggregation or join (since data is split at a finer granularity and more shuffling of data is required)
3. Decrease the peak memory utilization of processing each partition (since your partitions will be smaller and take less memory to process per-partition)
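To make the three effects above concrete, here is a minimal pure-Python sketch. It is a toy model, not Daft's API or implementation: `split_round_robin` and `partition_stats` are hypothetical helpers, and counting `num_partitions * num_partitions` transfers assumes a naive all-to-all shuffle where every input partition sends a slice to every output partition.

```python
def split_round_robin(rows, num_partitions):
    """Deal rows out to partitions one at a time, like dealing cards."""
    return [rows[i::num_partitions] for i in range(num_partitions)]


def partition_stats(num_rows, num_partitions):
    """Summarize the tradeoffs of a partition count under a toy cost model."""
    parts = split_round_robin(list(range(num_rows)), num_partitions)
    return {
        # Effect 1: each partition can be processed as an independent task.
        "parallel_tasks": len(parts),
        # Effect 3: the largest partition bounds per-task memory.
        "peak_rows_per_task": max(len(p) for p in parts),
        # Effect 2 (assumption): a naive all-to-all shuffle moves one slice
        # from every input partition to every output partition.
        "shuffle_transfers": num_partitions * num_partitions,
    }


few = partition_stats(1000, 4)
many = partition_stats(1000, 16)
print(few)
print(many)
```

Going from 4 to 16 partitions in this model quadruples the available parallelism and cuts the largest partition to roughly a quarter, but multiplies the number of shuffle transfers by 16.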
The guide should also cover how to think about partitioning schemes, partition columns, and ways to repartition a DataFrame.
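As a starting point for the partitioning-scheme discussion, the following pure-Python sketch contrasts hash partitioning on a column with a round-robin split. It does not use Daft: `hash_partition` and `partitions_per_key` are hypothetical helpers, the `country`/`sales` data is made up, and `zlib.crc32` stands in for whatever hash function a real engine would use.

```python
from collections import defaultdict
from zlib import crc32


def hash_partition(rows, key, num_partitions):
    """Place each row in the partition chosen by hashing its key column."""
    parts = [[] for _ in range(num_partitions)]
    for row in rows:
        # crc32 is deterministic across runs, unlike Python's built-in hash().
        idx = crc32(str(row[key]).encode("utf-8")) % num_partitions
        parts[idx].append(row)
    return parts


def partitions_per_key(parts, key):
    """Count how many partitions each distinct key value appears in."""
    seen = defaultdict(set)
    for idx, part in enumerate(parts):
        for row in part:
            seen[row[key]].add(idx)
    return {value: len(indices) for value, indices in seen.items()}


# Made-up example data.
pairs = [("US", 1), ("FR", 2), ("US", 3), ("JP", 4), ("FR", 5), ("US", 6)]
rows = [{"country": c, "sales": s} for c, s in pairs]

hashed = hash_partition(rows, "country", 3)
round_robin = [rows[i::3] for i in range(3)]

print(partitions_per_key(hashed, "country"))
print(partitions_per_key(round_robin, "country"))
```

With hash partitioning, every row for a given `country` lands in the same partition, so a group-by on that column could run per partition without further data movement. The round-robin split balances sizes but scatters `"US"` across two partitions, so the same group-by would need a shuffle first.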
When this is done, we should also link the document in our Key Concepts quickstart doc.
Hi @jaychia, is this still open? Can I work on it?
Hi @hannydevelop yes! Please feel free to come up with a proposal, and we are also happy to chat about it in person. Come talk to us on Slack.
hello @jaychia great, I'll work on this now
More information on partitioning would be great!