
Write a guide on partitioning

Open jaychia opened this issue 1 year ago • 4 comments

Write a guide to enumerate key concepts around partitioning:

Increasing the number of partitions in your DataFrame has the following effects:

1. Increases the parallelism available to your workload, since more partitions can be processed at the same time.
2. Decreases the efficiency of "global operations" such as a sort, aggregation, or join, since the data is split at a finer granularity and more of it has to be shuffled between partitions.
3. Decreases the peak memory needed to process each partition, since each partition is smaller.
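A rough way to picture the three effects above is a back-of-the-envelope model. The sketch below is plain Python, not Daft's API: the function name `partition_effects`, the `num_workers` parameter, and the assumption that a shuffle moves one piece of data from every input partition to every output partition are all illustrative simplifications.

```python
from math import ceil


def partition_effects(num_rows: int, num_partitions: int, num_workers: int) -> dict:
    """Toy model of how the partition count changes a workload (assumed model, not Daft internals)."""
    if num_partitions < 1 or num_workers < 1:
        raise ValueError("num_partitions and num_workers must be positive")
    return {
        # Effect 1: at most one task per partition can run at once,
        # so parallelism is capped by both partitions and workers.
        "usable_parallelism": min(num_partitions, num_workers),
        # Effect 3: each task only holds one partition's rows in memory.
        "rows_per_partition": ceil(num_rows / num_partitions),
        # Effect 2: assuming an all-to-all shuffle into the same number of
        # partitions, every input partition sends a piece to every output one.
        "shuffle_transfers": num_partitions * num_partitions,
    }


few = partition_effects(num_rows=1_000_000, num_partitions=4, num_workers=16)
many = partition_effects(num_rows=1_000_000, num_partitions=64, num_workers=16)
print(few)
print(many)
```

Under these assumptions, going from 4 to 64 partitions uses all 16 workers and shrinks each partition, but the number of shuffle transfers grows quadratically, which is the trade-off the guide should explain.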

The guide should also cover how to think about partitioning schemes, partition columns, and the ways to repartition a DataFrame.
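To make "partitioning scheme" and "partition column" concrete, here is a minimal sketch of two common schemes, written in plain Python over a list of dicts rather than with Daft's API. The helper names `hash_partition` and `round_robin_partition` and the use of `zlib.crc32` as the hash are assumptions made for illustration only.

```python
import zlib


def hash_partition(rows, column, num_partitions):
    """Place each row by a hash of its partition column, so equal keys land together."""
    partitions = [[] for _ in range(num_partitions)]
    for row in rows:
        key = str(row[column]).encode("utf-8")
        # crc32 is deterministic across runs, unlike Python's built-in hash() for strings.
        partitions[zlib.crc32(key) % num_partitions].append(row)
    return partitions


def round_robin_partition(rows, num_partitions):
    """Deal rows out in turn: sizes are balanced, but equal keys may be split up."""
    partitions = [[] for _ in range(num_partitions)]
    for i, row in enumerate(rows):
        partitions[i % num_partitions].append(row)
    return partitions


rows = [{"user": u, "value": i} for i, u in enumerate(["a", "b", "c", "a", "b", "a"])]
by_user = hash_partition(rows, "user", 3)
balanced = round_robin_partition(rows, 3)
print([len(p) for p in by_user])
print([len(p) for p in balanced])
```

Hash partitioning on a column keeps all rows for a key in one partition, which helps joins and group-bys on that column but can produce skewed partition sizes. Round-robin gives even sizes but no such locality.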

When this is done, we should also link the document in our Key Concepts quickstart doc.

jaychia avatar Apr 24 '23 22:04 jaychia

Hi @jaychia, is this still open? Can I work on it?

hannydevelop avatar Jul 03 '23 09:07 hannydevelop

Hi @hannydevelop, yes! Please feel free to come up with a proposal. We are also happy to chat about it directly, so come talk to us on Slack.

jaychia avatar Jul 09 '23 20:07 jaychia

Hello @jaychia, great, I'll work on this now.

hannydevelop avatar Jul 19 '23 12:07 hannydevelop

More information on partitioning would be great!

ghalimi avatar Mar 10 '24 00:03 ghalimi