
[FEATURE REQUEST]: GetNumPartitions

Open dbeavon opened this issue 4 years ago • 2 comments

Is your feature request related to a problem? Please describe. It takes a lot of effort to detect the number of partitions in a dataframe. Currently I have to swap back and forth between the Spark UI and my driver code that is executing in the debugger. It seems like there should be an easy way to show the number of partitions, possibly in the Visual Studio debugger watch window.

Describe the solution you'd like It would be nice to have a simple method, GetNumPartitions. Perhaps it could be done in an extension class, if it would be confusing to add it to DataFrame directly (i.e., perhaps the goals for this class discourage us from polluting it with methods that we don't find in Python/Scala).

Describe alternatives you've considered In Python and Scala these types of things have been handled by relying on the old RDD interface. See https://stackoverflow.com/questions/42171499/get-current-number-of-partitions-of-a-dataframe

I considered using reflection to try to interact with the "internal" RDD implementation that was previously available in this GitHub project. I'd rather not go too far off the beaten path, and if this type of approach were worthwhile, I think someone would have added it already (maybe even in a "PreviewFeatures.RddExtensions" namespace or something along those lines). Can someone tell me if there is some reasonable way to bring back certain aspects of RDD until DataFrame has reached the same level of functionality? Note that Python and Scala both give us full access to ".rdd", and I think it is unfair that .NET for Spark is so strict in disallowing those features. (Another related thing that we might be able to accomplish with a ".rdd" member is redefining the schema of the DataFrame.)

Additional context There have been other issues that would have been easier to solve with the help of ".rdd" methods.

As a matter of curiosity, is it common for developers to use reflection for accessing all the RDD functionality that way? I suppose I would be more willing to take that type of an approach if everyone else is already doing it. But it seems like a bit of a crutch, and it seems like it is being deliberately discouraged.

dbeavon avatar Apr 18 '21 21:04 dbeavon

It takes a lot of effort to detect the number of partitions in a dataframe

What's the use case where you need to know the number of partitions?

There have been other issues that would have been easier to solve with the help of ".rdd" methods.

Could you describe the scenarios?

imback82 avatar Apr 19 '21 18:04 imback82

For example, it would give insights into the current partitioning state of a DataFrame.

There is the SparkPartitionId method in the Functions class. Knowing the partition id of a row without knowing the total number of partitions doesn't let you conclude whether and to what extent the data frame is unbalanced.

michael-damatov avatar Apr 19 '21 19:04 michael-damatov