spark
spark copied to clipboard
Question: How to use DataFrame API to achieve the function equivalent to map/reduce in spark.net
Hi, we have a scenario that need to use the map/reduce function in spark.net, For example, we want to call
public IEnumerable<object[]> MapCallback(IEnumerable<Row> input)
{
// do something with `IEnumerable<Row> input`
}
df.Rdd.MapPartitions(MapCallback, true)
The thing is, we need this IEnumerable<Row> input so that we can do some operation on the row level. In Mobius, we can access all the Rdd-related APIs, but according to this issue seems all the Rdd-related APIs are no longer accessible.
So we have the following questions:
- Is there any API in current Spark.Net that can implement the function that exactly equivalent to Rdd.Map, Rdd.Reduce and other mapreduce related function? Note that we need to deal with a IEnumerable<Row> with arbitrary number of elements in one row, i.e., we may not know how many elements (columns) in a row until runtime.
- If the answer of 1 is false, can we just download the source code, change the visibility of Rdd-related APIs to public, and build a private bits to use?
- Any other related suggestions will be really appreciated.
Looking forward to your answer! Thanks a lot!