[Java]Best practice with Apache/Spark

Open jayhan94 opened this issue 11 months ago • 4 comments

Feature Request

Are there any best practices for using fury with Apache Spark? Will the community implement such a module?

Is your feature request related to a problem? Please describe

No response

Describe the solution you'd like

No response

Describe alternatives you've considered

No response

Additional context

No response

jayhan94 avatar Jan 20 '25 10:01 jayhan94

Hi @jayhan94, we don't have such documentation currently. A better fury integration with spark/flink would require changing the source code of the serialization module in spark/flink, which is beyond the scope of this project. Maybe in the future we can submit proposals to the spark/flink communities.

Currently, if you want to use fury in spark/flink, you can update your driver program to add chained serialization/deserialization operators (a narrow dependency in spark): serialize each record to bytes before the shuffle and deserialize it afterwards.

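Here is a simple Spark RDD example:

val lines = sc.textFile("data.txt")
// Json.parse stands for whatever JSON parser you use
val structSet = lines.map(s => Json.parse(s, classOf[Struct]))
// serialize each value with fury before the shuffle
val kvset = structSet.map(s => (s.key, fury.serialize(s)))
// after groupByKey, deserialize the first value of each group
kvset.groupByKey().map(t => (t._1, fury.deserialize(t._2.head))).collect().foreach(println)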

A Flink program will be similar:

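DataStream<Struct> dataStream = xxxstream.map(s -> Json.parse(s, Struct.class));
// serialize with fury before the records are redistributed
DataStream<byte[]> byteStream = dataStream.map(s -> fury.serialize(s));
// rebalance() ships only the bytes; deserialize on the receiving side
DataStream<Struct> result = byteStream.rebalance().map(bytes -> (Struct) fury.deserialize(bytes));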
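
The same pattern (serialize before the shuffle boundary, move only bytes, deserialize afterwards) can be tried without a cluster. The following is only a minimal sketch under stated assumptions: the `ShuffleBytesSketch` class, its `Struct` type, and the `groupByKey` and `firstPerKey` helpers are hypothetical, and JDK serialization stands in for `fury.serialize` / `fury.deserialize` so that the example runs with no dependencies. In a real job you would presumably swap the two stand-in methods for calls on a fury instance.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.ObjectInputStream;
import java.io.ObjectOutputStream;
import java.io.Serializable;
import java.io.UncheckedIOException;
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

public class ShuffleBytesSketch {
    // Hypothetical payload type, standing in for the Struct used in the examples above.
    static final class Struct implements Serializable {
        private static final long serialVersionUID = 1L;
        final String key;
        final int value;

        Struct(String key, int value) {
            this.key = key;
            this.value = value;
        }
    }

    // Stand-in for fury.serialize(obj): JDK serialization, used only to stay dependency-free.
    static byte[] serialize(Object obj) {
        try (ByteArrayOutputStream bos = new ByteArrayOutputStream();
             ObjectOutputStream oos = new ObjectOutputStream(bos)) {
            oos.writeObject(obj);
            oos.flush();
            return bos.toByteArray();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    // Stand-in for fury.deserialize(bytes).
    static Object deserialize(byte[] bytes) {
        try (ObjectInputStream ois = new ObjectInputStream(new ByteArrayInputStream(bytes))) {
            return ois.readObject();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        } catch (ClassNotFoundException e) {
            throw new IllegalStateException(e);
        }
    }

    // Toy "shuffle": groups opaque byte[] payloads by key, the way groupByKey
    // would see them if the values were serialized upstream.
    static Map<String, List<byte[]>> groupByKey(List<Map.Entry<String, byte[]>> pairs) {
        Map<String, List<byte[]>> grouped = new LinkedHashMap<>();
        for (Map.Entry<String, byte[]> pair : pairs) {
            grouped.computeIfAbsent(pair.getKey(), k -> new ArrayList<>()).add(pair.getValue());
        }
        return grouped;
    }

    // Full pipeline: serialize each record, group the bytes by key,
    // then deserialize the first payload of every group.
    static Map<String, Struct> firstPerKey(List<Struct> input) {
        List<Map.Entry<String, byte[]>> pairs = new ArrayList<>();
        for (Struct s : input) {
            pairs.add(Map.entry(s.key, serialize(s)));
        }
        Map<String, Struct> result = new LinkedHashMap<>();
        groupByKey(pairs).forEach((key, payloads) ->
                result.put(key, (Struct) deserialize(payloads.get(0))));
        return result;
    }

    public static void main(String[] args) {
        List<Struct> input = List.of(new Struct("a", 1), new Struct("b", 2), new Struct("a", 3));
        firstPerKey(input).forEach((key, s) -> System.out.println(key + " -> " + s.value));
    }
}
```

The grouping step only ever touches `byte[]` values, which is the point of the pattern: the framework's own serializer handles plain byte arrays during the shuffle, while the object encoding is done by the serializer you chose.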

chaokunyang avatar Jan 30 '25 06:01 chaokunyang

@chaokunyang Thanks for your reply. I wasn't asking about serialization inside RDD operations. I meant implementing a spark.serializer based on fury, which might help the shuffle process, just like KryoSerializer.

jayhan94 avatar Feb 02 '25 04:02 jayhan94

@jayhan94 Have you already tested fury to implement spark.serializer?

imarch1 avatar Mar 19 '25 07:03 imarch1

> @jayhan94 Have you already tested fury to implement spark.serializer?

I haven't done rigorous testing; I only ran a demo in a production environment. On my test data, its throughput was slightly better than Kryo's. I can't recall the exact figure, but it was roughly 20-30% higher.

jayhan94 avatar Mar 19 '25 13:03 jayhan94