[BUG]: Performance considerations
Hello, this isn't a traditional bug report. It's more a question for the dotnet+spark team, plus a bit of info on how my company, Tonic, has been using this project.
Tonic builds what is essentially an ETL tool. We copy data from point A to point B and transform it along the way. Our transformations are meant to increase the privacy of the data and don't modify its schema. For example, a transformation might be: "For each character in a string, replace it with a random character of the same type."
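For context, that kind of transformation looks roughly like the following in C#. This is a minimal sketch, not Tonic's actual code; the `CharacterMask` name and its masking rules are illustrative.

```csharp
using System;
using System.Text;

// Illustrative character-preserving mask: each letter becomes a random
// letter of the same case, each digit a random digit, and everything
// else (punctuation, whitespace) passes through unchanged.
public static class CharacterMask
{
    private static readonly Random Rng = new Random();

    public static string Mask(string input)
    {
        if (input == null) return null;

        var sb = new StringBuilder(input.Length);
        foreach (char c in input)
        {
            if (char.IsDigit(c))
                sb.Append((char)('0' + Rng.Next(10)));
            else if (char.IsLower(c))
                sb.Append((char)('a' + Rng.Next(26)));
            else if (char.IsUpper(c))
                sb.Append((char)('A' + Rng.Next(26)));
            else
                sb.Append(c);
        }
        return sb.ToString();
    }
}
```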
We've been supporting Databricks, Spark, and AWS EMR for over a year now for our customers who have large amounts of data sitting on a file system, e.g., Parquet files cataloged with Hive and processed with Spark.
Since our stack is entirely in C#, we didn't want to re-create all of our data transformations in Java or Scala, so we have been using this project with some success to keep our stack entirely dotnet. BUT, some customers have such large amounts of data that we've actually had to begin porting our transformations to Java, because the perf hit of using dotnet+spark is simply too high. It's many orders of magnitude slower.
It looks like the cost is entirely in the serialization and deserialization that take place between the JVM and the CLR. Can someone from the team please provide a bit of an explanation of the serialization taking place? Of course, something has to happen to send data back and forth between these two different processes, but it also looks like Python is somehow being thrown into the mix. So is the serialization something like:
JVM -> Python -> CLR -> transform in C# UDF -> Python -> JVM?
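To make the question concrete, here is roughly how we invoke a UDF through Microsoft.Spark (a sketch reusing the `CharacterMask` example above; the paths and column name are made up). Every row flowing through `maskUdf` has to be serialized out of the JVM, deserialized in the CLR, transformed, and serialized back, and that round trip appears to be where the cost lives:

```csharp
using System;
using Microsoft.Spark.Sql;
using static Microsoft.Spark.Sql.Functions;

class Program
{
    static void Main()
    {
        SparkSession spark = SparkSession.Builder()
            .AppName("masking-example")
            .GetOrCreate();

        // Illustrative input path and column name.
        DataFrame df = spark.Read().Parquet("/data/customers.parquet");

        // The UDF body runs in the CLR; each row crosses the
        // JVM/CLR boundary on the way in and on the way out.
        Func<Column, Column> maskUdf = Udf<string, string>(
            s => CharacterMask.Mask(s));

        df.WithColumn("name", maskUdf(df["name"]))
          .Write()
          .Mode("overwrite")
          .Parquet("/data/customers_masked.parquet");
    }
}
```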
If that is the case, why was it decided to include Python? Also, has performance been a focus for your team at any point? It's something I'd be happy to investigate (and we have plenty of real-world use cases where improvements could be validated), but I would need some help understanding what's going on, and perhaps some pointers to get started.
Thanks, Adam.
@akamor Please take a look at the SPIP SPARK-26257 as to why we've had to use Python serialization in order to integrate with the current Spark interop layer.
We've also run TPC-H benchmarks showcasing the performance difference between Python and C#, as can be seen in the following blog post: https://devblogs.microsoft.com/dotnet/introducing-net-for-apache-spark/ . The benchmark code can be found here.