Spark 3.5.1, .NET 8, Dependencies and Documentation
Changes:
- Implemented compatibility with Spark 3.5.1 (fixes are in a separate commit)
- Updated project dependencies
- Updated .net6 -> .net8 and .net461 -> .net48
- Extracted the binary formatter into a separate class and added tests (a rough shape is sketched after this list)
- Fixed several small bugs related to null references and Windows paths containing whitespace
- Added a documentation page with component and sequence diagrams for .NET for Apache Spark. (Such diagrams would have helped me significantly, so they're worth adding.)
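For illustration only, a minimal sketch of what extracting the binary formatter into one testable seam might look like; `BinarySerde` is an invented name, not the PR's actual class, and on .NET 8 `BinaryFormatter` additionally needs the `EnableUnsafeBinaryFormatterSerialization` project flag at runtime:

```csharp
// Hypothetical sketch: one small class that isolates the obsolete
// BinaryFormatter so it can be unit-tested and later swapped out.
using System.IO;
using System.Runtime.Serialization.Formatters.Binary;

internal static class BinarySerde
{
#pragma warning disable SYSLIB0011 // BinaryFormatter is obsolete in .NET 8
    // Serializes an object graph to a byte array.
    public static byte[] Serialize(object graph)
    {
        using var ms = new MemoryStream();
        new BinaryFormatter().Serialize(ms, graph);
        return ms.ToArray();
    }

    // Deserializes a byte array produced by Serialize.
    public static object Deserialize(byte[] payload)
    {
        using var ms = new MemoryStream(payload);
        return new BinaryFormatter().Deserialize(ms);
    }
#pragma warning restore SYSLIB0011
}
```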
Tested with:
Spark:
- Each time on stop there's an exception that doesn't affect execution:
ERROR DotnetBackendHandler: Exception caught: java.net.SocketException: Connection reset at java.base/sun.nio.ch.SocketChannelImpl.throwConnectionReset(SocketChannelImpl.java:394)
- Works with local 3.5.1
- Works with local 3.5.2 (if the 'IgnorePathVersion...' setting is enabled on the Scala side)
Databricks:
- Fails on 15.4:
[Error] [JvmBridge] JVM method execution failed: Static method 'createPythonFunction' failed for class 'org.apache.spark.sql.api.dotnet.SQLUtils' when called with 7 arguments ([Index=1, Type=Byte[], Value=System.Byte[]], [Index=2, Type=Hashtable, Value=Microsoft.Spark.Interop.Internal.Java.Util.Hashtable], [Index=3, Type=ArrayList, Value=Microsoft.Spark.Interop.Internal.Java.Util.ArrayList], [Index=4, Type=String, Value=Microsoft.Spark.Worker], [Index=5, Type=String, Value=2.1.1.0], [Index=6, Type=ArrayList, Value=Microsoft.Spark.Interop.Internal.Java.Util.ArrayList], [Index=7, Type=null, Value=null], )
[2024-09-13T10:47:53.1569404Z] [machine] [Error] [JvmBridge] java.lang.NoSuchMethodError: org.apache.spark.api.python.SimplePythonFunction.<init>(Lscala/collection/Seq;Ljava/util/Map;Ljava/util/List;Ljava/lang/String;Ljava/lang/String;Ljava/util/List;Lorg/apache/spark/api/python/PythonAccumulatorV2;)V
at org.apache.spark.sql.api.dotnet.SQLUtils$.createPythonFunction(SQLUtils.scala:35)
at org.apache.spark.sql.api.dotnet.SQLUtils.createPythonFunction(SQLUtils.scala)
- Fails on 14.3: Breaking commit
Error from the Python worker:
DotnetWorker PID:[2516] Args:[-m pyspark.daemon pyspark.worker] SparkVersion:[3.5.0]
Invalid number of args: 3
After fix 1: Grouped UDFs were reworked in Databricks and now arrive as a series of batches (10k rows each) instead of a single huge one. The worker also has to read two extra ints between these batches and merge the batches before passing them to the UDF. Fix 2: the UDF now accepts an IEnumerable of RecordBatches (see the sketch below).
After fix 2: Databricks accepts all results, but I can't escape the final command loop and finish the worker.
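A minimal sketch of the "fix 2" shape described above, assuming the worker hands a group's ~10k-row Arrow batches to the UDF lazily; `ApplyGrouped` and `perBatchLogic` are hypothetical names, not the PR's actual API:

```csharp
// Sketch: a grouped UDF that streams over a sequence of RecordBatches
// instead of requiring the worker to merge them into one huge batch.
using System;
using System.Collections.Generic;
using Apache.Arrow;

internal static class GroupedUdfSketch
{
    public static IEnumerable<RecordBatch> ApplyGrouped(
        IEnumerable<RecordBatch> groupBatches,        // one group's batches, ~10k rows each
        Func<RecordBatch, RecordBatch> perBatchLogic) // the user's per-batch logic
    {
        foreach (RecordBatch batch in groupBatches)
        {
            // Streaming avoids materializing the whole group in memory and
            // matches Databricks sending each group as multiple batches.
            yield return perBatchLogic(batch);
        }
    }
}
```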
Affected tickets:
- #1170
@dotnet-policy-service agree
@grazy27
Can you share how many of the unit tests pass?
The UDF unit tests have not been updated. Are you able to get all of them to pass?
Hello @GeorgeS2019, they do.
Saw your issue; my environment probably uses UTF-8 by default.
Several tests fail from time to time with (executor driver): java.nio.file.NoSuchFileException: C:\Users\grazy27\AppData\Local\Temp\spark-cc2cf7bc-3c8c-4fdf-a496-266424de943d\userFiles-92d122bb-af9a-40ea-a430-131454afc705\archive.zip
But they pass if run a second time, so I didn't dive deeper.
What is the status of this PR?
It works, the tests pass, and performance-wise, it's the best solution I've found for integrating .NET with Spark. The next steps are on Microsoft's side.
I'm also working on implementing CoGrouped UDFs, and I plan to push those updates here as well.
@grazy27
Can you investigate whether your solution works in a polyglot .NET Interactive notebook?
Previously we all had problems with UDFs after the migration to .NET 6.
https://github.com/dotnet/spark/issues/796
https://github.com/Apress/introducing-.net-for-apache-spark/tree/main/ch04/Chapter4
The next steps are on Microsoft's side.
Any idea who is "in charge" of this repo?
@grazy27
Can you investigate whether your solution works in a polyglot .NET Interactive notebook?
Previously we all had problems with UDFs.
#796
I can take a look, but only if a lonely evening with bad weather rolls around :) No promises, as this isn’t my primary focus.
There are two suggestions from developers that might help. The first is for a separate code cell, and the second is for a separate environment variable. Have you tried both approaches, and does the issue still persist?
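For reference, a hedged sketch of what those two suggestions might look like in a notebook; the worker path and version below are invented, and this is not a verified fix for #796:

```csharp
// Hedged sketch: separate env-var setup from UDF definition, mirroring
// the "separate environment variable" and "separate code cell" suggestions.
using System;
using Microsoft.Spark.Sql;

// Cell 1: point the driver at the worker before any session exists.
Environment.SetEnvironmentVariable(
    "DOTNET_WORKER_DIR",                      // env var Microsoft.Spark reads to locate the worker
    @"C:\bin\Microsoft.Spark.Worker-2.1.1");  // hypothetical install path

// Cell 2 (a separate code cell): create the session and register the UDF there.
SparkSession spark = SparkSession.Builder().GetOrCreate();
var plusOne = Functions.Udf<int, int>(x => x + 1);
```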
Hi Ihor (@grazy27), thanks for the contribution! I recently got write permission for this repo and am happy to move this forward. Due to limited bandwidth in our team and other priorities, we don't have concrete work items on this project, but we can review your code; let's work together to move this forward!
Hello Dan <@wudanzy>,
That's fantastic news—great to hear!
I'd be happy to help with a few more issues to get this project back on track. In my opinion, the most important ones are:
- Support for UDFs when UseArrow = true
- Migrating to a standalone NuGet package for BinarySerializer and upgrading the solution to .NET 9
- Addressing the bug with Databricks 15.4
Thanks for sharing that!
@grazy27
Can you investigate whether your solution works in a polyglot .NET Interactive notebook?
Previously we all had problems with UDFs after the migration to .NET 6.
#796
https://github.com/Apress/introducing-.net-for-apache-spark/tree/main/ch04/Chapter4
Followed up in https://github.com/dotnet/spark/discussions/1179, and found a related bug: https://github.com/dotnet/spark/issues/1043
Hi @grazy27, it looks good to me, but we have to wait for the check and another review. @SparkSnail is fixing the broken check and will do another review.
Wonderful, thanks @wudanzy. I'll create a few more PRs with further improvements after this one is merged.
/AzurePipelines run
Commenter does not have sufficient privileges for PR 1178 in repo dotnet/spark
Hi @grazy27, good news: we (thanks @SparkSnail) have fixed the workflow, and it now works for external contributors; we have just upgraded to Spark 3.3 to verify the process. Please resolve the conflicts and we will trigger the test.
Please note that the test must be triggered by a committer; that is the best outcome after discussing with internal admins. Please feel free to ping the thread when you want a new test.
You can reuse this PR, or will you use another one?
Oops, that's an automatic GitHub action; I'll rebase on the weekend.
/AzurePipelines run
Commenter does not have sufficient privileges for PR 1178 in repo dotnet/spark
/AzurePipelines run
Azure Pipelines successfully started running 1 pipeline(s).
It seems you need to change global.json.
You can verify that it works by running build.cmd -pack -c Release /p:PublishSparkWorker=true /p:SparkWorkerPublishDir=D:\a\path\to\Microsoft.Spark.Worker
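For example, a minimal global.json pinning the SDK might look like the following; the exact version number is an assumption, so check the repo's build requirements:

```json
{
  "sdk": {
    "version": "8.0.100",
    "rollForward": "latestFeature"
  }
}
```

Pinning the SDK here would also explain why the build picks up a newer installed SDK when global.json is out of date.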
@wudanzy thanks for pointing this out, it's better now.
When merging, let's use rebase or a regular merge; otherwise the commit history will not be visible in main.
/AzurePipelines run
Azure Pipelines successfully started running 1 pipeline(s).
Are you able to reproduce the above errors locally? I haven't seen the error before. Got some links from the web: https://lightrun.com/answers/dotnet-sourcelink-building-a-net-core-31-project-results-in-msb4062 https://github.com/dotnet/sourcelink/issues/386
And it is weird that it is .NET 9.0 instead of 8.0.
My guess is that we should not use netstandard2.1, because all the failing ones target netstandard2.1, but I am not sure.
