Spark 3.5.1, .NET 8, Dependencies and Documentation
Changes:
- Implemented compatibility with Spark 3.5.1 (fixes are in a separate commit)
- Updated project dependencies
- Updated .net6 -> .net8 and .net461 -> .net48
- Extracted the binary formatter into a separate class and added tests (a rough shape is sketched after this list)
- Fixed several small bugs related to null references and Windows paths containing whitespace
- Added a documentation page with component and sequence diagrams for .NET for Apache Spark. (Such diagrams would have helped me significantly, so they're worth adding.)
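For illustration only, a minimal sketch of what extracting the binary formatter into one testable seam might look like; `BinarySerde` is an invented name, not the PR's actual class, and on .NET 8 `BinaryFormatter` additionally needs the `EnableUnsafeBinaryFormatterSerialization` project flag at runtime:

```csharp
// Hypothetical sketch: one small class that isolates the obsolete
// BinaryFormatter so it can be unit-tested and later swapped out.
using System.IO;
using System.Runtime.Serialization.Formatters.Binary;

internal static class BinarySerde
{
#pragma warning disable SYSLIB0011 // BinaryFormatter is obsolete in .NET 8
    // Serializes an object graph to a byte array.
    public static byte[] Serialize(object graph)
    {
        using var ms = new MemoryStream();
        new BinaryFormatter().Serialize(ms, graph);
        return ms.ToArray();
    }

    // Deserializes a byte array produced by Serialize.
    public static object Deserialize(byte[] payload)
    {
        using var ms = new MemoryStream(payload);
        return new BinaryFormatter().Deserialize(ms);
    }
#pragma warning restore SYSLIB0011
}
```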
Tested with:
Spark:
- Each time on stop there's an exception that doesn't affect execution:
ERROR DotnetBackendHandler: Exception caught: java.net.SocketException: Connection reset at java.base/sun.nio.ch.SocketChannelImpl.throwConnectionReset(SocketChannelImpl.java:394)
- Works with local 3.5.1
- Works with local 3.5.2 (if the 'IgnorePathVersion...' setting is enabled on the Scala side)
Databricks:
- Fails on 15.4:
[Error] [JvmBridge] JVM method execution failed: Static method 'createPythonFunction' failed for class 'org.apache.spark.sql.api.dotnet.SQLUtils' when called with 7 arguments ([Index=1, Type=Byte[], Value=System.Byte[]], [Index=2, Type=Hashtable, Value=Microsoft.Spark.Interop.Internal.Java.Util.Hashtable], [Index=3, Type=ArrayList, Value=Microsoft.Spark.Interop.Internal.Java.Util.ArrayList], [Index=4, Type=String, Value=Microsoft.Spark.Worker], [Index=5, Type=String, Value=2.1.1.0], [Index=6, Type=ArrayList, Value=Microsoft.Spark.Interop.Internal.Java.Util.ArrayList], [Index=7, Type=null, Value=null], )
[2024-09-13T10:47:53.1569404Z] [machine] [Error] [JvmBridge] java.lang.NoSuchMethodError: org.apache.spark.api.python.SimplePythonFunction.<init>(Lscala/collection/Seq;Ljava/util/Map;Ljava/util/List;Ljava/lang/String;Ljava/lang/String;Ljava/util/List;Lorg/apache/spark/api/python/PythonAccumulatorV2;)V
at org.apache.spark.sql.api.dotnet.SQLUtils$.createPythonFunction(SQLUtils.scala:35)
at org.apache.spark.sql.api.dotnet.SQLUtils.createPythonFunction(SQLUtils.scala)
- Fails on 14.3: Breaking commit
Error from the Python worker:
DotnetWorker PID:[2516] Args:[-m pyspark.daemon pyspark.worker] SparkVersion:[3.5.0]
Invalid number of args: 3
After fix 1: Grouped UDFs were reworked in Databricks and now arrive as a series of batches (10k rows each) instead of a single huge one. The worker also has to read two extra ints between these batches and merge the batches before passing them to the UDF. Fix 2: the UDF now accepts an IEnumerable of RecordBatches (see the sketch below).
After fix 2: Databricks accepts all results, but I can't escape the final command loop and finish the worker.
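A minimal sketch of the "fix 2" shape described above, assuming the worker hands a group's ~10k-row Arrow batches to the UDF lazily; `ApplyGrouped` and `perBatchLogic` are hypothetical names, not the PR's actual API:

```csharp
// Sketch: a grouped UDF that streams over a sequence of RecordBatches
// instead of requiring the worker to merge them into one huge batch.
using System;
using System.Collections.Generic;
using Apache.Arrow;

internal static class GroupedUdfSketch
{
    public static IEnumerable<RecordBatch> ApplyGrouped(
        IEnumerable<RecordBatch> groupBatches,        // one group's batches, ~10k rows each
        Func<RecordBatch, RecordBatch> perBatchLogic) // the user's per-batch logic
    {
        foreach (RecordBatch batch in groupBatches)
        {
            // Streaming avoids materializing the whole group in memory and
            // matches Databricks sending each group as multiple batches.
            yield return perBatchLogic(batch);
        }
    }
}
```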
Affected tickets:
- #1170
@dotnet-policy-service agree
@grazy27
Can you share how many of the unit tests pass?
The UDF unit tests have not been updated. Are you able to get all of them to pass?
Hello @GeorgeS2019, they do.
Saw your issue; my environment probably uses UTF-8 by default.
Several tests fail from time to time with (executor driver): java.nio.file.NoSuchFileException: C:\Users\grazy27\AppData\Local\Temp\spark-cc2cf7bc-3c8c-4fdf-a496-266424de943d\userFiles-92d122bb-af9a-40ea-a430-131454afc705\archive.zip
But they pass if run a second time, so I didn't dive deeper.
What is the status of this PR?
It works, the tests pass, and performance-wise, it's the best solution I've found for integrating .NET with Spark. The next steps are on Microsoft's side.
I'm also working on implementing CoGrouped UDFs, and I plan to push those updates here as well.
@grazy27
Can you investigate whether your solution works in a polyglot .NET Interactive notebook?
Previously we all had problems with UDFs after the migration to .NET 6.
https://github.com/dotnet/spark/issues/796
https://github.com/Apress/introducing-.net-for-apache-spark/tree/main/ch04/Chapter4
The next steps are on Microsoft's side.
Any idea who is "in charge" of this repo?
@grazy27
Can you investigate whether your solution works in a polyglot .NET Interactive notebook?
Previously we all had problems with UDFs.
#796
I can take a look, but only if a lonely evening with bad weather rolls around :) No promises, as this isn’t my primary focus.
There are two suggestions from developers that might help. The first is for a separate code cell, and the second is for a separate environment variable. Have you tried both approaches, and does the issue still persist?
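For reference, a hedged sketch of what those two suggestions might look like in a notebook; the worker path and version below are invented, and this is not a verified fix for #796:

```csharp
// Hedged sketch: separate env-var setup from UDF definition, mirroring
// the "separate environment variable" and "separate code cell" suggestions.
using System;
using Microsoft.Spark.Sql;

// Cell 1: point the driver at the worker before any session exists.
Environment.SetEnvironmentVariable(
    "DOTNET_WORKER_DIR",                      // env var Microsoft.Spark reads to locate the worker
    @"C:\bin\Microsoft.Spark.Worker-2.1.1");  // hypothetical install path

// Cell 2 (a separate code cell): create the session and register the UDF there.
SparkSession spark = SparkSession.Builder().GetOrCreate();
var plusOne = Functions.Udf<int, int>(x => x + 1);
```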
Hi Ihor (@grazy27), thanks for the contribution! I recently got write permission for this repo and am happy to move this forward. Due to limited bandwidth in our team and other priorities, we don't have concrete work items on this project, but we can review your code; let's work together to move this forward!
Hello Dan <@wudanzy>,
That's fantastic news—great to hear!
I'd be happy to help with a few more issues to get this project back on track. In my opinion, the most important ones are:
- Support for UDFs when UseArrow = true
- Migrating to a standalone NuGet package for BinarySerializer and upgrading the solution to .NET 9
- Addressing the bug with Databricks 15.4
Thanks for sharing that!
@grazy27
Can you investigate whether your solution works in a polyglot .NET Interactive notebook?
Previously we all had problems with UDFs after the migration to .NET 6.
#796
https://github.com/Apress/introducing-.net-for-apache-spark/tree/main/ch04/Chapter4
Followed up in https://github.com/dotnet/spark/discussions/1179, and found a related bug: https://github.com/dotnet/spark/issues/1043
Hi @grazy27, it looks good to me, but we have to wait for the check and another review. @SparkSnail is fixing the broken check and will do another review.
Wonderful, thanks @wudanzy. I'll create a few more PRs with further improvements after this one is merged.
/AzurePipelines run
Commenter does not have sufficient privileges for PR 1178 in repo dotnet/spark
Hi @grazy27, good news: we (thanks @SparkSnail) have fixed the workflow, and it now works for external contributors; we have just upgraded to Spark 3.3 to verify the process. Please resolve the conflicts and we will trigger the test.
Please note that the test must be triggered by a committer; that is the best outcome after discussing with internal admins. Please feel free to ping the thread when you want a new test.
You can reuse this PR, or will you use another one?
Oops, that's an automatic GitHub action; I'll rebase on the weekend.
/AzurePipelines run
Commenter does not have sufficient privileges for PR 1178 in repo dotnet/spark
/AzurePipelines run
Azure Pipelines successfully started running 1 pipeline(s).
It seems you need to change global.json.
You can verify that it works by running build.cmd -pack -c Release /p:PublishSparkWorker=true /p:SparkWorkerPublishDir=D:\a\path\to\Microsoft.Spark.Worker
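For example, a minimal global.json pinning the SDK might look like the following; the exact version number is an assumption, so check the repo's build requirements:

```json
{
  "sdk": {
    "version": "8.0.100",
    "rollForward": "latestFeature"
  }
}
```

Pinning the SDK here would also explain why the build picks up a newer installed SDK when global.json is out of date.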
@wudanzy thanks for pointing this out, it's better now.
When merging, let's use rebase or a regular merge; otherwise the commit history will not be visible in main.
/AzurePipelines run
Azure Pipelines successfully started running 1 pipeline(s).
Are you able to reproduce the above errors locally? I haven't seen the error before. Got some links from the web: https://lightrun.com/answers/dotnet-sourcelink-building-a-net-core-31-project-results-in-msb4062 https://github.com/dotnet/sourcelink/issues/386
And it is weird that it is .NET 9.0 instead of 8.0.
My guess is that we should not use netstandard2.1, because all the failing ones target netstandard2.1, but I am not sure.
