Spark 3.5.1, .NET 8, Dependencies and Documentation

Open grazy27 opened this pull request 1 year ago • 8 comments

Changes:

  • Implemented compatibility with Spark 3.5.1 (fixes are in a separate commit)
  • Updated project dependencies
  • Updated net6.0 -> net8.0 and net461 -> net48
  • Extracted the binary formatter into a separate class and added tests
  • Fixed several small bugs related to null references and Windows paths containing whitespace
  • Added a documentation page with component and sequence diagrams for .NET for Apache Spark (such diagrams would have helped me significantly, so they are worth adding)

Tested with:

Spark:

On every stop there is an exception that doesn't affect execution:
ERROR DotnetBackendHandler: Exception caught: java.net.SocketException: Connection reset at java.base/sun.nio.ch.SocketChannelImpl.throwConnectionReset(SocketChannelImpl.java:394)

  • Works with local 3.5.1
  • Works with local 3.5.2 (if the 'IgnorePathVersion...' setting is enabled on the Scala side)

Databricks:

  • Fails on 15.4:
[Error] [JvmBridge] JVM method execution failed: Static method 'createPythonFunction' failed for class 'org.apache.spark.sql.api.dotnet.SQLUtils' when called with 7 arguments ([Index=1, Type=Byte[], Value=System.Byte[]], [Index=2, Type=Hashtable, Value=Microsoft.Spark.Interop.Internal.Java.Util.Hashtable], [Index=3, Type=ArrayList, Value=Microsoft.Spark.Interop.Internal.Java.Util.ArrayList], [Index=4, Type=String, Value=Microsoft.Spark.Worker], [Index=5, Type=String, Value=2.1.1.0], [Index=6, Type=ArrayList, Value=Microsoft.Spark.Interop.Internal.Java.Util.ArrayList], [Index=7, Type=null, Value=null], )
[2024-09-13T10:47:53.1569404Z] [machine] [Error] [JvmBridge] java.lang.NoSuchMethodError: org.apache.spark.api.python.SimplePythonFunction.<init>(Lscala/collection/Seq;Ljava/util/Map;Ljava/util/List;Ljava/lang/String;Ljava/lang/String;Ljava/util/List;Lorg/apache/spark/api/python/PythonAccumulatorV2;)V
	at org.apache.spark.sql.api.dotnet.SQLUtils$.createPythonFunction(SQLUtils.scala:35)
	at org.apache.spark.sql.api.dotnet.SQLUtils.createPythonFunction(SQLUtils.scala)
Error from python worker:
  DotnetWorker PID:[2516] Args:[-m pyspark.daemon pyspark.worker] SparkVersion:[3.5.0]
  Invalid number of args: 3

After fix 1: grouped UDFs were reworked in Databricks and now deliver a series of batches (about 10k rows each) instead of a single huge one. Two int values also need to be read between these batches, and the batches merged before being passed to the UDF. Fix 2: the UDF now accepts an IEnumerable of RecordBatches.

After fix 2: Databricks accepts all results, but I can't exit the final command loop and finish the worker.
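
For illustration, here's a minimal sketch (using the Apache.Arrow C# API) of the shape such a grouped UDF can take once it consumes a sequence of record batches; the function name and the row-count aggregation are placeholders, not code from this PR:

```csharp
using System.Collections.Generic;
using Apache.Arrow;
using Apache.Arrow.Types;

static class GroupedUdfSketch
{
    // Hypothetical grouped UDF body: consumes the batches lazily and aggregates
    // incrementally, so the ~10k-row batches never need to be merged into one
    // huge RecordBatch in memory.
    public static RecordBatch CountRows(IEnumerable<RecordBatch> batches)
    {
        long rows = 0;
        foreach (RecordBatch batch in batches)
        {
            rows += batch.Length; // Length is the row count of this batch
        }

        // Return a single one-row batch with the aggregated result.
        return new RecordBatch(
            new Schema.Builder()
                .Field(f => f.Name("count").DataType(Int64Type.Default))
                .Build(),
            new IArrowArray[] { new Int64Array.Builder().Append(rows).Build() },
            length: 1);
    }
}
```

Aggregating incrementally like this avoids materializing one large merged batch in worker memory.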

Affected tickets:

  • #1170

grazy27 avatar Jul 08 '24 20:07 grazy27

@dotnet-policy-service agree

grazy27 avatar Jul 08 '24 20:07 grazy27

@grazy27

Can you share how many of the unit tests pass?

The UDF unit tests have not been updated. Are you able to get all of them to pass?

image

GeorgeS2019 avatar Jul 22 '24 04:07 GeorgeS2019

> @grazy27
>
> Can you share how many of the unit tests pass?
>
> The UDF unit tests have not been updated. Are you able to get all of them to pass?
>
> image

Hello @GeorgeS2019, they do. image

I saw your issue; my environment probably uses UTF-8 by default. Several tests fail from time to time with (executor driver): java.nio.file.NoSuchFileException: C:\Users\grazy27\AppData\Local\Temp\spark-cc2cf7bc-3c8c-4fdf-a496-266424de943d\userFiles-92d122bb-af9a-40ea-a430-131454afc705\archive.zip, but they pass when run a second time, so I didn't dive deeper.

grazy27 avatar Jul 22 '24 11:07 grazy27

What is the status of this PR?

travis-leith avatar Aug 26 '24 07:08 travis-leith

> What is the status of this PR?

It works, the tests pass, and performance-wise, it's the best solution I've found for integrating .NET with Spark. The next steps are on Microsoft's side.

I'm also working on implementing CoGrouped UDFs, and I plan to push those updates here as well.

grazy27 avatar Aug 26 '24 07:08 grazy27

@grazy27

Can you investigate whether your solution works in a polyglot .NET Interactive notebook?

Previously we all had problems with UDFs after the adjustments made to migrate to .NET 6.

https://github.com/dotnet/spark/issues/796

image https://github.com/Apress/introducing-.net-for-apache-spark/tree/main/ch04/Chapter4

GeorgeS2019 avatar Aug 26 '24 07:08 GeorgeS2019

> The next steps are on Microsoft's side.

Any idea who is "in charge" of this repo?

travis-leith avatar Aug 26 '24 07:08 travis-leith

> @grazy27
>
> Can you investigate whether your solution works in a polyglot .NET Interactive notebook?
>
> Previously we all had problems with UDFs.
>
> #796

I can take a look, but only if a lonely evening with bad weather rolls around :) No promises, as this isn’t my primary focus.

There are two suggestions from developers that might help: the first is to use a separate code cell, and the second is to set a dedicated environment variable. Have you tried both approaches, and does the issue still persist?
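
(For reference, a rough sketch of the environment-variable approach, assuming it refers to DOTNET_ASSEMBLY_SEARCH_PATHS, which the worker uses to probe for UDF assemblies; the directory path below is a placeholder.)

```csharp
using System;

// Hypothetical notebook cell: point the Microsoft.Spark worker at the directory
// that contains the compiled UDF assemblies, before the SparkSession is created.
Environment.SetEnvironmentVariable(
    "DOTNET_ASSEMBLY_SEARCH_PATHS",
    @"C:\path\to\compiled\udf\assemblies");
```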

grazy27 avatar Aug 26 '24 07:08 grazy27

Hi Ihor (@grazy27), thanks for the contribution! I recently got write permission for this repo and am happy to move this forward. Due to limited bandwidth on our team and other priorities, we don't have concrete work items for this project, but we can review your code, so let's work together to move this forward!

wudanzy avatar Nov 25 '24 02:11 wudanzy

> Hi Ihor (@grazy27), thanks for the contribution! I recently got write permission for this repo and am happy to move this forward. Due to limited bandwidth on our team and other priorities, we don't have concrete work items for this project, but we can review your code, so let's work together to move this forward!

Hello Dan (@wudanzy),

That's fantastic news—great to hear!

I'd be happy to help with a few more issues to get this project back on track. In my opinion, the most important ones are:

  • Support for UDFs when UseArrow = true
  • Migrating to a standalone NuGet package for BinarySerializer and upgrading the solution to .NET 9
  • Addressing the bug with Databricks 15.4

grazy27 avatar Nov 25 '24 07:11 grazy27

Thanks for sharing that!

wudanzy avatar Nov 25 '24 10:11 wudanzy

> @grazy27
>
> Can you investigate whether your solution works in a polyglot .NET Interactive notebook?
>
> Previously we all had problems with UDFs after the adjustments made to migrate to .NET 6.
>
> #796
>
> image https://github.com/Apress/introducing-.net-for-apache-spark/tree/main/ch04/Chapter4

Followed up in https://github.com/dotnet/spark/discussions/1179 and found a related bug: https://github.com/dotnet/spark/issues/1043.

grazy27 avatar Dec 01 '24 15:12 grazy27

Hi @grazy27, it looks good to me, but we have to wait for the checks and another review. @SparkSnail is fixing the broken check and will do another review.

wudanzy avatar Dec 05 '24 09:12 wudanzy

> Hi @grazy27, it looks good to me, but we have to wait for the checks and another review. @SparkSnail is fixing the broken check and will do another review.

Wonderful, thanks @wudanzy. I'll create a few more PRs with further improvements after this one is merged.

grazy27 avatar Dec 05 '24 16:12 grazy27

/AzurePipelines run

grazy27 avatar Dec 14 '24 10:12 grazy27

Commenter does not have sufficient privileges for PR 1178 in repo dotnet/spark

azure-pipelines[bot] avatar Dec 14 '24 10:12 azure-pipelines[bot]

Hi @grazy27, the good news is that we (thanks @SparkSnail) have fixed the workflow and it now works for external contributors; we just upgraded to Spark 3.3 to verify the process. Please resolve the conflicts and we will trigger the test.

wudanzy avatar Dec 18 '24 03:12 wudanzy

Please note that the test must be triggered by a committer; that is the best outcome we could reach after discussing with internal admins. Please feel free to ping the thread whenever you want a new test run.

wudanzy avatar Dec 18 '24 03:12 wudanzy

You can reuse this PR, or will you use another one?

wudanzy avatar Dec 20 '24 23:12 wudanzy

Oops, that's an automatic GitHub action; I'll rebase on the weekend.

grazy27 avatar Dec 21 '24 00:12 grazy27

/AzurePipelines run

grazy27 avatar Dec 22 '24 09:12 grazy27

Commenter does not have sufficient privileges for PR 1178 in repo dotnet/spark

azure-pipelines[bot] avatar Dec 22 '24 09:12 azure-pipelines[bot]

/AzurePipelines run

wudanzy avatar Dec 22 '24 09:12 wudanzy

Azure Pipelines successfully started running 1 pipeline(s).

azure-pipelines[bot] avatar Dec 22 '24 09:12 azure-pipelines[bot]

It seems that you need to change global.json.

You may verify whether it works by running build.cmd -pack -c Release /p:PublishSparkWorker=true /p:SparkWorkerPublishDir=D:\a\path\to\Microsoft.Spark.Worker
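
(For reference, a minimal global.json sketch that pins the build to a .NET 8 SDK; the exact version number and roll-forward policy are assumptions, not values taken from this PR.)

```json
{
  "sdk": {
    "version": "8.0.100",
    "rollForward": "latestFeature"
  }
}
```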

wudanzy avatar Dec 22 '24 14:12 wudanzy

@wudanzy thanks for pointing this out; it's better now. image

When merging, let's use a rebase or a merge commit, as otherwise the commit history won't be visible in main.

grazy27 avatar Dec 23 '24 11:12 grazy27

/AzurePipelines run

wudanzy avatar Dec 23 '24 13:12 wudanzy

Azure Pipelines successfully started running 1 pipeline(s).

azure-pipelines[bot] avatar Dec 23 '24 13:12 azure-pipelines[bot]

Are you able to reproduce the above errors locally? I haven't seen this error before. I found some links on the web: https://lightrun.com/answers/dotnet-sourcelink-building-a-net-core-31-project-results-in-msb4062 https://github.com/dotnet/sourcelink/issues/386

And it is weird that it uses .NET 9.0 instead of 8.0.

wudanzy avatar Dec 23 '24 14:12 wudanzy

My guess is that we should not use netstandard2.1, because all the failing ones target netstandard2.1, but I am not sure.

wudanzy avatar Dec 23 '24 14:12 wudanzy