Spark 2.X.X support?

Open SemyonSinchenko opened this issue 3 years ago • 2 comments

Question

Are the 2.X.X versions of Apache Spark supported?

Further Information

I see a pyspark 3.2.0 dependency in pyproject.toml. But real enterprise and on-premise clusters typically run version 2.X.X. Is any Spark version other than 3.2.0 supported?

System Information

  • OS: RHEL
  • OS Version: 8
  • Language Version: 3.7
  • Package Manager Version: PIP

Additional Context

It would be good to see a list of supported Spark/Beam versions, but I couldn't find one. Maybe there is one? In that case, could you please send me a link? Thank you!

SemyonSinchenko, Jan 29 '22 07:01

We haven't tested on 2.X yet, though I think it should be easy to support 2.X (it might even work with 2.X out of the box). That's because PipelineDP needs only some basic RDD APIs, like map, reduceByKey, join etc. (there is no support yet for other Spark APIs such as DataFrames). You can see every Spark API it uses in the SparkRDDBackend class. If you have any feedback on using Spark, please let me know. Also, if you test it with Spark 2.*, please let me know the results.

In the next release, we will remove the restriction to 3.2.0.
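
For anyone who wants to try this on a 2.X cluster, here is a minimal sketch of a smoke test for the three RDD operations named above (map, reduceByKey, join). It is not part of PipelineDP and does not exercise PipelineDP itself; the function name `smoke_test_rdd_api` and the sample data are made up for illustration, and SparkRDDBackend may rely on further RDD methods that this check does not cover.

```python
def smoke_test_rdd_api(sc):
    """Run map, reduceByKey and join on a tiny dataset.

    `sc` is an already created pyspark SparkContext (any Spark version).
    Returns the joined result as a dict, or raises AssertionError if the
    result differs from what these operations should produce.
    """
    # Hypothetical sample data: (user, value) and (user, country) pairs.
    visits = sc.parallelize([("alice", 1), ("bob", 2), ("alice", 3)])
    countries = sc.parallelize([("alice", "NL"), ("bob", "DE")])

    # map: double every value, keeping the key.
    doubled = visits.map(lambda kv: (kv[0], kv[1] * 2))
    # reduceByKey: sum the doubled values per user.
    totals = doubled.reduceByKey(lambda a, b: a + b)
    # join: attach the country to each per-user total.
    joined = totals.join(countries)

    result = dict(joined.collect())
    expected = {"alice": (8, "NL"), "bob": (4, "DE")}
    if result != expected:
        raise AssertionError("unexpected RDD result: %r" % (result,))
    return result


# Usage (needs pyspark and a JVM on the machine, so it is left commented out):
#   import pyspark
#   from pyspark import SparkContext
#   sc = SparkContext("local[1]", "rdd-smoke-test")
#   print(pyspark.__version__, smoke_test_rdd_api(sc))
#   sc.stop()
```

If this passes on a given Spark version, the basic RDD calls behave as expected there; whether PipelineDP as a whole works on that version still has to be tested separately.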

dvadym, Jan 29 '22 17:01

Thanks a lot for such a fast answer. I'll write a comment here about my tests on Spark 2.3.0.

SemyonSinchenko, Jan 29 '22 19:01