incubator-graphar
[Feat][Spark] Run tests against multiple Spark versions && relax Spark version requirements
Is your feature request related to a problem? Please describe.
Currently only Spark 3.2 + Hadoop 3.2 + Scala 2.12 is supported. But at first glance the Spark/PySpark code looks like it should work with all versions of Spark starting from 3.0, and with both Hadoop 2 and Hadoop 3. I also do not see any blockers for supporting Scala 2.13.
Describe the solution you'd like
- Provide multiple builds of graphar-spark, like graphar-spark-1.0-spark-3.1, graphar-spark-1.0-spark-3.2, etc. A nice example for me is AWS Deequ (see the build sketch after this list).
- Test the PySpark bindings against multiple versions of PySpark and Python. A good example of such a workflow is PyDeequ.
- Relax the Spark version requirements for the PySpark package
- Relax the Hadoop 3.2 requirement for the Spark/PySpark packages
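For illustration, here is a minimal sbt-style sketch of what one-artifact-per-Spark-version could look like. GraphAr's actual build is Maven-based, so this is only a sketch of the shape; the `spark.version` property and the artifact suffix are illustrative:

```scala
// build.sbt -- hypothetical sketch, not GraphAr's real build (which uses Maven).
// The Spark version is passed in from the outside, e.g. sbt -Dspark.version=3.3.2
val sparkFullVersion = sys.props.getOrElse("spark.version", "3.2.2")
val sparkBinaryVersion = sparkFullVersion.split('.').take(2).mkString(".")

lazy val sparkBindings = (project in file("spark"))
  .settings(
    // Produces artifacts like graphar-spark-spark-3.2, graphar-spark-spark-3.3, ...
    name := s"graphar-spark-spark-$sparkBinaryVersion",
    scalaVersion := "2.12.17",
    // Spark itself is provided by the cluster, so it never constrains downstream users.
    libraryDependencies += "org.apache.spark" %% "spark-sql" % sparkFullVersion % Provided
  )
```

The same idea maps onto Maven profiles, one per Spark version, which is what the task list at the bottom of this issue ends up tracking.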
Describe alternatives you've considered
I do not know of any alternative solutions.
Additional context
In my experience, Spark packages are mostly used on pre-configured clusters with a fixed environment (like Databricks Runtimes). That's why it is important to relax the Spark version requirements as much as possible. Otherwise, the package will be impossible to install simply because of dependency conflicts.
Good suggestion. I think we can make this a tracking issue that keeps track of a list of tasks to support this feature.
Ok, it is not as easy as I thought. The problem is mostly in `datasources`; a lot of things change there from version to version. What would be a better solution? For example, in 3.2 `ParquetPartitionReaderFactory` does not have an `aggregation` argument, but in 3.3 it does.
Option one would be something like this:
// ParquetPartitionReaderFactory is a case class, so its `apply` lives on the
// companion object, compiled as `...ParquetPartitionReaderFactory$` with a
// static MODULE$ field holding the singleton instance.
val factoryCompanionClass = Class.forName(
  "org.apache.spark.sql.execution.datasources.v2.parquet.ParquetPartitionReaderFactory$"
)
val factoryCompanion = factoryCompanionClass.getField("MODULE$").get(null)
// Look `apply` up by name only, because its parameter list differs between Spark versions.
val factoryApply = factoryCompanionClass.getMethods.find(_.getName == "apply").get
val parquetOptions = new ParquetOptions(
  options.asCaseSensitiveMap.asScala.toMap,
  sqlConf
)
sparkSession.version match {
  case v if v.startsWith("3.2") =>
    factoryApply
      .invoke(
        factoryCompanion,
        sqlConf,
        broadcastedConf,
        dataSchema,
        readDataSchema,
        readPartitionSchema,
        pushedFilters,
        parquetOptions
      )
      .asInstanceOf[ParquetPartitionReaderFactory]
  case v if v.startsWith("3.3") =>
    factoryApply
      .invoke(
        factoryCompanion,
        sqlConf,
        broadcastedConf,
        dataSchema,
        readDataSchema,
        readPartitionSchema,
        pushedFilters,
        None, // the newly added `aggregation` argument, which is an Option
        parquetOptions
      )
      .asInstanceOf[ParquetPartitionReaderFactory]
}
(looks terrible)
Option two would be something like having bundle subdirectories and subprojects (we may need them only for the datasource). Or having branch tags like `spark3.2`, `spark3.3`, etc.
@acezen What do you think about it? To be honest, I have not faced such a case before, and I do not know which would be the better option.
UPD: I had a talk with a couple of experienced Scala developers about it, and it looks like reflection is a valid solution. Even Spark itself uses reflection for working with different versions of Hive. So, I suggest going with reflection. I can do it at least for the range of versions from 3.1.x to 3.4.x.
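To make the intent concrete, here is a hypothetical helper (not existing GraphAr code) that captures the pattern: resolve the case-class companion at runtime and pick the `apply` overload by argument count, so the caller only has to vary the argument list per Spark version:

```scala
import java.lang.reflect.Method

// Hypothetical utility, shown only to illustrate the reflection approach.
object SparkReflectionUtil {
  // Calls `apply` on the companion object of `className` with the given arguments.
  // The companion object is compiled as `className + "$"` with a static MODULE$ field.
  def invokeCompanionApply(className: String, args: AnyRef*): AnyRef = {
    val companionClass = Class.forName(className + "$")
    val companion = companionClass.getField("MODULE$").get(null)
    val applyMethod: Method = companionClass.getMethods
      .find(m => m.getName == "apply" && m.getParameterCount == args.length)
      .getOrElse(
        throw new NoSuchMethodException(
          s"$className.apply with ${args.length} arguments"
        )
      )
    applyMethod.invoke(companion, args: _*)
  }
}
```

A caller would then build the version-specific argument list (with or without the `aggregation` slot) and cast the result, much like the snippet above.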
Reflection looks good to me. As you said, Spark uses reflection in the same way; we can follow that strategy. Hi, @lixueclaire, do you have any insight on the solution Sem listed above?
Hi, @SemyonSinchenko, I'm impressed with your proposals—they seem quite solid. I did notice, though, that certain features, such as "ZSTD" compression, are only supported since Spark v3.2. To maintain compatibility with Spark v3.1.x, could we look into making these particular features optional?
I would suggest just using a different compression codec for older versions of Spark. Or we can start with something like 3.2.x - 3.4.x and extend the support only if users report issues.
We can just use `snappy` compression for older versions of Spark.
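A minimal sketch of that fallback, assuming (as stated above) that zstd output only works from Spark 3.2 on. The helper name is made up; `compression` and the codec names are standard Parquet writer options:

```scala
import org.apache.spark.sql.{DataFrame, SparkSession}

// Hypothetical helper: pick the Parquet codec based on the running Spark version,
// falling back to snappy before 3.2 as suggested above.
def writeParquet(spark: SparkSession, df: DataFrame, path: String): Unit = {
  val Array(major, minor) = spark.version.split("\\.").take(2).map(_.toInt)
  val codec = if (major > 3 || (major == 3 && minor >= 2)) "zstd" else "snappy"
  df.write.option("compression", codec).parquet(path)
}
```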
Bad news: I tried hard but failed to fix everything. A lot of things changed from 3.2 to 3.3 in the FiltersPushDown part. What do you think about splitting `datasources` into `datasources.common` and `datasources.spark32`/`datasources.spark33`/etc.? @acezen @lixueclaire
For example, there are cases where the signature of the parent class changed, and I got errors like `error: method withFilters overrides nothing`. I have no idea how to fix that with the reflection trick.
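Roughly, the split could look like the sketch below: shared code depends only on a small trait in `datasources.common`, and each Spark-specific subproject (compiled against its own Spark version) provides the implementation, so signature changes such as `withFilters` never leak into the common code. All names here are illustrative, not GraphAr's actual API.

```scala
// datasources.common -- version-agnostic interface (illustrative sketch).
import org.apache.spark.broadcast.Broadcast
import org.apache.spark.sql.connector.read.PartitionReaderFactory
import org.apache.spark.sql.execution.datasources.parquet.ParquetOptions
import org.apache.spark.sql.internal.SQLConf
import org.apache.spark.sql.sources.Filter
import org.apache.spark.sql.types.StructType
import org.apache.spark.util.SerializableConfiguration

trait ParquetReaderFactoryShim {
  def createReaderFactory(
      sqlConf: SQLConf,
      broadcastedConf: Broadcast[SerializableConfiguration],
      dataSchema: StructType,
      readDataSchema: StructType,
      readPartitionSchema: StructType,
      pushedFilters: Array[Filter],
      options: ParquetOptions): PartitionReaderFactory
}

// datasources.spark33 (compiled only against Spark 3.3) would implement this by
// calling ParquetPartitionReaderFactory with aggregation = None, while
// datasources.spark32 would call the 3.2 variant without that argument.
```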
Yes, the datasource part may change a lot across Spark versions. Feel free to change the solution and choose whichever approach is most appropriate for now.
Hi, @SemyonSinchenko, if you have some work on the issue, you can just create a work-in-progress PR. I think splitting the issue into small commits will help with back-tracing.
- [x] Split datasources and core in GraphAr spark
- [x] Support Spark 3.3.x as an additional Maven profile and a separate datasources bundle
- [ ] Provide multiple spark JARs in Maven
- [ ] Relax GraphAr PySpark requirements from pyspark 3.2 to ">=3.2,<=3.3"
- [x] Create a page in the documentation about version compatibility