Reduce code duplication around Spark SQL Extensions

Open · nastra opened this issue 3 years ago · 12 comments

  • Create folders clients/spark-extensions-common/src/main/scala/org/apache/spark/sql/catalyst/plans/logical and clients/spark-extensions-common/src/main/scala/org/apache/spark/sql/execution/datasources/v2
  • Move the existing .scala files for these packages from clients/spark-extensions/src/main/scala... to the new folders
  • Remove the .../plans/logical and .../datasources/v2 directories in spark-extensions and spark32-extensions
  • Create symlinks from spark-extensions and spark32-extensions to the new folders
  • Let the "common" *Command and *Exec classes implement traits - maybe called NessieCommandTrait and NessieExecTrait - and implement those specifically in spark-extensions + spark32-extensions (see the sketch below)

See also https://github.com/projectnessie/nessie/pull/3319#pullrequestreview-885629501
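
A minimal sketch of what the last bullet above could look like; the trait names follow the proposal, while the members and the commented-out case class are illustrative assumptions, not the actual Nessie sources:

// Sketch only: the shared traits would live in the common module and declare
// just the Nessie-specific members; each Spark-version-specific module
// (spark-extensions, spark32-extensions) keeps its own case classes that mix
// the traits in, because the Spark plan/exec base types differ per version.
trait NessieCommandTrait {
  def catalog: Option[String]
}

trait NessieExecTrait {
  def catalog: Option[String]
}

// In spark32-extensions (illustrative, against the Spark 3.2 API):
// case class ShowReferenceCommand(catalog: Option[String])
//     extends LeafCommand with NessieCommandTrait { ... }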

nastra avatar Feb 17 '22 13:02 nastra

I realized that symlinks might not be the best alternative here, sadly. If we, at some point, want to have proper incremental builds, we shouldn't use source code from a different module's file tree, because tools like GiB would not detect that a change in the "common" sources affects other modules.

But maybe it can work a little differently:

  • introduce a new module spark-extensions-common (or so) and move the common sources there
  • package that new module as a jar
  • in the spark-extensions and spark32-extensions poms, unpack the sources of the new "spark-extensions-common" module into a separate folder underneath ${project.build.directory} and add that as a source folder for scalac
  • for IntelliJ, via a Maven profile activated when IntelliJ is running: add symlinks from spark-extensions and spark32-extensions to the source folder in the tree of the new "spark-extensions-common" module, and let IJ know that the symlink is actually a Scala source directory

snazy avatar Feb 24 '22 07:02 snazy

@snazy I am not sure why we need this step:

in the spark-extensions and spark32-extensions poms, unpack the sources of the new "spark-extensions-common" module into a separate folder underneath ${project.build.directory} and add that as a source folder for scalac

Why can't both spark-extensions and spark32-extensions use the new module as a dependency?

Also, what about the spark-extensions-base module? Shouldn't we put the common code there?

tomekl007 avatar Feb 28 '22 11:02 tomekl007

You need to unpack it to extract the common source files.

snazy avatar Feb 28 '22 11:02 snazy

You need to unpack it to extract the common source files.

Yes, but why can't we share the code as a normal Maven dependency? It is in the same repository, so why do we need to package that new module as a jar?

tomekl007 avatar Feb 28 '22 12:02 tomekl007

Don't take the idea of "package as a jar" as a strict requirement - it was more of an initial thought.

The goal is to have mostly no duplicated code. Maven builds (via CI and locally) should of course still work. The tricky part is that IntelliJ must be able to import a fresh clone of the source tree without running Maven.

snazy avatar Feb 28 '22 13:02 snazy

My findings: we have two sources of duplication.

The first one is in the plans.logical package of both spark-extensions and spark-3.2-extensions. All the *Command classes are case classes built with types imported directly from spark-catalyst, for example: import org.apache.spark.sql.types.{DataTypes, Metadata, StructField, StructType}. Here is ShowReferenceCommand from spark-3.2:


case class ShowReferenceCommand(val catalog: Option[String])
    extends LeafCommand {
  override lazy val output: Seq[Attribute] = new StructType(
    Array[StructField](
      StructField(
        "refType",
        DataTypes.StringType,
        nullable = false,
        Metadata.empty
      ),
      StructField(
        "name",
        DataTypes.StringType,
        nullable = false,
        Metadata.empty
      ),
      StructField(
        "hash",
        DataTypes.StringType,
        nullable = false,
        Metadata.empty
      )
    )
  ).toAttributes

  override def simpleString(maxFields: Int): String = {
    s"ShowReference"
  }
}

and this is AssignReferenceCommand from spark-3.1:

case class AssignReferenceCommand(
    reference: String,
    isBranch: Boolean,
    toRefName: Option[String],
    toHash: Option[String],
    catalog: Option[String]
) extends Command {
  override lazy val output: Seq[Attribute] = new StructType(
    Array[StructField](
      StructField(
        "refType",
        DataTypes.StringType,
        nullable = false,
        Metadata.empty
      ),
      StructField(
        "name",
        DataTypes.StringType,
        nullable = false,
        Metadata.empty
      ),
      StructField(
        "hash",
        DataTypes.StringType,
        nullable = false,
        Metadata.empty
      )
    )
  ).toAttributes

  override def simpleString(maxFields: Int): String = {
    s"AssignReference ${reference}"
  }
}

They look very similar, but all the types used there come from different library versions (Spark 3.1 vs. Spark 3.2), so we cannot make them common. We could extract a trait for each command, for example:

trait AssignReferenceCommandTrait {
  def reference: String
  def isBranch: Boolean
  def toRefName: Option[String]
  def toHash: Option[String]
  def catalog: Option[String]
}

and make the commands extend it, but that would not remove any duplication, as the sketch below shows.
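
For illustration, a hedged sketch of what extending such a trait would look like, based on the spark-3.1 snippet above; the trait mix-in is the only new part. The case-class parameters and the Spark-version-specific output schema still have to be repeated in each module, so the trait only enforces a shared shape:

// Sketch only: same class as above, now mixing in the hypothetical trait.
// Everything Spark-specific (Command, Attribute, StructType, DataTypes, ...)
// still comes from that module's Spark version, so the body stays duplicated.
case class AssignReferenceCommand(
    reference: String,
    isBranch: Boolean,
    toRefName: Option[String],
    toHash: Option[String],
    catalog: Option[String]
) extends Command
    with AssignReferenceCommandTrait {

  override lazy val output: Seq[Attribute] = new StructType(
    Array[StructField](
      StructField("refType", DataTypes.StringType, nullable = false, Metadata.empty),
      StructField("name", DataTypes.StringType, nullable = false, Metadata.empty),
      StructField("hash", DataTypes.StringType, nullable = false, Metadata.empty)
    )
  ).toAttributes

  override def simpleString(maxFields: Int): String = s"AssignReference ${reference}"
}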

The second source of duplication, in the *Exec classes, is already handled by the spark-extensions-base module and its Base* classes.

So to summarize, at this point I don't see a clear way to reduce duplication in those modules. It is also worth noting that the duplication comes from case classes that are anemic (they contain no logic). Any thoughts, comments? (@snazy @nastra)

tomekl007 avatar Mar 01 '22 11:03 tomekl007

Oh man, I feared that it's rather impossible to reduce code duplication and have proper IDE + build-tool support.

I suspect, we have to bite the bullet and live with the duplicated code.

snazy avatar Mar 01 '22 11:03 snazy

Yeah, I don't have any better suggestions. I already tried to reduce code duplication when working on this initially and couldn't come up with anything better than what we have now.

nastra avatar Mar 01 '22 13:03 nastra

Okay - then let's keep this one open. I've added a new "later" label for these kinds of issues.

snazy avatar Mar 01 '22 13:03 snazy

Today, while analyzing another issue, I saw that even though BaseCreateReferenceExec is the base class for both spark-3.1 and spark-3.2, the spark-3.1 subclass overrides both methods of BaseCreateReferenceExec with exactly the same code as in BaseCreateReferenceExec.

I just removed BaseCreateReferenceExec, extended NessieExec directly, and the test was passing for both. So should we move BaseCreateReferenceExec to spark-3.2, as it is not really a base class?
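
To make the pattern concrete, here is a minimal, hypothetical sketch of the redundant-override situation being described; the method name runSomething and the bodies are assumptions for illustration, not the actual Nessie sources:

// Hypothetical sketch of the redundant-override pattern: the spark-3.1
// subclass repeats the base implementation verbatim, so the override (or,
// as tried above, the base class itself) can be removed without changing behavior.
abstract class BaseCreateReferenceExec {
  // shared implementation, meant to live in spark-extensions-base
  protected def runSomething(): Seq[String] = Seq("created reference")
}

class CreateReferenceExec extends BaseCreateReferenceExec {
  // repeats the inherited implementation word for word - adds nothing
  override protected def runSomething(): Seq[String] = Seq("created reference")
}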

ajantha-bhat avatar Mar 21 '22 06:03 ajantha-bhat

@ajantha-bhat I forgot to remove the implementation from those classes after I moved it to the base classes. https://github.com/projectnessie/nessie/pull/3673 cleans this up

nastra avatar Mar 21 '22 07:03 nastra

@ajantha-bhat I forgot to remove the implementation from those classes after I moved it to the base classes. https://github.com/projectnessie/nessie/pull/3673 cleans this up

@nastra: perfect, I was also making the same changes offline; you beat me to it 😂

ajantha-bhat avatar Mar 21 '22 07:03 ajantha-bhat

I suspect there's not much more we can do here considering that IDEs + Gradle need to work.

snazy avatar Sep 06 '22 11:09 snazy