Reduce code duplication around Spark SQL Extensions
- Create folders `clients/spark-extensions-common/src/main/scala/org/apache/spark/sql/catalyst/plans/logical` and `clients/spark-extensions-common/src/main/scala/org/apache/spark/sql/execution/datasources/v2`
- Move the existing `.scala` files for these packages from `clients/spark-extension/src/main/scala...` to the new folders
- Remove the `.../plans/logical` and `.../datasources/v2` directories in `spark-extensions` and `spark32-extensions`
- Create symlinks from `spark-extensions` and `spark32-extensions` to the new folders
- Let the "common" `*Command` and `*Exec` classes implement traits - maybe called `NessieCommandTrait` and `NessieExecTrait` - and implement those specifically in `spark-extensions` + `spark32-extensions` (a rough sketch follows below)
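A rough sketch of what those traits could look like (the trait names come from the bullet above; the members are assumptions, since none of this exists yet):

```scala
// Hypothetical shape of the shared traits; the members shown are assumptions.
// The common *Command / *Exec classes would implement these, and each of
// spark-extensions / spark32-extensions would bind them to its Spark types.
trait NessieCommandTrait {
  // every Nessie command carries an optional catalog name
  def catalog: Option[String]
}

trait NessieExecTrait {
  // marker for the physical-plan (Exec) side; the version-specific modules
  // add the Spark-version-dependent pieces on top
}
```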
See also https://github.com/projectnessie/nessie/pull/3319#pullrequestreview-885629501
I realized that symlinks might not be the best alternative here, sadly. If we, at some point, want to have proper incremental builds, we shouldn't use source code from a different module's file tree, because tools like GiB would not detect that a change in the "common" sources affects other modules.
But maybe it works a little bit differently:
- introduce a new module `spark-extensions-common` (or so) and move the common sources there
- package that new module as a jar
- in the `spark-extensions` and `spark32-extensions` poms, unpack the sources of the new "spark-extensions-common" module to a separate folder underneath `${project.build.directory}` and add that folder as a source folder for scalac
- for IntelliJ, via a Maven profile activated when IntelliJ is running: add symlinks from `spark-extensions` and `spark32-extensions` to the source folder in the tree of the new "spark-extensions-common" module, and let IJ know that the symlink is actually a Scala source directory
@snazy I am not sure why we need this step:

> in `spark-extensions` and `spark32-extensions` poms unpack the source of the new "spark-extensions-common" module to a separate folder underneath `${project.build.directory}` and add that as a source folder for scalac

Why can't both `spark-extensions` and `spark32-extensions` use the new module as a regular dependency?
Also, what about the `spark-extensions-base` module? Shouldn't we put the common code there?
You need to unpack it to extract the common source files.
Yes, but why can't we share the code as a normal Maven dependency? It is in the same repository, so why do we need to package that new module as a jar?
Don't take the idea "package as a jar" as a strict requirement - it was more an initial thought.
The goal is to have mostly no duplicated code. Maven builds (via CI + locally) should still work (of course). The tricky part is that IntelliJ must still be able to import a fresh clone of the source tree w/o running Maven.
My findings. We have two sources of duplication.
The first one is in the `plans.logical` package of both `spark-extensions` and `spark-3.2-extensions`.
All the `*Command`s are case classes that are built with types imported directly from the `spark-catalyst` engine, for example:
```scala
import org.apache.spark.sql.types.{DataTypes, Metadata, StructField, StructType}
```
Here is an example, `ShowReferenceCommand` from spark-3.2:
```scala
case class ShowReferenceCommand(val catalog: Option[String]) extends LeafCommand {

  override lazy val output: Seq[Attribute] = new StructType(
    Array[StructField](
      StructField("refType", DataTypes.StringType, nullable = false, Metadata.empty),
      StructField("name", DataTypes.StringType, nullable = false, Metadata.empty),
      StructField("hash", DataTypes.StringType, nullable = false, Metadata.empty)
    )
  ).toAttributes

  override def simpleString(maxFields: Int): String = {
    s"ShowReference"
  }
}
```
and this is from 3.1:
```scala
case class AssignReferenceCommand(
    reference: String,
    isBranch: Boolean,
    toRefName: Option[String],
    toHash: Option[String],
    catalog: Option[String]
) extends Command {

  override lazy val output: Seq[Attribute] = new StructType(
    Array[StructField](
      StructField("refType", DataTypes.StringType, nullable = false, Metadata.empty),
      StructField("name", DataTypes.StringType, nullable = false, Metadata.empty),
      StructField("hash", DataTypes.StringType, nullable = false, Metadata.empty)
    )
  ).toAttributes

  override def simpleString(maxFields: Int): String = {
    s"AssignReference ${reference}"
  }
}
```
They look very similar, but all the types used there come from different libraries (Spark 3.1 vs. Spark 3.2), so we cannot share a single implementation. We could extract a trait for each command, for example:
```scala
trait AssignReferenceCommandTrait {
  def reference: String
  def isBranch: Boolean
  def toRefName: Option[String]
  def toHash: Option[String]
  def catalog: Option[String]
}
```
and make the commands extend it, but it will not remove any duplication.
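To make that concrete, here is a hedged sketch of what extending the trait would look like (assuming the trait above and Spark 3.1's `Command`). The full constructor parameter list, the output schema, and `simpleString` still have to be written out in every Spark module, because Scala case classes cannot inherit constructor parameters and `Attribute`/`StructType` differ between the Spark versions:

```scala
import org.apache.spark.sql.catalyst.plans.logical.Command

// Sketch only: the trait merely pins down the field names and types. The
// case class must still declare every parameter itself, and the output
// schema / simpleString bodies stay exactly as shown above - once per
// Spark version.
case class AssignReferenceCommand(
    reference: String,
    isBranch: Boolean,
    toRefName: Option[String],
    toHash: Option[String],
    catalog: Option[String]
) extends Command
    with AssignReferenceCommandTrait {
  // output / simpleString omitted here; identical to the 3.1 version above
}
```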
The second source of duplication:
The duplication in the `*Exec` classes is already handled by the `spark-extensions-base` module and all the `Base*` classes.
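For readers unfamiliar with that module, the pattern is roughly this (a minimal, hypothetical sketch; the class and member names below are assumptions, not the actual `spark-extensions-base` code):

```scala
// Sketch of the Base*/Exec split: the shared logic is written once against
// Nessie / plain Scala types only, so it compiles independently of the
// Spark version; each Spark module adds a thin *Exec subclass that adapts
// the result to its own physical-plan and row types.
abstract class NessieExec {
  // version-independent entry point
  protected def runInternal(): Seq[AnyRef]
}

abstract class BaseShowReferenceExec(currentRefName: String) extends NessieExec {
  // shared implementation, written once for all Spark versions
  override protected def runInternal(): Seq[AnyRef] = Seq(currentRefName)
}
```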
So to summarize: at this point I don't see a clear way to reduce duplication in those modules. It is also worth noting that the duplication comes from case classes that are anemic (no logic). Any thoughts, comments? (@snazy @nastra)
Oh man, I feared that it's rather impossible to reduce code duplication and have proper IDE + build-tool support.
I suspect, we have to bite the bullet and live with the duplicated code.
Yeah, I don't have any better suggestions. I already tried to reduce code duplication when working on this initially and couldn't come up with anything better than what we have now.
Okay - then let's keep this one open. I've added a new `later` label for those kinds of issues.
Today, while analyzing an issue, I noticed that even though `BaseCreateReferenceExec` is the base class for both spark-3.1 and spark-3.2,
the spark-3.1 subclass overrides both methods of `BaseCreateReferenceExec` with exactly the same code as in `BaseCreateReferenceExec`.
I just removed `BaseCreateReferenceExec`, extended `NessieExec` directly, and the test was passing for both.
So, should we move `BaseCreateReferenceExec` to spark-3.2, as it is not really a base class?
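For illustration, the shape of the problem was roughly this (hypothetical method name and bodies, not the actual source):

```scala
// spark-extensions-base: shared implementation
abstract class BaseCreateReferenceExec {
  protected def runInternal(): Seq[AnyRef] = Seq.empty // shared body
}

// spark-3.1: the override repeats the inherited body verbatim, so the
// subclass gains nothing from the base class - removing the base class
// (and extending NessieExec directly) changes no observable behaviour.
class CreateReferenceExec extends BaseCreateReferenceExec {
  override protected def runInternal(): Seq[AnyRef] = Seq.empty // duplicate
}
```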
@ajantha-bhat I forgot to remove the implementation from those classes after I moved it to the base classes. https://github.com/projectnessie/nessie/pull/3673 cleans this up
@nastra: perfect, I was also making the same changes offline - you beat me in the race 😂
I suspect there's not much more we can do here considering that IDEs + Gradle need to work.