Reduce code duplication around Spark SQL Extensions
- Create folders `clients/spark-extensions-common/src/main/scala/org/apache/spark/sql/catalyst/plans/logical` and `clients/spark-extensions-common/src/main/scala/org/apache/spark/sql/execution/datasources/v2`
- Move the existing `.scala` files for these packages from `clients/spark-extensions/src/main/scala...` to the new folders
- Remove the `.../plans/logical` and `.../datasources/v2` directories in `spark-extensions` and `spark32-extensions`
- Create symlinks from `spark-extensions` and `spark32-extensions` to the new folders
- Let the "common" `*Command` and `*Exec` classes implement traits - maybe called `NessieCommandTrait` and `NessieExecTrait` - and implement those specifically in `spark-extensions` + `spark32-extensions` (a rough sketch follows this list)

See also https://github.com/projectnessie/nessie/pull/3319#pullrequestreview-885629501
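A minimal sketch of what that trait split might look like; the trait name comes from the list above, but everything else here is illustrative, not a settled design:

```scala
// Hypothetical sketch, assuming the common module stays free of Spark types
// while each Spark-version module supplies the version-specific class.

// In spark-extensions-common: a version-agnostic contract only.
trait NessieCommandTrait {
  def catalog: Option[String]
}

// In spark-extensions (Spark 3.1) / spark32-extensions (Spark 3.2), the
// concrete command would mix the trait into that version's Spark hierarchy:
//
//   case class ShowReferenceCommand(catalog: Option[String])
//     extends LeafCommand with NessieCommandTrait { ... }
```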
I realized that symlinks might not be the best alternative here, sadly. If we, at some point, want to have proper incremental builds, we shouldn't use source code from a different module's file tree, because tools like GiB would not detect that a change in the "common" sources affects other modules.
But maybe it works a little bit differently:
- introduce a new module `spark-extensions-common` (or so) and move the common sources there
- package that new module as a `jar`
- in the `spark-extensions` and `spark32-extensions` poms, unpack the source of the new "spark-extensions-common" module to a separate folder underneath `${project.build.directory}` and add that as a source folder for scalac (see the sketch after this list)
- for IntelliJ, via a Maven profile activated when IntelliJ's running: add symlinks from `spark-extensions` and `spark32-extensions` to the source folder in the tree of the new "spark-extensions-common" module, and let IJ know that the symlink is actually a Scala source directory
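A minimal sketch of how that unpack-and-add-source wiring could look in the two poms, assuming the common module attaches a `sources` jar. The plugin goals used (`maven-dependency-plugin`'s `unpack`, `build-helper-maven-plugin`'s `add-source`) are real, but the artifact coordinates and paths below are assumptions for illustration:

```xml
<!-- Sketch only: artifactId and output path are assumptions, not the
     actual Nessie build. -->
<plugin>
  <groupId>org.apache.maven.plugins</groupId>
  <artifactId>maven-dependency-plugin</artifactId>
  <executions>
    <execution>
      <id>unpack-common-sources</id>
      <phase>generate-sources</phase>
      <goals><goal>unpack</goal></goals>
      <configuration>
        <artifactItems>
          <artifactItem>
            <groupId>org.projectnessie</groupId>
            <artifactId>nessie-spark-extensions-common</artifactId>
            <version>${project.version}</version>
            <classifier>sources</classifier>
            <outputDirectory>${project.build.directory}/common-sources</outputDirectory>
          </artifactItem>
        </artifactItems>
      </configuration>
    </execution>
  </executions>
</plugin>
<plugin>
  <groupId>org.codehaus.mojo</groupId>
  <artifactId>build-helper-maven-plugin</artifactId>
  <executions>
    <execution>
      <id>add-common-sources</id>
      <phase>generate-sources</phase>
      <goals><goal>add-source</goal></goals>
      <configuration>
        <sources>
          <source>${project.build.directory}/common-sources</source>
        </sources>
      </configuration>
    </execution>
  </executions>
</plugin>
```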
@snazy I am not sure why we need this step:

> in spark-extensions and spark32-extensions poms unpack the source of the new "spark-extensions-common" module to a separate folder underneath ${project.build.directory} and add that as a source folder for scalac

Why can't both `spark-extensions` and `spark32-extensions` use the new module as a dependency?

Also, what about the `spark-extensions-base` module? Shouldn't we put the common code there?
You need to unpack it to extract the common source files.
Yes, but why can't we share the code as a normal Maven dependency? This is in the same repository, so why do we need to package that new module as a `jar`?
Don't take the idea "package as a jar" as a strict requirement - it was more an initial thought.
The goal is to have mostly no duplicated code. Maven builds (via CI + locally) should still work, of course. The tricky part is that IntelliJ must still be able to import a fresh clone of the source tree w/o running Maven.
My findings: we have two sources of duplication.

The first one is in the `plans.logical` package of both `spark-extensions` and `spark-3.2-extensions`. All the `*Command`s are case classes built with types imported directly from the `spark-catalyst` engine, for example:

```scala
import org.apache.spark.sql.types.{DataTypes, Metadata, StructField, StructType}
```

Here is an example `ShowReferenceCommand` from `spark-3.2`:
```scala
case class ShowReferenceCommand(val catalog: Option[String])
    extends LeafCommand {

  override lazy val output: Seq[Attribute] = new StructType(
    Array[StructField](
      StructField("refType", DataTypes.StringType, nullable = false, Metadata.empty),
      StructField("name", DataTypes.StringType, nullable = false, Metadata.empty),
      StructField("hash", DataTypes.StringType, nullable = false, Metadata.empty)
    )
  ).toAttributes

  override def simpleString(maxFields: Int): String = {
    s"ShowReference"
  }
}
```
and this is `AssignReferenceCommand` from `spark-3.1`:

```scala
case class AssignReferenceCommand(
    reference: String,
    isBranch: Boolean,
    toRefName: Option[String],
    toHash: Option[String],
    catalog: Option[String]
) extends Command {

  override lazy val output: Seq[Attribute] = new StructType(
    Array[StructField](
      StructField("refType", DataTypes.StringType, nullable = false, Metadata.empty),
      StructField("name", DataTypes.StringType, nullable = false, Metadata.empty),
      StructField("hash", DataTypes.StringType, nullable = false, Metadata.empty)
    )
  ).toAttributes

  override def simpleString(maxFields: Int): String = {
    s"AssignReference ${reference}"
  }
}
```
They look very similar, but all the types used there come from a different library (Spark 3.1 vs. Spark 3.2), so we cannot make them common. We could extract a trait for each command, for example:

```scala
trait AssignReferenceCommandTrait {
  def reference: String
  def isBranch: Boolean
  def toRefName: Option[String]
  def toHash: Option[String]
  def catalog: Option[String]
}
```

and make the commands extend it, but that would not remove any duplication (see the sketch below).
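To make that concrete, here is a hypothetical, self-contained sketch reusing the trait above; `LogicalCommand` is an invented stand-in for the version-specific Spark base class, not a real Spark type:

```scala
// Hypothetical stand-in for org.apache.spark.sql.catalyst.plans.logical.Command,
// which differs between the Spark 3.1 and Spark 3.2 dependencies.
abstract class LogicalCommand

// Even with the shared trait, the case class - all fields included - must be
// written out again in both spark-extensions and spark32-extensions, because
// the base class and the schema/attribute types are version-specific:
case class AssignReferenceCommand(
    reference: String,
    isBranch: Boolean,
    toRefName: Option[String],
    toHash: Option[String],
    catalog: Option[String]
) extends LogicalCommand
    with AssignReferenceCommandTrait
```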
The second source of duplication: the `*Exec` classes. That one is already handled by the `spark-extensions-base` module and all the `Base*` classes.

So to summarize: at this point I don't see a clear way to reduce duplication in those modules. It is also worth noting that the duplication comes from case classes that are anemic (they carry no logic). Any thoughts, comments? (@snazy @nastra)
Oh man, I feared that it's rather impossible to reduce code duplication and have proper IDE + build-tool support.
I suspect we have to bite the bullet and live with the duplicated code.
Yeah, I don't have any better suggestions - I already tried to reduce code duplication when working on this initially and couldn't come up with anything better than what we have.
Okay - then let's keep this one open. I've added a new `later` label for those kinds of issues.
Today, while analyzing an issue, I saw that even though `BaseCreateReferenceExec` is the base class for both spark-3.1 and spark-3.2, the spark-3.1 subclass overrides both of `BaseCreateReferenceExec`'s methods with exactly the same code that `BaseCreateReferenceExec` already contains. I just removed `BaseCreateReferenceExec` from the hierarchy and extended `NessieExec` directly, and the tests passed for both. So, should we move `BaseCreateReferenceExec` to spark-3.2, as it is not really a base class?
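In other words, the pattern looked like this hypothetical, self-contained sketch (names invented, not the actual Nessie classes):

```scala
// Invented names for illustration only.
trait Exec {
  protected def runInternal(): Seq[String]
}

abstract class BaseExec extends Exec {
  override protected def runInternal(): Seq[String] = Seq("created ref")
}

// The subclass overrode the method with exactly the same body, so
// extending BaseExec added no reuse over extending Exec directly:
class CreateReferenceExec31 extends BaseExec {
  override protected def runInternal(): Seq[String] = Seq("created ref")
}
```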
@ajantha-bhat I forgot to remove the implementation from those classes after I moved it to the base classes. https://github.com/projectnessie/nessie/pull/3673 cleans this up
@nastra: perfect, I was also doing the same changes offline, you beat me in the race 😂
I suspect there's not much more we can do here considering that IDEs + Gradle need to work.