[DON'T MERGE] Try to replace all `json4s` with `Jackson`
cc @JoshRosen FYI
@JoshRosen I am not sure whether this draft is helpful for future work, but I hope it is useful to a certain extent.
This draft PR does not focus on forward compatibility, so I replaced json4s's `JValue` with Jackson's `JsonNode` everywhere: in method parameter types, in return types, and in the objects used to serialize and deserialize JSON. The test code still uses json4s to verify compatibility, and all tests should pass. A sketch of what such a signature migration looks like is shown below.
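For illustration only, a minimal sketch of the migration pattern on a hypothetical `jsonValue` method (the field names here are made up, not the PR's actual code):

```scala
import com.fasterxml.jackson.databind.JsonNode
import com.fasterxml.jackson.databind.node.JsonNodeFactory

// Before: def jsonValue: org.json4s.JValue, built with JObject/JString/JInt.
// After: the same tree is built with Jackson's node API instead.
def jsonValue: JsonNode = {
  val node = JsonNodeFactory.instance.objectNode()
  node.put("type", "example") // put returns the ObjectNode for chaining
  node.put("version", 1)
  node
}
```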
The change involves five modules: `core`, `catalyst`, `sql`, `mllib`, and `kafka`. Except that `sql` directly uses `Row.jsonValue` from `catalyst`, the modules are relatively independent.
For the exported HTTP APIs (`org.apache.spark.deploy.JsonProtocol`), `JsonNode` is also used, and the `jsonResponderToServlet` method in `JettyUtils` was adapted accordingly. Of course, we could also return a custom `JsonResult` object instead of relying on `JsonNode`; see the sketch below.
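A minimal sketch of the adapted responder shape, assuming plain servlet wiring (this is not Spark's actual `JettyUtils` code):

```scala
import com.fasterxml.jackson.databind.{JsonNode, ObjectMapper}
import javax.servlet.http.{HttpServlet, HttpServletRequest, HttpServletResponse}

// Wrap a JsonNode-producing responder in a servlet; the ObjectMapper
// serializes the node tree to the response body as JSON text.
def jsonResponderToServlet(responder: HttpServletRequest => JsonNode): HttpServlet = {
  val mapper = new ObjectMapper()
  new HttpServlet {
    override def doGet(req: HttpServletRequest, resp: HttpServletResponse): Unit = {
      resp.setContentType("application/json;charset=utf-8")
      resp.getWriter.println(mapper.writeValueAsString(responder(req)))
    }
  }
}
```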
One problem found during the rewrite: json4s has `JNothing`, but Jackson has no direct equivalent. I used Jackson's `MissingNode` instead and added special handling, for example:
```scala
// Mirror json4s, which drops JNothing fields on rendering: only attach
// the child when it is not Jackson's MissingNode sentinel.
if (!jsonNode.isMissingNode) {
  node.set[JsonNode](name, jsonNode)
}
```
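A self-contained version of that pattern, with illustrative names (assumes only jackson-databind):

```scala
import com.fasterxml.jackson.databind.JsonNode
import com.fasterxml.jackson.databind.node.{JsonNodeFactory, MissingNode}

val factory = JsonNodeFactory.instance
val node = factory.objectNode()

// Attach a field only when the child is present, so MissingNode
// behaves like json4s's JNothing (i.e. the field is simply omitted).
def setIfPresent(name: String, child: JsonNode): Unit = {
  if (!child.isMissingNode) node.set[JsonNode](name, child)
}

setIfPresent("present", factory.textNode("kept"))
setIfPresent("absent", MissingNode.getInstance()) // omitted, like JNothing
println(node) // {"present":"kept"}
```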
MiMa flags the following binary incompatibilities:
```scala
ProblemFilters.exclude[IncompatibleMethTypeProblem]("org.apache.spark.deploy.DeployMessages#RequestExecutors.apply"),
ProblemFilters.exclude[IncompatibleMethTypeProblem]("org.apache.spark.sql.types.DataType#JSortedObject.unapplySeq"),
ProblemFilters.exclude[DirectMissingMethodProblem]("org.apache.spark.sql.expressions.MutableAggregationBuffer.jsonValue"),
ProblemFilters.exclude[DirectMissingMethodProblem]("org.apache.spark.sql.streaming.SafeJsonSerializer.safeMapToJValue"),
ProblemFilters.exclude[DirectMissingMethodProblem]("org.apache.spark.sql.streaming.SafeJsonSerializer.safeDoubleToJValue"),
ProblemFilters.exclude[IncompatibleMethTypeProblem]("org.apache.spark.ml.param.FloatParam.jValueDecode"),
ProblemFilters.exclude[IncompatibleResultTypeProblem]("org.apache.spark.ml.param.FloatParam.jValueEncode"),
ProblemFilters.exclude[IncompatibleMethTypeProblem]("org.apache.spark.mllib.tree.model.TreeEnsembleModel#SaveLoadV1_0.readMetadata")
```
This is similar to the problem described in SPARK-39658: `JValue` is used directly in public APIs. The main problem seems to be that `JValue` appears as a parameter or return type, so changing it breaks binary compatibility. I'm not sure whether we are willing to change these APIs.
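To make the breakage concrete, here is a schematic (not an actual Spark signature) of the kind of change MiMa is flagging:

```scala
import com.fasterxml.jackson.databind.JsonNode
import org.json4s.JValue

// Callers compiled against the old API link to a method whose JVM
// descriptor mentions org/json4s/JValue...
trait OldFloatParam { def jValueDecode(json: JValue): Float }

// ...so switching the parameter type to JsonNode changes the descriptor:
// MiMa reports IncompatibleMethTypeProblem, and previously compiled
// bytecode fails to link even though the source change looks mechanical.
trait NewFloatParam { def jValueDecode(json: JsonNode): Float }
```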
If the above is acceptable, I think Spark may be able to replace json4s with Jackson.
2ea9dcf adds a simple benchmark; I will update the results later.
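(The actual benchmark is in the commit; as a rough stand-in, a naive parse comparison could look like the sketch below, where the input document and iteration count are arbitrary.)

```scala
import com.fasterxml.jackson.databind.ObjectMapper
import org.json4s.jackson.JsonMethods

// Crude wall-clock timing of repeatedly parsing the same document.
def time(label: String)(body: => Unit): Unit = {
  val start = System.nanoTime()
  body
  println(s"$label: ${(System.nanoTime() - start) / 1e6} ms")
}

val json = """{"id": 1, "values": [1, 2, 3], "name": "spark"}"""
val mapper = new ObjectMapper()
val n = 100000

time("json4s parse")(for (_ <- 1 to n) JsonMethods.parse(json))
time("jackson readTree")(for (_ <- 1 to n) mapper.readTree(json))
```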
I'm happy to see this PR!
Jackson is much safer than json4s.
@LuciferYang Could you please check jsoniter-scala with your benchmarks, to see whether it is worth having an official integration with Spark ;)
@plokhotnyuk Let me learn about jsoniter-scala first
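For reference, a minimal jsoniter-scala round trip looks roughly like this (the `Event` type is hypothetical; assumes the jsoniter-scala core and macros artifacts are on the classpath):

```scala
import com.github.plokhotnyuk.jsoniter_scala.core._
import com.github.plokhotnyuk.jsoniter_scala.macros._

// The codec is derived at compile time, so there is no runtime reflection.
case class Event(id: Long, name: String)
implicit val eventCodec: JsonValueCodec[Event] = JsonCodecMaker.make

val event = readFromArray[Event]("""{"id":1,"name":"spark"}""".getBytes("UTF-8"))
val bytes = writeToArray(event)
println(new String(bytes, "UTF-8")) // {"id":1,"name":"spark"}
```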
I was wrong; it seems that jackson-core is vulnerable too: https://github.com/FasterXML/jackson-module-scala/issues/609