
[DON'T MERGE] Try to replace all `json4s` with `Jackson`

Open LuciferYang opened this issue 2 years ago • 6 comments

What changes were proposed in this pull request?

Why are the changes needed?

Does this PR introduce any user-facing change?

How was this patch tested?

LuciferYang avatar Aug 22 '22 02:08 LuciferYang

cc @JoshRosen FYI

HyukjinKwon avatar Aug 22 '22 03:08 HyukjinKwon

@JoshRosen I am not sure whether this draft is helpful for future work, but I hope it is useful to some extent.

This draft PR does not focus on forward compatibility, so I replaced all Json4s `JValue` usages with Jackson's `JsonNode`, including method parameter types, return types, and the objects used to serialize and deserialize JSON. The test code still uses json4s to check compatibility, and all tests should pass.

The change involves 5 modules: core, catalyst, sql, mllib, and kafka. Apart from sql, which directly uses `Row.jsonValue` from catalyst, the modules are relatively independent.

For the exported HTTP APIs (`org.apache.spark.deploy.JsonProtocol`), `JsonNode` is also used, and the `jsonResponderToServlet` method in `JettyUtils` is adapted accordingly. Of course, we could also return a custom `JsonResult` object instead of relying on `JsonNode`.

One problem found during the rewrite is that Json4s has `JNothing` but Jackson has no direct equivalent, so I used Jackson's `MissingNode` instead and handled it specially, for example:

// Skip MissingNode fields (the Json4s JNothing case) so they are
// omitted from the serialized output, as JNothing would be.
if (!jsonNode.isMissingNode) {
  node.set[JsonNode](name, jsonNode)
}
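A minimal self-contained sketch of this workaround, assuming jackson-databind on the classpath; the object and method names here are illustrative, not from the PR:

```scala
import com.fasterxml.jackson.databind.{JsonNode, ObjectMapper}
import com.fasterxml.jackson.databind.node.{MissingNode, ObjectNode}

object MissingNodeSketch {
  val mapper = new ObjectMapper()

  // Json4s silently drops JNothing when rendering a JObject; with Jackson
  // we must skip MissingNode explicitly before calling set().
  def setIfPresent(obj: ObjectNode, name: String, value: JsonNode): ObjectNode = {
    if (!value.isMissingNode) obj.set[JsonNode](name, value)
    obj
  }

  def demo(): String = {
    val obj = mapper.createObjectNode()
    setIfPresent(obj, "present", mapper.getNodeFactory.textNode("x"))
    setIfPresent(obj, "absent", MissingNode.getInstance())
    mapper.writeValueAsString(obj) // the MissingNode field is omitted
  }
}
```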

MiMa flags the following forward incompatibilities, which require these exclusions:

 ProblemFilters.exclude[IncompatibleMethTypeProblem]("org.apache.spark.deploy.DeployMessages#RequestExecutors.apply"),
 ProblemFilters.exclude[IncompatibleMethTypeProblem]("org.apache.spark.sql.types.DataType#JSortedObject.unapplySeq"),
 ProblemFilters.exclude[DirectMissingMethodProblem]("org.apache.spark.sql.expressions.MutableAggregationBuffer.jsonValue"),
 ProblemFilters.exclude[DirectMissingMethodProblem]("org.apache.spark.sql.streaming.SafeJsonSerializer.safeMapToJValue"),
 ProblemFilters.exclude[DirectMissingMethodProblem]("org.apache.spark.sql.streaming.SafeJsonSerializer.safeDoubleToJValue"),
 ProblemFilters.exclude[IncompatibleMethTypeProblem]("org.apache.spark.ml.param.FloatParam.jValueDecode"),
 ProblemFilters.exclude[IncompatibleResultTypeProblem]("org.apache.spark.ml.param.FloatParam.jValueEncode"),
 ProblemFilters.exclude[IncompatibleMethTypeProblem]("org.apache.spark.mllib.tree.model.TreeEnsembleModel#SaveLoadV1_0.readMetadata")

This is similar to the problem described in SPARK-39658: `JValue` is used directly in the public API.
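To illustrate why MiMa flags these (a schematic with stand-in types, not the actual Spark code): changing a public method's parameter or return type from a json4s type to a Jackson type is a binary-incompatible change, even if calling code recompiles cleanly.

```scala
// Stand-in types; in Spark these would be org.json4s.JValue and
// com.fasterxml.jackson.databind.JsonNode.
class JValue
class JsonNode

// Users compiled against the old trait reference a method returning JValue...
trait OldApi { def jsonValue: JValue }

// ...so changing the return type to JsonNode breaks binary compatibility,
// which MiMa reports as an IncompatibleResultTypeProblem.
trait NewApi { def jsonValue: JsonNode }
```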

LuciferYang avatar Aug 22 '22 03:08 LuciferYang

It seems that the main problem is that `JValue` is used as an input parameter or a return value in public APIs, which causes some forward incompatibility. I'm not sure whether we would be willing to change these APIs.

If the above is not considered a blocker, I think Spark could replace Json4s with Jackson.

LuciferYang avatar Aug 22 '22 03:08 LuciferYang

2ea9dcf adds a simple benchmark; I will update the results later
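The referenced commit is not shown here, but a minimal shape such a micro-benchmark could take (standard library only; the JSON workload, warm-up, and JMH-style statistics are omitted, so treat it as a sketch rather than a fair comparison):

```scala
object MiniBench {
  // Average wall-clock nanoseconds per call of `body` over `iters` iterations.
  // A real benchmark would use JMH (or Spark's BenchmarkBase) with warm-up
  // rounds; this only shows the basic measurement loop.
  def avgNanos(iters: Int)(body: => Unit): Long = {
    require(iters > 0)
    val start = System.nanoTime()
    var i = 0
    while (i < iters) { body; i += 1 }
    (System.nanoTime() - start) / iters
  }
}
```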

LuciferYang avatar Aug 24 '22 04:08 LuciferYang

I'm happy to see this PR!

Jackson is much safer than json4s

@LuciferYang Could you please check jsoniter-scala with your benchmarks, to see whether it is worth having an official integration with Spark ;)

plokhotnyuk avatar Sep 07 '22 14:09 plokhotnyuk

@plokhotnyuk Let me learn about jsoniter-scala first

LuciferYang avatar Sep 09 '22 06:09 LuciferYang

Jackson is much safer than json4s

I was wrong, it seems that jackson-core is vulnerable too: https://github.com/FasterXML/jackson-module-scala/issues/609

plokhotnyuk avatar Oct 21 '22 09:10 plokhotnyuk