
[DON'T MERGE] Try to replace all `json4s` with `Jackson`

Open LuciferYang opened this issue 2 years ago • 6 comments

What changes were proposed in this pull request?

Why are the changes needed?

Does this PR introduce any user-facing change?

How was this patch tested?

LuciferYang avatar Aug 22 '22 02:08 LuciferYang

cc @JoshRosen FYI

HyukjinKwon avatar Aug 22 '22 03:08 HyukjinKwon

@JoshRosen I am not sure whether this draft is helpful for future work, but I hope it is useful to some extent.

This draft PR does not focus on forward compatibility, so I replaced all Json4s `JValue` usages with Jackson's `JsonNode`, including method parameter types, return types, and the objects used to serialize and deserialize JSON. The test code still uses json4s to check compatibility, and all tests should pass.

The change involves 5 modules: core, catalyst, sql, mllib, and kafka. Apart from sql, which directly uses `Row.jsonValue` from catalyst, the modules are relatively independent.

For the exported HTTP APIs (`org.apache.spark.deploy.JsonProtocol`), `JsonNode` is also used, and the `jsonResponderToServlet` method in `JettyUtils` is adapted accordingly. Of course, we could also return a custom `JsonResult` object instead of relying on `JsonNode`.

One problem found during the rewrite is that Json4s has `JNothing` but Jackson has no direct equivalent, so I used Jackson's `MissingNode` instead and handled it specially, for example:

// Skip MissingNode fields (the Json4s JNothing case) so they are
// omitted from the serialized output, as JNothing would be.
if (!jsonNode.isMissingNode) {
  node.set[JsonNode](name, jsonNode)
}
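A minimal self-contained sketch of this workaround, assuming jackson-databind on the classpath; the object and method names here are illustrative, not from the PR:

```scala
import com.fasterxml.jackson.databind.{JsonNode, ObjectMapper}
import com.fasterxml.jackson.databind.node.{MissingNode, ObjectNode}

object MissingNodeSketch {
  val mapper = new ObjectMapper()

  // Json4s silently drops JNothing when rendering a JObject; with Jackson
  // we must skip MissingNode explicitly before calling set().
  def setIfPresent(obj: ObjectNode, name: String, value: JsonNode): ObjectNode = {
    if (!value.isMissingNode) obj.set[JsonNode](name, value)
    obj
  }

  def demo(): String = {
    val obj = mapper.createObjectNode()
    setIfPresent(obj, "present", mapper.getNodeFactory.textNode("x"))
    setIfPresent(obj, "absent", MissingNode.getInstance())
    mapper.writeValueAsString(obj) // the MissingNode field is omitted
  }
}
```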

MiMa flags the following forward incompatibilities, which require these exclusions:

 ProblemFilters.exclude[IncompatibleMethTypeProblem]("org.apache.spark.deploy.DeployMessages#RequestExecutors.apply"),
 ProblemFilters.exclude[IncompatibleMethTypeProblem]("org.apache.spark.sql.types.DataType#JSortedObject.unapplySeq"),
 ProblemFilters.exclude[DirectMissingMethodProblem]("org.apache.spark.sql.expressions.MutableAggregationBuffer.jsonValue"),
 ProblemFilters.exclude[DirectMissingMethodProblem]("org.apache.spark.sql.streaming.SafeJsonSerializer.safeMapToJValue"),
 ProblemFilters.exclude[DirectMissingMethodProblem]("org.apache.spark.sql.streaming.SafeJsonSerializer.safeDoubleToJValue"),
 ProblemFilters.exclude[IncompatibleMethTypeProblem]("org.apache.spark.ml.param.FloatParam.jValueDecode"),
 ProblemFilters.exclude[IncompatibleResultTypeProblem]("org.apache.spark.ml.param.FloatParam.jValueEncode"),
 ProblemFilters.exclude[IncompatibleMethTypeProblem]("org.apache.spark.mllib.tree.model.TreeEnsembleModel#SaveLoadV1_0.readMetadata")

This is similar to the problem described in SPARK-39658: `JValue` is used directly in the public API.
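To illustrate why MiMa flags these (a schematic with stand-in types, not the actual Spark code): changing a public method's parameter or return type from a json4s type to a Jackson type is a binary-incompatible change, even if calling code recompiles cleanly.

```scala
// Stand-in types; in Spark these would be org.json4s.JValue and
// com.fasterxml.jackson.databind.JsonNode.
class JValue
class JsonNode

// Users compiled against the old trait reference a method returning JValue...
trait OldApi { def jsonValue: JValue }

// ...so changing the return type to JsonNode breaks binary compatibility,
// which MiMa reports as an IncompatibleResultTypeProblem.
trait NewApi { def jsonValue: JsonNode }
```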

LuciferYang avatar Aug 22 '22 03:08 LuciferYang

It seems that the main problem is that `JValue` is used as an input parameter or a return value in public APIs, which causes some forward incompatibility. I'm not sure whether we would be willing to change these APIs.

If the above is not considered a blocker, I think Spark could replace Json4s with Jackson.

LuciferYang avatar Aug 22 '22 03:08 LuciferYang

2ea9dcf adds a simple benchmark; I will update the results later
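The referenced commit is not shown here, but a minimal shape such a micro-benchmark could take (standard library only; the JSON workload, warm-up, and JMH-style statistics are omitted, so treat it as a sketch rather than a fair comparison):

```scala
object MiniBench {
  // Average wall-clock nanoseconds per call of `body` over `iters` iterations.
  // A real benchmark would use JMH (or Spark's BenchmarkBase) with warm-up
  // rounds; this only shows the basic measurement loop.
  def avgNanos(iters: Int)(body: => Unit): Long = {
    require(iters > 0)
    val start = System.nanoTime()
    var i = 0
    while (i < iters) { body; i += 1 }
    (System.nanoTime() - start) / iters
  }
}
```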

LuciferYang avatar Aug 24 '22 04:08 LuciferYang

I'm happy to see this PR!

Jackson is much safer than json4s

@LuciferYang Could you please check jsoniter-scala with your benchmarks, to see whether it is worth having an official integration with Spark ;)

plokhotnyuk avatar Sep 07 '22 14:09 plokhotnyuk

@plokhotnyuk Let me learn about jsoniter-scala first

LuciferYang avatar Sep 09 '22 06:09 LuciferYang

Jackson is much safer than json4s

I was wrong, it seems that jackson-core is vulnerable too: https://github.com/FasterXML/jackson-module-scala/issues/609

plokhotnyuk avatar Oct 21 '22 09:10 plokhotnyuk