Further discussion of schema-based parsing.

Open ZhaiMo15 opened this issue 1 year ago • 2 comments

Good to know that schema-based parsing has been implemented! I have two more questions:

  1. Line 54 of src/main/java/org/simdjson/SchemaBasedJsonIterator.java says: "Lists at the root are not supported. Consider using an array instead." So the current version cannot handle that case? For example, github_events.json.
  2. Can schema-based parsing be made more powerful? In the current version, we need to explicitly tell the parser which class to use:
SimdJsonTwitter twitter = simdJsonParser.parse(buffer, buffer.length, SimdJsonTwitter.class);

record SimdJsonUser(boolean default_profile, String screen_name) {
}

record SimdJsonStatus(SimdJsonUser user) {
}

record SimdJsonTwitter(List<SimdJsonStatus> statuses) {
}

In Jackson, we can use readValue to parse JSON into a Map (or List), so we don't need to define lots of records when the class is complicated. In short, something like Object twitter = simdJsonParser.parse(buffer, buffer.length, Map.class);

ZhaiMo15 avatar May 11 '24 07:05 ZhaiMo15

As line 54 in src/main/java/org/simdjson/SchemaBasedJsonIterator.java says: "Lists at the root are not supported. Consider using an array instead.". So current version cannot handle the case? For example, github_events.json.

The current version can handle it:

GithubEvent[] events = parser.parse(json, json.length, GithubEvent[].class);

However, we can try adding support for lists at the root. The reason I haven't done this yet is that, in Java, it's a bit challenging to pass information about a generic type parameter. We cannot do something like:

List<GithubEvent> events = parser.parse(json, json.length, List<GithubEvent>.class);

We can consider introducing an API like this:

List<GithubEvent> events = parser.parseList(json, json.length, GithubEvent.class);
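
To illustrate why passing only the element class would be enough, here is a minimal sketch of how a parseList-style helper could be layered on top of the array-based call shown above. It is not simdjson-java's implementation: ArrayParser is a hypothetical stand-in for the parser's parse(bytes, length, Class) call, and the fake parser in main exists only so the sketch runs without the library.

```java
import java.lang.reflect.Array;
import java.util.Arrays;
import java.util.List;

final class ParseListSketch {

    /** Hypothetical stand-in for the parse(bytes, length, Class) call used in this thread. */
    interface ArrayParser {
        <A> A parse(byte[] json, int length, Class<A> type);
    }

    /** Callers pass the element class; the array class is derived from it at runtime. */
    static <T> List<T> parseList(ArrayParser parser, byte[] json, int length, Class<T> elementType) {
        // List<T>.class does not exist because of type erasure, but T[].class can be
        // obtained reflectively from Class<T>. Assumes elementType is not a primitive class.
        @SuppressWarnings("unchecked")
        Class<T[]> arrayType = (Class<T[]>) Array.newInstance(elementType, 0).getClass();
        T[] items = parser.parse(json, length, arrayType);
        // Fixed-size list view backed by the parsed array (no extra copy).
        return Arrays.asList(items);
    }

    public static void main(String[] args) {
        // Fake parser for demonstration only: ignores the input and returns a fixed array.
        ArrayParser fake = new ArrayParser() {
            @Override
            public <A> A parse(byte[] json, int length, Class<A> type) {
                return type.cast(new String[] {"PushEvent", "ForkEvent"});
            }
        };
        List<String> events = parseList(fake, new byte[0], 0, String.class);
        System.out.println(events);
    }
}
```

Wrapping the array is only one option; a real implementation might instead fill an ArrayList directly while iterating over the root array.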

Can the schema-based parsing be more powerful?

I'm open to that. However, the power of schema-based parsing is that we can skip parsing fields that are not included in the schema. For a Map, we would need to go through all fields. It would be interesting to see how this affects performance.

Please let me know what you think about it. Also, would you mind sharing if you use or consider using simdjson-java in any project? That would be very valuable information, especially if you could describe your use case (how much data you process, what your expectations are regarding performance, etc.).

piotrrzysko avatar May 15 '24 04:05 piotrrzysko

However, the power of schema-based parsing is that we can skip parsing fields that are not included in the schema. For a Map, we would need to go through all fields.

Thanks, I get your point.

I'm asking about Map and List because my current project uses UDFJson.java (https://github.com/apache/hive/blob/master/ql/src/java/org/apache/hadoop/hive/ql/udf/UDFJson.java). I'd like to replace all of its Jackson usage with simdjson.

Before schema-based parsing was available, I used simdJsonParser.parse() to parse the JSON into a JsonValue and then used an iterator to build a map, to match objectMapper.readValue(jsonString, MAP_TYPE);. However, because the data is traversed twice (first during parsing, then again while building the map), the performance is bad.

Therefore, I believe that schema-based parsing into a Map, even though it would go through all fields, would be faster than Jackson.
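
For context, the second pass I described looks roughly like the sketch below. The Node, Obj, Arr, and Scalar types are hypothetical stand-ins for an already-parsed JSON tree, not simdjson-java's JsonValue API; the point is only that every value is visited a second time after parsing.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

final class SecondPassSketch {

    // Hypothetical stand-ins for the tree produced by the first pass (parsing).
    interface Node {}
    record Obj(Map<String, Node> fields) implements Node {}
    record Arr(List<Node> items) implements Node {}
    record Scalar(Object value) implements Node {}

    /** Second pass: walk the whole tree again and build plain Map/List values. */
    static Object toJava(Node node) {
        if (node instanceof Obj obj) {
            Map<String, Object> out = new LinkedHashMap<>();
            obj.fields().forEach((key, value) -> out.put(key, toJava(value)));
            return out;
        }
        if (node instanceof Arr arr) {
            List<Object> out = new ArrayList<>(arr.items().size());
            for (Node item : arr.items()) {
                out.add(toJava(item));
            }
            return out;
        }
        // Anything else is a leaf: string, number, boolean, or null.
        return ((Scalar) node).value();
    }

    public static void main(String[] args) {
        Node tree = new Obj(Map.of(
                "screen_name", new Scalar("simd"),
                "ids", new Arr(List.of(new Scalar(1L), new Scalar(2L)))));
        System.out.println(toJava(tree));
    }
}
```

A parser that builds the Map directly while scanning the input would avoid this extra traversal and the intermediate tree.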

ZhaiMo15 avatar May 15 '24 09:05 ZhaiMo15