mimir Extra validators

Small edits with far-reaching effects.

The big edits are:

New Type matchers
TypeRegistry is now a class rather than a global
There's now a distinction between BaseType (which most of the system knows how to handle) and Type (which also includes TUser, which needs TypeRegistry to handle).
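The BaseType / Type split can be pictured with a small sketch. This is a hypothetical Java rendering for illustration only, not the project's actual code: the enum constants, the `register` and `rootType` methods, and the example type name are all assumptions. Only the names BaseType, Type, TUser and TypeRegistry come from the description above.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical sketch: every Type is either a BaseType or a TUser.
interface Type { }

// Types that most of the system can handle directly (constants are illustrative).
enum BaseType implements Type { INT, FLOAT, STRING }

// A user-defined type, identified only by name; it needs a registry to be resolved.
final class TUser implements Type {
    final String name;

    TUser(String name) {
        this.name = name;
    }
}

// An ordinary class rather than a global: callers pass an instance around explicitly.
final class TypeRegistry {
    private final Map<String, BaseType> userTypes = new HashMap<>();

    // Assumed API: associate a user type name with the BaseType that backs it.
    TypeRegistry register(String name, BaseType root) {
        userTypes.put(name, root);
        return this;
    }

    // Resolve any Type to a BaseType; only the TUser case consults the registry.
    BaseType rootType(Type t) {
        if (t instanceof BaseType) {
            return (BaseType) t;
        }
        TUser user = (TUser) t;
        BaseType root = userTypes.get(user.name);
        if (root == null) {
            throw new IllegalArgumentException("Unknown user type: " + user.name);
        }
        return root;
    }
}
```

Under these assumptions, code that only ever sees BaseType needs no registry at all, while code that accepts the wider Type has to be handed a TypeRegistry instance.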

Also cut out the OperatorTransform global database instance hack.

Dec 17 '18 05:12 okennedy

As line 54 in src/main/java/org/simdjson/SchemaBasedJsonIterator.java says: "Lists at the root are not supported. Consider using an array instead." So the current version cannot handle this case? For example, github_events.json.

The current version can handle it:

GithubEvent[] events = parser.parse(json, json.length, GithubEvent[].class);

However, we can try adding support for lists at the root. The reason I haven't done this yet is that, in Java, it's a bit challenging to pass information about a generic type parameter. We cannot do something like:

List<GithubEvent> events = parser.parse(json, json.length, List<GithubEvent>.class);

We can consider introducing an API like this:

List<GithubEvent> events = parser.parseList(json, json.length, GithubEvent.class);
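For context on why the generic parameter is hard to pass: `List<GithubEvent>.class` does not compile because of type erasure, but an anonymous subclass does keep its generic superclass in reflection metadata. Below is a minimal sketch of that "super type token" idiom using only the standard library. `TypeToken` and `listElementClass` are hypothetical names for illustration and are not part of simdjson-java; the `parseList` proposal above sidesteps the problem by taking the element class directly.

```java
import java.lang.reflect.ParameterizedType;
import java.lang.reflect.Type;
import java.util.List;

// Hypothetical helper (not a simdjson-java class): captures a full generic type.
abstract class TypeToken<T> {
    private final Type type;

    protected TypeToken() {
        // The anonymous subclass created at the call site records its generic
        // superclass, e.g. TypeToken<List<GithubEvent>>, which survives erasure.
        ParameterizedType superType = (ParameterizedType) getClass().getGenericSuperclass();
        this.type = superType.getActualTypeArguments()[0];
    }

    Type getType() {
        return type;
    }

    // Assumes the captured type is List<E> with a concrete, non-generic class E.
    Class<?> listElementClass() {
        ParameterizedType listType = (ParameterizedType) type;
        return (Class<?>) listType.getActualTypeArguments()[0];
    }
}

class TypeTokenDemo {
    // Placeholder standing in for the user's schema class.
    static class GithubEvent { }

    public static void main(String[] args) {
        // The trailing { } creates the anonymous subclass that carries the type.
        TypeToken<List<GithubEvent>> token = new TypeToken<List<GithubEvent>>() { };
        System.out.println(token.getType());
        System.out.println(token.listElementClass());
    }
}
```

The trade-off is mostly ergonomic: a token-based overload would handle arbitrary nesting, whereas `parseList(json, json.length, GithubEvent.class)` is simpler to call and needs no reflection on generic signatures.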

Can the schema-based parsing be more powerful?

I'm open to that. However, the power of schema-based parsing is that we can skip parsing fields that are not included in the schema. For a Map, we would need to go through all fields. It would be interesting to see how this affects performance.

Please let me know what you think about it. Also, would you mind sharing if you use or consider using simdjson-java in any project? That would be very valuable information, especially if you could describe your use case (how much data you process, what your expectations are regarding performance, etc.).

May 15 '24 04:05 piotrrzysko

However, the power of schema-based parsing is that we can skip parsing fields that are not included in the schema. For a Map, we would need to go through all fields.

Thx. I get your point.

I'm asking about Map and List because my current project uses UDFJson.java (https://github.com/apache/hive/blob/master/ql/src/java/org/apache/hadoop/hive/ql/udf/UDFJson.java). I'd like to convert all of its Jackson usage to simdjson.

Before schema-based parsing, I used simdJsonParser.parse() to parse the JSON into a JsonValue and then used an iterator to build a map, to match objectMapper.readValue(jsonString, MAP_TYPE). However, because of the two loops (the first in parsing, the second in building the map), the performance is bad.

Therefore, I believe schema-based parsing to a Map, even though it would go through all fields, would be faster than Jackson.

May 15 '24 09:05 ZhaiMo15
