simdjson
Detect if any keys have non-ASCII characters
Non-ASCII keys seem pretty weird, but I guess they must be out there. Normally there are no keys like this, so you can deal with strings in a fast and simple way, e.g. using strcmp without having to canonicalize both strings being compared. So it would be nice if there were some indicator that somewhere in the JSON document a key contains a non-ASCII character, so that parsers built on top of simdjson know to be more careful. It's easy to vectorize the check for non-ASCII characters. I don't know what priority this should have, or the exact API (a simple bool?), but I thought I would bring it up.
I think that your issue is valid. It is related to some existing issues like...
https://github.com/lemire/simdjson/issues/184
The main idea is that, often, keys are short and ASCII.
Whether there is any sense in having such a flag that users of the library can access... I do not know... but currently, simdjson does not attempt to take advantage of this at all.
I am curious how the devs of the other libs built on simdjson feel about this, and I still don't know the real-world occurrence rate of non-ASCII JSON. From there, here are some options:
- leave as-is
- detect
- detect and canonicalize before writing to tape
By non-ASCII JSON, I think you refer to the keys... because there are lots of non-ASCII characters in JSON documents, evidently.
I think it may make sense to detect this. We will want to reengineer this part in any case, because it makes sense to treat keys differently, something we do not do currently.
Makes sense, and yes, I mean keys. I have no strong feelings about it either way, since my library already has a handler for it.
Although I will say one thing: if you want to canonicalize after the tape has been written, it's sort of a pain. You can canonicalize where we currently call strcmp, using storage on the stack for that, but then you have to redo it every time you traverse that key. The alternative, caching the canonicalized forms, is a decent bit of work.
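Canonicalizing at comparison time could look roughly like this. This is a hypothetical sketch, not simdjson code: `key_equals` is an invented name, only ASCII-range `\u00XX` escapes are decoded, and a real implementation would also need to handle the other escapes, surrogate pairs, and keys longer than the stack buffer.

```cpp
#include <cstring>
#include <string_view>

// Hypothetical sketch: compare a raw key from the tape (which may contain
// \uXXXX escapes) against an unescaped query, decoding into a stack buffer
// so that no heap allocation is needed. Only ASCII-range \u00XX escapes are
// handled; everything else is out of scope for this illustration.
bool key_equals(std::string_view raw_key, std::string_view query) {
  char buf[256];  // stack storage; keys longer than this are rejected
  size_t out = 0;
  for (size_t i = 0; i < raw_key.size();) {
    if (out >= sizeof(buf)) return false;  // too long for this sketch
    if (raw_key[i] == '\\' && i + 5 < raw_key.size() && raw_key[i + 1] == 'u') {
      // parse the 4 hex digits of the escape
      unsigned cp = 0;
      for (size_t j = i + 2; j < i + 6; j++) {
        char c = raw_key[j];
        cp <<= 4;
        if (c >= '0' && c <= '9') cp |= c - '0';
        else if (c >= 'a' && c <= 'f') cp |= c - 'a' + 10;
        else if (c >= 'A' && c <= 'F') cp |= c - 'A' + 10;
        else return false;  // malformed escape
      }
      if (cp >= 0x80) return false;  // non-ASCII: out of scope here
      buf[out++] = static_cast<char>(cp);
      i += 6;
    } else {
      buf[out++] = raw_key[i++];
    }
  }
  return std::string_view(buf, out) == query;
}
```

The point of the stack buffer is exactly the trade-off described above: no allocation, but the unescaping work is repeated on every traversal of the key.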
> The alternative, caching the canonicalized forms, is a decent bit of work
I am guessing that we can handle canonicalized forms relatively cheaply because it will be so rarely used. I'll make a new issue out of it.
Delaying to release 0.4.
In our test set, we do not have non-ASCII keys:
$ for i in jsonexamples/*.json ; do echo $i; ./jsonstats $i | grep "key_count"; done
jsonexamples/apache_builds.json
"key_count" = 881,
"ascii_key_count" = 881,
jsonexamples/canada.json
"key_count" = 4,
"ascii_key_count" = 4,
jsonexamples/citm_catalog.json
"key_count" = 10935,
"ascii_key_count" = 10935,
jsonexamples/github_events.json
"key_count" = 180,
"ascii_key_count" = 180,
jsonexamples/gsoc-2018.json
"key_count" = 3793,
"ascii_key_count" = 3793,
jsonexamples/instruments.json
"key_count" = 1012,
"ascii_key_count" = 1012,
jsonexamples/marine_ik.json
"key_count" = 9680,
"ascii_key_count" = 9680,
jsonexamples/mesh.json
"key_count" = 2,
"ascii_key_count" = 2,
jsonexamples/mesh.pretty.json
"key_count" = 2,
"ascii_key_count" = 2,
jsonexamples/numbers.json
"key_count" = 0,
"ascii_key_count" = 0,
jsonexamples/random.json
"key_count" = 4001,
"ascii_key_count" = 4001,
jsonexamples/twitter.json
"key_count" = 1264,
"ascii_key_count" = 1264,
jsonexamples/twitterescaped.json
"key_count" = 1264,
"ascii_key_count" = 1264,
jsonexamples/update-center.json
"key_count" = 1896,
"ascii_key_count" = 1896,
What if the API allowed the user to pass in a function pointer for the comparison? It would cover the case-insensitive use case, as well as this one.
@michaeleisel
> what if the API allowed the user to pass in a function pointer for the comparison. it would cover the case-insensitive use-case, as well as this use case
Function pointers are typically not inlined, which has significant performance consequences. You pay for the flexibility.
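One common way around this, sketched below, is to take the comparator as a template parameter (e.g. a lambda), which keeps the call site visible to the optimizer. `find_key` and `iequal` are invented names for illustration, not part of the simdjson API.

```cpp
#include <cctype>
#include <cstddef>
#include <string_view>

// Sketch of the trade-off: a comparator passed as a template parameter can
// be inlined, whereas a plain function pointer usually cannot.
// find_key is hypothetical, not simdjson API.
template <typename Cmp>
bool find_key(std::string_view key, std::string_view wanted, Cmp cmp) {
  return cmp(key, wanted);  // call site the compiler can inline through
}

// One possible user-supplied policy: case-insensitive ASCII comparison.
inline bool iequal(std::string_view a, std::string_view b) {
  if (a.size() != b.size()) return false;
  for (size_t i = 0; i < a.size(); i++) {
    if (std::tolower((unsigned char)a[i]) != std::tolower((unsigned char)b[i]))
      return false;
  }
  return true;
}
```

Passing `iequal` directly still decays to a function pointer; passing a lambda (`find_key(key, wanted, [](auto a, auto b) { return iequal(a, b); })`) gives the compiler a concrete type to inline.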
Note that users can check whether their keys are ASCII by calling a function like the following...
```cpp
#include <cstdint>
#include <cstring>
#include <string_view>

bool is_ascii(std::string_view v) {
  uint64_t running = 0;
  size_t i = 0;
  // process 8 bytes at a time, OR-ing them into the accumulator
  for (; i + 8 <= v.size(); i += 8) {
    uint64_t payload;
    memcpy(&payload, v.data() + i, 8);
    running |= payload;
  }
  // handle the remaining tail bytes
  for (; i < v.size(); i++) {
    running |= static_cast<uint8_t>(v[i]);
  }
  // ASCII bytes never have the high bit set
  return (running & 0x8080808080808080) == 0;
}
```
👍
A note: by default, On Demand assumes keys are raw UTF-8 (no escapes) when looking up fields, which gives a significant speedup for the overwhelmingly common case.
(As such, I'm feeling like we don't absolutely need this for 1.0. Moving it out; feel free to move back / disagree :))
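To illustrate why assuming raw keys is a meaningful shortcut: under raw-byte matching, a key written with a `\u` escape in the document will not compare equal to the unescaped query string, even though both denote the same key. A contrived sketch (not simdjson code):

```cpp
#include <string_view>

// Raw-byte key matching (the fast path) compares the bytes as they appear
// in the document. A key written with the escape \u0061 therefore does not
// match the query "a", even though both denote the same key.
inline bool raw_match(std::string_view key_bytes_in_document,
                      std::string_view query) {
  return key_bytes_in_document == query;  // plain byte comparison, no unescaping
}
```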
By raw UTF-8, you mean that they don't use \u...? If so, the issue remains that the same Unicode character, i.e. grapheme cluster, can be represented by different code points. That having been said, I agree this is not a super high priority, and I have no need for it at this moment myself, so any version is fine from my perspective.
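As a concrete instance of that point (assuming UTF-8 byte strings; not simdjson code): the precomposed and combining-accent encodings of "é" are different byte sequences, so a byte-wise strcmp treats them as different keys.

```cpp
#include <cstring>

// "\xc3\xa9" is the precomposed code point U+00E9 ("é", NFC form, 2 bytes in
// UTF-8). "e\xcc\x81" is "e" followed by the combining acute accent U+0301
// (NFD form, 3 bytes). Both render as the same character, but a byte-wise
// comparison sees two different keys.
inline bool same_key_bytewise(const char* a, const char* b) {
  return std::strcmp(a, b) == 0;  // no Unicode normalization
}
```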
I don't think most json field lookup algorithms consider grapheme equivalence or normalization anyway :)
> I don't think most json field lookup algorithms consider grapheme equivalence or normalization anyway :)
Agreed.
Moved to 2.0.