simdjson
Detect if any keys have non-ASCII characters
Non-ASCII keys seem pretty weird, but I guess they must be out there. Normally there are no keys like this, so you can deal with strings in a fast and simple way, e.g. using strcmp without having to canonicalize both strings being compared. So it would be nice if there were some indicator that somewhere in the JSON document a key contains a non-ASCII character, so that parsers built on top of simdjson know to be more careful. It's easy to vectorize the check for non-ASCII characters. I don't know what priority this should have, or the exact API (a simple bool?), but I thought I would bring it up.
I think that your issue is valid. It is related to some existing issues like...
https://github.com/lemire/simdjson/issues/184
The main idea is that, often, keys are short and ASCII.
Whether there is any sense in having such a flag that users of the library can access... I do not know... but currently, simdjson does not attempt to take advantage of this at all.
I am curious how the devs of the other libs built on simdjson feel about this, and I still don't know the real-world occurrence rate of non-ASCII JSON. From there, here are some options:
- leave as-is
- detect
- detect and canonicalize before writing to tape
By non-ASCII JSON, I think you refer to the keys... because there are lots of non-ASCII characters in JSON documents, evidently.
I think it may make sense to detect this. We will want to reengineer this part in any case, because it makes sense to treat keys differently, something we do not do currently.
Makes sense, and yes, I mean keys. I have no strong feelings about it either way, since my library already has a handler for it.
Although I will say one thing: if you want to canonicalize after the tape has been written, it's sort of a pain. You can canonicalize where we currently call strcmp, using storage on the stack for that, but then you have to redo it every time you traverse that key. The alternative, caching the canonicalized forms, is a decent bit of work.
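Canonicalizing at comparison time could look roughly like this. This is a hypothetical sketch, not simdjson code: `key_equals` is an invented name, only ASCII-range `\u00XX` escapes are decoded, and a real implementation would also need to handle the other escapes, surrogate pairs, and keys longer than the stack buffer.

```cpp
#include <cstring>
#include <string_view>

// Hypothetical sketch: compare a raw key from the tape (which may contain
// \uXXXX escapes) against an unescaped query, decoding into a stack buffer
// so that no heap allocation is needed. Only ASCII-range \u00XX escapes are
// handled; everything else is out of scope for this illustration.
bool key_equals(std::string_view raw_key, std::string_view query) {
  char buf[256];  // stack storage; keys longer than this are rejected
  size_t out = 0;
  for (size_t i = 0; i < raw_key.size();) {
    if (out >= sizeof(buf)) return false;  // too long for this sketch
    if (raw_key[i] == '\\' && i + 5 < raw_key.size() && raw_key[i + 1] == 'u') {
      // parse the 4 hex digits of the escape
      unsigned cp = 0;
      for (size_t j = i + 2; j < i + 6; j++) {
        char c = raw_key[j];
        cp <<= 4;
        if (c >= '0' && c <= '9') cp |= c - '0';
        else if (c >= 'a' && c <= 'f') cp |= c - 'a' + 10;
        else if (c >= 'A' && c <= 'F') cp |= c - 'A' + 10;
        else return false;  // malformed escape
      }
      if (cp >= 0x80) return false;  // non-ASCII: out of scope here
      buf[out++] = static_cast<char>(cp);
      i += 6;
    } else {
      buf[out++] = raw_key[i++];
    }
  }
  return std::string_view(buf, out) == query;
}
```

The point of the stack buffer is exactly the trade-off described above: no allocation, but the unescaping work is repeated on every traversal of the key.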
> The alternative, caching the canonicalized forms, is a decent bit of work
I am guessing that we can handle canonicalized forms relatively cheaply because it will be so rarely used. I'll make a new issue out of it.
Delaying to release 0.4.
In our test set, we do not have non-ASCII keys:
$ for i in jsonexamples/*.json ; do echo $i; ./jsonstats $i | grep "key_count"; done
jsonexamples/apache_builds.json
"key_count" = 881,
"ascii_key_count" = 881,
jsonexamples/canada.json
"key_count" = 4,
"ascii_key_count" = 4,
jsonexamples/citm_catalog.json
"key_count" = 10935,
"ascii_key_count" = 10935,
jsonexamples/github_events.json
"key_count" = 180,
"ascii_key_count" = 180,
jsonexamples/gsoc-2018.json
"key_count" = 3793,
"ascii_key_count" = 3793,
jsonexamples/instruments.json
"key_count" = 1012,
"ascii_key_count" = 1012,
jsonexamples/marine_ik.json
"key_count" = 9680,
"ascii_key_count" = 9680,
jsonexamples/mesh.json
"key_count" = 2,
"ascii_key_count" = 2,
jsonexamples/mesh.pretty.json
"key_count" = 2,
"ascii_key_count" = 2,
jsonexamples/numbers.json
"key_count" = 0,
"ascii_key_count" = 0,
jsonexamples/random.json
"key_count" = 4001,
"ascii_key_count" = 4001,
jsonexamples/twitter.json
"key_count" = 1264,
"ascii_key_count" = 1264,
jsonexamples/twitterescaped.json
"key_count" = 1264,
"ascii_key_count" = 1264,
jsonexamples/update-center.json
"key_count" = 1896,
"ascii_key_count" = 1896,
What if the API allowed the user to pass in a function pointer for the comparison? It would cover the case-insensitive use case, as well as this one.
@michaeleisel
> what if the API allowed the user to pass in a function pointer for the comparison. it would cover the case-insensitive use-case, as well as this use case
Function pointers are typically not inlined, which has significant performance consequences. You pay for the flexibility.
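One common way around this, sketched below, is to take the comparator as a template parameter (e.g. a lambda), which keeps the call site visible to the optimizer. `find_key` and `iequal` are invented names for illustration, not part of the simdjson API.

```cpp
#include <cctype>
#include <cstddef>
#include <string_view>

// Sketch of the trade-off: a comparator passed as a template parameter can
// be inlined, whereas a plain function pointer usually cannot.
// find_key is hypothetical, not simdjson API.
template <typename Cmp>
bool find_key(std::string_view key, std::string_view wanted, Cmp cmp) {
  return cmp(key, wanted);  // call site the compiler can inline through
}

// One possible user-supplied policy: case-insensitive ASCII comparison.
inline bool iequal(std::string_view a, std::string_view b) {
  if (a.size() != b.size()) return false;
  for (size_t i = 0; i < a.size(); i++) {
    if (std::tolower((unsigned char)a[i]) != std::tolower((unsigned char)b[i]))
      return false;
  }
  return true;
}
```

Passing `iequal` directly still decays to a function pointer; passing a lambda (`find_key(key, wanted, [](auto a, auto b) { return iequal(a, b); })`) gives the compiler a concrete type to inline.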
Note that users can check whether their keys are ASCII by calling a function like the following...
```cpp
#include <cstdint>
#include <cstring>
#include <string_view>

bool is_ascii(std::string_view v) {
  uint64_t running = 0;
  size_t i = 0;
  // process 8 bytes at a time, OR-ing them into the accumulator
  for (; i + 8 <= v.size(); i += 8) {
    uint64_t payload;
    memcpy(&payload, v.data() + i, 8);
    running |= payload;
  }
  // handle the remaining tail bytes
  for (; i < v.size(); i++) {
    running |= static_cast<uint8_t>(v[i]);
  }
  // ASCII bytes never have the high bit set
  return (running & 0x8080808080808080) == 0;
}
```
👍
A note: by default, On Demand assumes keys are raw UTF-8 (no escapes) when looking up fields, which gives a significant speedup for the overwhelmingly common case.
(As such, I'm feeling like we don't absolutely need this for 1.0. Moving it out; feel free to move back / disagree :))
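To illustrate why assuming raw keys is a meaningful shortcut: under raw-byte matching, a key written with a `\u` escape in the document will not compare equal to the unescaped query string, even though both denote the same key. A contrived sketch (not simdjson code):

```cpp
#include <string_view>

// Raw-byte key matching (the fast path) compares the bytes as they appear
// in the document. A key written with the escape \u0061 therefore does not
// match the query "a", even though both denote the same key.
inline bool raw_match(std::string_view key_bytes_in_document,
                      std::string_view query) {
  return key_bytes_in_document == query;  // plain byte comparison, no unescaping
}
```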
By raw UTF-8, you mean that they don't use \u...? If so, the issue remains that the same Unicode character, i.e. grapheme cluster, can be represented by different code points. That having been said, I agree this is not a super high priority, and I have no need for it at this moment myself, so any version is fine from my perspective.
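As a concrete instance of that point (assuming UTF-8 byte strings; not simdjson code): the precomposed and combining-accent encodings of "é" are different byte sequences, so a byte-wise strcmp treats them as different keys.

```cpp
#include <cstring>

// "\xc3\xa9" is the precomposed code point U+00E9 ("é", NFC form, 2 bytes in
// UTF-8). "e\xcc\x81" is "e" followed by the combining acute accent U+0301
// (NFD form, 3 bytes). Both render as the same character, but a byte-wise
// comparison sees two different keys.
inline bool same_key_bytewise(const char* a, const char* b) {
  return std::strcmp(a, b) == 0;  // no Unicode normalization
}
```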
I don't think most json field lookup algorithms consider grapheme equivalence or normalization anyway :)
> I don't think most json field lookup algorithms consider grapheme equivalence or normalization anyway :)
Agreed.
Moved to 2.0.