Detect if any keys have non-ASCII characters

Open michaeleisel opened this issue 4 years ago • 18 comments

Non-ASCII keys seem pretty weird, but I guess they must be out there. Normally there are no keys like this, so you can deal with strings in a fast and simple way, e.g. using strcmp without having to canonicalize both strings being compared. So it would be nice to have some indicator that a non-ASCII character appears in a key somewhere in the JSON document, so that parsers built on top of simdjson know to be more careful. It's easy to vectorize the check for non-ASCII characters. I don't know what priority this should have, or the exact API (a simple bool?), but I thought I would bring it up.

michaeleisel avatar Aug 07 '19 13:08 michaeleisel

I think that your issue is valid. It is related to some existing issues like...

https://github.com/lemire/simdjson/issues/184

The main idea is that, often, keys are short and ASCII.

Whether there is any sense in having such a flag that users of the library can access... I do not know... but currently, simdjson does not attempt to take advantage of this at all.

lemire avatar Aug 07 '19 14:08 lemire

I am curious how the devs of the other libs built on simdjson feel about this, and I still don't know the real-world occurrence rate of non-ASCII JSON. From there, here are some options:

  • leave as-is
  • detect
  • detect and canonicalize before writing to tape

michaeleisel avatar Aug 20 '19 20:08 michaeleisel

By non-ASCII JSON, I think you are referring to the keys, because there are evidently lots of non-ASCII characters in JSON documents.

I think it may make sense to detect this. I think we want to reengineer in any case, because it makes sense to treat keys differently, something we do not do currently.

lemire avatar Aug 20 '19 20:08 lemire

Makes sense, and yes, I mean keys. I have no strong feelings about it either way, since my library already has a handler for it.

michaeleisel avatar Aug 20 '19 20:08 michaeleisel

Although I will say one thing... if you want to canonicalize after the tape has been written, it's sort of a pain. You can canonicalize at the point where we currently call strcmp, using storage on the stack for the result, but then you have to do it every time you traverse that key. The alternative, caching the canonicalized forms, is a decent bit of work.

michaeleisel avatar Aug 20 '19 20:08 michaeleisel

> The alternative, caching the canonicalized forms, is a decent bit of work

I am guessing that we can handle canonicalized forms relatively cheaply because it will be so rarely used. I'll make a new issue out of it.

lemire avatar Aug 20 '19 23:08 lemire

Delaying to release 0.4.

lemire avatar Jan 09 '20 21:01 lemire

In our test set, we do not have non-ASCII keys:

 $ for i in jsonexamples/*.json ; do echo $i; ./jsonstats $i | grep "key_count"; done
jsonexamples/apache_builds.json
      "key_count"                =        881,
      "ascii_key_count"          =        881,
jsonexamples/canada.json
      "key_count"                =          4,
      "ascii_key_count"          =          4,
jsonexamples/citm_catalog.json
      "key_count"                =      10935,
      "ascii_key_count"          =      10935,
jsonexamples/github_events.json
      "key_count"                =        180,
      "ascii_key_count"          =        180,
jsonexamples/gsoc-2018.json
      "key_count"                =       3793,
      "ascii_key_count"          =       3793,
jsonexamples/instruments.json
      "key_count"                =       1012,
      "ascii_key_count"          =       1012,
jsonexamples/marine_ik.json
      "key_count"                =       9680,
      "ascii_key_count"          =       9680,
jsonexamples/mesh.json
      "key_count"                =          2,
      "ascii_key_count"          =          2,
jsonexamples/mesh.pretty.json
      "key_count"                =          2,
      "ascii_key_count"          =          2,
jsonexamples/numbers.json
      "key_count"                =          0,
      "ascii_key_count"          =          0,
jsonexamples/random.json
      "key_count"                =       4001,
      "ascii_key_count"          =       4001,
jsonexamples/twitter.json
      "key_count"                =       1264,
      "ascii_key_count"          =       1264,
jsonexamples/twitterescaped.json
      "key_count"                =       1264,
      "ascii_key_count"          =       1264,
jsonexamples/update-center.json
      "key_count"                =       1896,
      "ascii_key_count"          =       1896,

lemire avatar Mar 25 '20 18:03 lemire

What if the API allowed the user to pass in a function pointer for the comparison? It would cover the case-insensitive use case as well as this one.

michaeleisel avatar Jun 21 '20 15:06 michaeleisel

@michaeleisel

> What if the API allowed the user to pass in a function pointer for the comparison? It would cover the case-insensitive use case as well as this one.

Calls through function pointers are typically not inlined, which has significant consequences for performance. You pay for the flexibility.

lemire avatar Jun 21 '20 18:06 lemire
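To make the trade-off above concrete, here is a minimal sketch comparing a comparator passed as a template parameter with one passed as a function pointer. Everything in it (`Field`, `find_field`, `find_field_fp`, `ExactEq`, `AsciiCaseInsensitiveEq`) is hypothetical and only for illustration; none of it is simdjson API, and the flat vector of pairs is an assumption standing in for a parsed object.

```cpp
#include <cstddef>
#include <string_view>
#include <utility>
#include <vector>

// Hypothetical stand-in for a parsed JSON object: a flat list of (key, value) pairs.
using Field = std::pair<std::string_view, int>;

// Comparator as a template parameter: the concrete callable type is known at
// compile time, so the compiler can usually inline the comparison into the loop.
template <typename KeyEq>
const Field* find_field(const std::vector<Field>& fields, std::string_view key, KeyEq eq) {
  for (const Field& f : fields) {
    if (eq(f.first, key)) { return &f; }
  }
  return nullptr;
}

// Comparator as a function pointer: flexible at run time, but each comparison
// is typically an indirect call that the optimizer may not be able to inline.
using KeyEqFn = bool (*)(std::string_view, std::string_view);
const Field* find_field_fp(const std::vector<Field>& fields, std::string_view key, KeyEqFn eq) {
  for (const Field& f : fields) {
    if (eq(f.first, key)) { return &f; }
  }
  return nullptr;
}

// Exact byte-wise key equality.
struct ExactEq {
  bool operator()(std::string_view a, std::string_view b) const { return a == b; }
};

// ASCII-only case-insensitive equality; non-ASCII bytes are compared as-is.
struct AsciiCaseInsensitiveEq {
  static char lower(char c) {
    return (c >= 'A' && c <= 'Z') ? static_cast<char>(c - 'A' + 'a') : c;
  }
  bool operator()(std::string_view a, std::string_view b) const {
    if (a.size() != b.size()) { return false; }
    for (std::size_t i = 0; i < a.size(); i++) {
      if (lower(a[i]) != lower(b[i])) { return false; }
    }
    return true;
  }
};
```

Both variants cover the case-insensitive use case; the template version trades some code size (one instantiation per comparator type) for the chance of inlining.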

Note that users can check whether their keys are ASCII by calling a function like the following...

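#include <cstddef>
#include <cstdint>
#include <cstring>
#include <string_view>

// Returns true if every byte of v is ASCII (high bit clear).
bool is_ascii(std::string_view v) {
  uint64_t running = 0;
  size_t i = 0;
  // OR the input together eight bytes at a time; memcpy avoids unaligned loads.
  for (; i + 8 <= v.size(); i += 8) {
    uint64_t payload;
    std::memcpy(&payload, v.data() + i, 8);
    running |= payload;
  }
  // Fold in the remaining tail bytes; the cast prevents sign extension of char.
  for (; i < v.size(); i++) {
    running |= static_cast<uint8_t>(v[i]);
  }
  // Any set high bit means at least one non-ASCII byte was seen.
  return (running & 0x8080808080808080) == 0;
}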

lemire avatar Jul 21 '20 17:07 lemire
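As a rough illustration of how a parser built on top of simdjson might turn a per-key check into the "simple bool" asked for at the top of this issue, here is a sketch. The helper `all_keys_ascii` and the idea of collecting keys into a `std::vector<std::string_view>` are assumptions made for the example, not simdjson API; `is_ascii` follows the same OR-and-mask idea as the function above.

```cpp
#include <cstddef>
#include <cstdint>
#include <cstring>
#include <string_view>
#include <vector>

// Per-key check: OR all bytes together, then test the high bits once.
bool is_ascii(std::string_view v) {
  std::uint64_t running = 0;
  std::size_t i = 0;
  for (; i + 8 <= v.size(); i += 8) {
    std::uint64_t payload;
    std::memcpy(&payload, v.data() + i, 8);
    running |= payload;
  }
  for (; i < v.size(); i++) {
    running |= static_cast<std::uint8_t>(v[i]);
  }
  return (running & 0x8080808080808080ULL) == 0;
}

// Hypothetical document-level flag: true only if every key is pure ASCII.
// The caller is assumed to have gathered the keys while walking the document.
bool all_keys_ascii(const std::vector<std::string_view>& keys) {
  for (std::string_view key : keys) {
    if (!is_ascii(key)) { return false; }
  }
  return true;
}
```

A wrapper could compute this flag once per document and use plain byte-wise comparison whenever it is true, falling back to a more careful comparison otherwise.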

👍

michaeleisel avatar Jul 21 '20 19:07 michaeleisel

A note: by default, On Demand assumes keys are raw UTF-8 (no escapes) when looking up fields, which gives a significant speedup for the overwhelmingly common case.

jkeiser avatar Jan 25 '21 07:01 jkeiser
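A small sketch of what a raw (no-unescape) key comparison implies, under the assumption that the lookup compares the bytes between the quotes in the JSON text directly against the requested name. The function `raw_key_equals` is hypothetical and is not simdjson's implementation.

```cpp
#include <string_view>

// Hypothetical raw comparison: raw_key is the key exactly as written in the
// JSON text (without the surrounding quotes), and wanted is the name the
// caller is looking for. No escape sequences are decoded.
bool raw_key_equals(std::string_view raw_key, std::string_view wanted) {
  return raw_key == wanted;
}
```

With this scheme, a key written as `abc` matches a lookup for `abc`, while the equivalent escaped spelling `\u0061bc` does not, because the escape is never decoded. That is the price of skipping unescaping for the common case.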

(As such, I'm feeling like we don't absolutely need this for 1.0. Moving out; feel free to move back / disagree :))

jkeiser avatar Jan 25 '21 07:01 jkeiser

By raw UTF-8, you mean that they don't use \u...? If so, the issue is still that the same Unicode character, i.e. grapheme cluster, can be represented by different code points. That having been said, I agree this is not a super high priority and have no need for it at this moment myself, so any version is fine from my perspective.

michaeleisel avatar Jan 25 '21 14:01 michaeleisel
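For readers unfamiliar with the normalization issue raised here, a minimal sketch: the letter "é" can be encoded either as the single precomposed code point U+00E9 or as "e" followed by the combining acute accent U+0301. Both render the same, but their UTF-8 bytes differ, so a byte-wise key comparison treats them as different keys. The names `precomposed`, `decomposed`, and `bytewise_equal` are made up for the example.

```cpp
#include <string_view>

// "é" as one precomposed code point, U+00E9 (UTF-8: C3 A9).
constexpr std::string_view precomposed = "\xC3\xA9";

// "é" as 'e' plus combining acute accent U+0301 (UTF-8: 65 CC 81).
constexpr std::string_view decomposed = "e\xCC\x81";

// Byte-wise comparison, as strcmp or memcmp would do; no Unicode normalization.
bool bytewise_equal(std::string_view a, std::string_view b) {
  return a == b;
}
```

Making the two forms compare equal would require normalizing both sides (for example to NFC) before comparing, which is the canonicalization cost discussed earlier in this thread.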

I don't think most json field lookup algorithms consider grapheme equivalence or normalization anyway :)

jkeiser avatar Jan 25 '21 15:01 jkeiser

> I don't think most json field lookup algorithms consider grapheme equivalence or normalization anyway :)

Agreed.

lemire avatar Jan 25 '21 21:01 lemire

Moved to 2.0.

lemire avatar Jan 25 '21 21:01 lemire