simd-json icon indicating copy to clipboard operation
simd-json copied to clipboard

Consider using IndexMap: json order is random after a 33 keys in map

Open ritchie46 opened this issue 1 year ago • 5 comments

This python snippet wouldn't round trip if simd-json was used:

import json

obj = {c: None for c in "abcdefghijklmnopqrstuvwxyz1234567"}
s = json.dumps(obj)
assert json.loads(s) == obj

This is because Object uses a hashmap which doesn't preserve the order of the json string keys. This can be resolved by using I would like to consider using indexmap, which has excellent performance, AND guarantees index order.

This issue was discovered in: https://github.com/pola-rs/polars/issues/17439

If this is accepted, I am willing to make a PR.

ritchie46 avatar Jul 07 '24 09:07 ritchie46

Hi ritchie,

by specification json objects are unordered sets but that also doesn't mean it has to be unordered. the order for the first 32 keys is an implementation detail. We use halfbrown that uses a simple vector for up to 32 elements which makes small objects, which are frequent with json, extremely fast and small (memory wise).

That's just for background :). Generally if we can make it optional and preserve the current behaviour I'm in, giving people more options is nice!

A feature to watch out for is known-key, it is quite powerful when using lookups of the same key repeatedly and it'd be good to preserve it

I could see a few ways to make this work:

  1. a feature flag, probably the easiest but there is some chance of conflicts with other flags
  2. a feature flag in halfbrown, same as above but it limits the interface change to the library for hashmaps that simd-json already uses
  3. make the map type a generic - this will require more work and generally I like it but OTOH I'm not sure how well that would play out or how much work it would be
  4. same as abovbe but inside halfbrown

Licenser avatar Jul 07 '24 10:07 Licenser

Ah, given that you also maintain halfbrown, maybe doing that upstream is nicest. Would a halfbrown based on IndexMap be worth it?

For me a feature flag would be fine. So that would be option 2. :eyes:

ritchie46 avatar Jul 07 '24 12:07 ritchie46

perfect, sounds like we got a plan then :)!

Also #377 might be of interest for you, it keeps order as long as no mutations are done already (and it should "just work" with index map as well)

Licenser avatar Jul 07 '24 12:07 Licenser

Great! Thanks for being receptive to this. :)

For #377, I need a little bit more context. :sweat_smile: What does it do? What is the semantic difference between a borrowed value and a tape?

ritchie46 avatar Jul 07 '24 13:07 ritchie46

the borrowed value is a fully mutable DOM equivalent to serde_json::Value; the tape value is a flat structire (a single vec) that represents the JSON document, since it has no nesting and doesn't require extra allocations, inserts, etc it is significantly faster.

the downside is that the tape is not easily mutable; the lazy value bridges the gab between the two :)

Licenser avatar Jul 07 '24 14:07 Licenser

A little late to the party, but just arrived here after hitting an edge case at the 33 key mark :-)

I am using simd-json in my JSON schema inference crate genson-core

I see #377 got merged, does that mean this is ready to be worked on? 👀

My initial impression is I'm after the mutable IndexMap backed variant rather than the tape approach 🤔

I've contributed an indexmap feature in a PR to halfbrown to support this:

  • https://github.com/Licenser/halfbrown/pull/37

lmmx avatar Oct 07 '25 22:10 lmmx

I don't think adding the feature flag to half-brown is the right path here, this would lead that if enabled every use of half-brown in a project would become index-map based which could lead to very surprising and unwelcome side effects.

The best approach here would be a Value that is distinct from the current one and uses a different backed for the hash-map, that way they could be used side by side.

As an example if I use your genson-core library for schemas in a project, the index-map feature would be applied to my instances of Value too leading to a difference in behavior and performance which would be surprising.

Licenser avatar Oct 08 '25 07:10 Licenser

The best approach here would be a Value that is distinct from the current one

I took this to mean new types like OrderedBorrowedValue and OrderedOwnedValue in simd-json (just to be explicit with how I read that)

I've got some tests for 33+ keys and started modifying the Object<'value> type to conditionally compile in HashMap/IndexMap based on the preserve_order feat, but I'm hitting trait bound errors because value-trait doesn't implement ObjectTrait for IndexMap:

error[E0277]: the trait bound `IndexMap<Cow<'value, str>, Value<'value>, ...>: ObjectTrait` is not satisfied
   --> src/value/borrowed.rs:300:19
    |
300 |     type Object = Object<'value>;
    |                   ^^^^^^^^^^^^^^ the trait `value_trait::prelude::ObjectTrait` is not implemented for `IndexMap<...>`
    |
    = help: the following other types implement trait `value_trait::prelude::ObjectTrait`:
              HashMap<MapK, MapE, S>
              SizedHashMap<MapK, MapE, S>

It seems like the path forward is:

  1. Add IndexMap support to value-trait (behind an indexmap feature flag?)
  2. Then use it in simd-json's preserve_order feature

Does that sound right? I presume exposing traits behind a feature flag won't have undesirable effects on other crate consumers like the halfbrown approach would. Happy to contribute the value-trait changes if I'm on the right track here :-)

Update working PR in #436, I've tested it locally and pushed a Cargo TOML that expects the associated value-trait crate to be released as version 0.13 (accompanying PR sent there as https://github.com/simd-lite/value-trait/pull/60)

lmmx avatar Oct 12 '25 13:10 lmmx

Ja, I prefer a new type for the index map backed value over it switching to allow the use of multiple map backends in the same project without feature flag conflicts or surprising changes in behavior. An alternative to OrderedBorrowedValue would be parameterizing BorrowedValue with a map backend not sure if that would improve things or make then tougher 😅 ultimately I think both are okay.

Licenser avatar Oct 12 '25 18:10 Licenser

Thanks for releasing value-trait, I updated the PR with it at 0.12.1 so looks like it'll work ! 🙂 I've re-rolled into a single commit now that it's feature complete

lmmx avatar Oct 12 '25 19:10 lmmx

I'm not against generics with a default but since I can't fully predict all the requirements they might incur I'd avoid them too. Though on the other hand it looks like it'd save a lot of duplication, but not sure how to do it, I tried a recursive type alias but no success.

I think adding OrderedObject types behind a feature flag and then a mod at the end of the src/borrowed/value.rs module that provides a Value variant that is then re-exported as OrderedBorrowedValue in src/value.rs could be nice, going to give that a try, 3rd time's the charm? 😅

lmmx avatar Oct 13 '25 11:10 lmmx