jsonpath-rust icon indicating copy to clipboard operation
jsonpath-rust copied to clipboard

get rid of JsonPathFinder and Boxes

Open xMAC94x opened this issue 2 years ago • 7 comments

Hi @besok I started using your crate and got into performance issues, now I did some looks into it and tried to simplify the interface a bit. Removed the JsonPathFinder and just had the methods there as stand-alone methods, like common in the rust env.

Want to get your opinion on this, what do you think ?

xMAC94x avatar Dec 05 '23 17:12 xMAC94x

Great. Thanks for the pull request. I will try to come back to it this week( I am currently on vacation without access to my laptop unfortunately:( )

Btw, getting rid of the given struct has boosted performance or it is rather due to redundancy of the layouts?

besok avatar Dec 06 '23 10:12 besok

I wanted to search in some existing json Value however i can't pass a &Value to the struct, so i would need to copy it. One could think about extending the struct to also allow & but its way easier to get rid of it entirely, it's not doing too much anyways.

Additionally i think having a sleek api that makes it easy to reuse code and doesn't encouraging cloning is a good way to make it hard for consumers of this crate to missuse. Ideally it should be straight forward to use, and the straight forward approach should come with no hidden cost.

But for now enjoy the vacation :)

xMAC94x avatar Dec 06 '23 20:12 xMAC94x

Yeah. thanks. :)

I agree the supplementary costs should be avoided, the struct JsonPathFinder was created simply intending to use it directly. like having two strings (JSON and js path) and calculating the path "on a fly".

Maybe yo are right. I will need to recall what the struct does now actually =)

besok avatar Dec 07 '23 02:12 besok

closes #59

besok avatar Dec 07 '23 02:12 besok

It has some fmt and clippy failures. You can fix it locally just kicking off the appropriate commands and correct the code

besok avatar Dec 23 '23 11:12 besok

@besok can you fix those maybe? I would like to use this otherwise I will fork it no problem but since its only clippy fixes its probably done quickly.

Norlock avatar May 15 '24 10:05 Norlock

There are merge conflicts since the aforementioned in the PR structure obtained a new field (cfg)! I can take a look at the weekend and either resolve the conflicts or discard the pr :)

besok avatar May 15 '24 21:05 besok

Just had a look, it seems like #62 adds a cache for regex, imo this is is not necessary when doing it like in this PR. You dont need a cache because you just use the existing regex, and this can cache the regex at the caller.

Simplifies this code and lets the caller do stuff more efficient.

I can do a rebase, but we first have to discuss what to do with those caching behavior.

xMAC94x avatar May 17 '24 10:05 xMAC94x

You dont need a cache because you just use the existing regex, and this can cache the regex at the caller.

I don't grasp fully what you mean tbh, can you give an example or expand the answer?

besok avatar May 17 '24 10:05 besok

I'll try to explain, I don't remember the internals from this crate as the PR is a bit older.

So mostly #61 complained about the crate beeing slow. And they even said what is slow, recompiling the regex.

By looking in this crate, example:

// https://github.com/besok/jsonpath-rust/blob/main/src/lib.rs#L150
let json: Value = serde_json::from_str("{}").unwrap();
let path = json.path("$..book[?(@.author size 10)].title").unwrap();

Lets assume the following speed behavior:

  1. creating a json object: slow
  2. creating a json path object: slow
  3. actually searching: fast

line 1 does step 1 line 2 does step 2 and 3 combined.

When people complain about something beeing slow, they prob do it for alot of things, e.g. have the same json data and search different things inside, or have different json data but search the same thing inside. (or even both, the important stuff is, often one thing stays the same).

The way this is currently build does not profit when one thing stays constant, it has to redo step 1 or step 2 in either case. Those are the slow steps.

With this PR:

let json: Value = serde_json::from_str("{}")?;
let path: JsonPathInst = JsonPathInst::from_str("$..book[?(@.author size 10)].title")?;

let v = find_slice(&path, &json);

You see that we need an additional line, one line per step. And you'll notice that the find_slice actually just takes references, that means it does not consume the data as its only accessed read-only.

This allows a cunsumer of this librarby, like the person from #61 to precompile the regex easily, and reuse it as often as they want. They can even go multihtreading if they need.

The idea is: rather to inject the cache behavior into the whole stack: just make this library work simply, and thus allows devs to reuse (=effectively building their own cache) regex that are already compiled

xMAC94x avatar May 17 '24 16:05 xMAC94x

Okay, just had some refactoring and mid into it i noted what we were talking about, the regex is actually executed as part of the search on not when creating the jsonpathinst. thus getting rid of the cache is a regression here.

Though anyway maybe I can convince you, rather than building a cache, evaluate the respective regexs on creation of the jsonpathinst.

Getting rid of all the boxes with this PR improves the default case on my machine from: 139.82 µs down to 86.049 µs When keeping the data, even down to 56.392 µs. Thats not quite yet the 24.638 µs

But on the other hand, every other operation, that is not a regex, gets a speeup from 23.448 µs to 512.16 ns (e.g. see equal benchmark).

When combining both we might even see sub us for the regex case

xMAC94x avatar May 17 '24 17:05 xMAC94x

The introduced cache handles a specific situation. The compilation of regex is a voracious operation and takes quite some time.

Thus, if a user has either multithreading or a stream of queries with the same regex in the long-living app, the cache will boost performance. It can be also serialized and spreaded further (I doubt it needs to anybody but the opportunity exists )

Though anyway maybe I can convince you, rather than building a cache, evaluate the respective regexs on creation of the jsonpathinst.

This is an unconnected process or at least I don't see a bond.

Still, I would not say the cache and removing boxes and other auxiliary structures is a mutually exclusive process. And grooming the API and removing deprecated (so to say) structures also would be good to merge into. Therefore, I would not remove the cache but instead introduce a new method like find_slice_with_cfg and so on.

besok avatar May 17 '24 18:05 besok

Here is some ugly code that statically compiles the regex, when the JsonPathInst is generated: https://github.com/besok/jsonpath-rust/commit/e527958315967cf13f65a887403d9677292028cc

compilation time goes up by: JsonPathInst generation time: [76.777 µs 76.844 µs 76.915 µs] change: [+2.6997% +3.0563% +3.3790%] (p = 0.00 < 0.05) Performance has regressed.

while now execution time improves by: regex bench with reuse time: [591.67 ns 594.80 ns 598.18 ns] change: [-98.975% -98.969% -98.964%] (p = 0.00 < 0.05) Performance has improved.

this is an other 50-times fold improvement.

The only thing i have: It currently only works with static regex, idk if dynamic regex is a thing. Do you know if dynamic regex is a thing. I also dont know if there is an less-ugly way to encode it rather than in the FilterSign.

xMAC94x avatar May 17 '24 19:05 xMAC94x

But that is precisely the case. We parse a string that represents regex. The compilation of this string into the particular regular expression is what the cache is summoned to solve.

In your example, you precompile it beforehand outside of the jspath.

besok avatar May 17 '24 20:05 besok

sorry, non-native speaker here. Is this now a good thing that its compiled with the jspath or a bad thing ?

xMAC94x avatar Jun 03 '24 07:06 xMAC94x

It is not bad or good. It does not solve the initial problem. The problem is the following: Given - we have 1K requests for jspath like $.[?(@.field == '[a-zA-Z]')]. Target - compile it only once and then use already compiled regex.

You precompile a regular expression outside of the library. But it should be done inside the library because we parse a string into regex inside the library. So, you need something inside the library that will keep track of the regexes that are already compiled.

besok avatar Jun 03 '24 08:06 besok

in that example:

let path = JsonPathInst::from_str('$.[?(@.field == '[a-zA-Z]')]').unwrap();
for _ in 0..1000 {
  // actually 1000 different jsons:
  let json = json!({
        "field":"abcXYZ",
  });
  let x = jsonpath_rust::find(&path, &json);
}

we can solve the problem with regex compilation in 2 ways:

  • add a cache that espeically only caches regexes and stores them.
  • just compile the regex when the path is created, that way we can leverage both effects. the regex is only compiled ones, but the rest is only compiled once. Even higher speed increase. no need for an internal hashmap with expensive hashmap lookups.

Or do I get it wrong somehow ?

xMAC94x avatar Jun 03 '24 12:06 xMAC94x

I think in general you are correct.

add a cache that espeically only caches regexes and stores them.

Yeah, that is how it is implemented now.

just compile the regex when the path is created, that way we can leverage both effects. the regex is only compiled ones, but the rest is only compiled once. Even higher speed increase. no need for an internal hashmap with expensive hashmap lookups.

Maybe, I missed something but a couple of points. You want the users to implement a so to say, user cache with jsonpathinst instead of regexes, don't you? It will work if you have the same queries but in general, a user can have something like that:

  • $.[?(@.field == '[a-zA-Z]')]
  • $.field1[?(@.field == '[a-zA-Z]')]
  • $.[?(@.field == '[a-zA-Z]')].field2.length
  • $.[?(@ == '[a-zA-Z]')]

and so on. The queries will be different and the regex will be calculated again.

besok avatar Jun 03 '24 14:06 besok

Good point, I haven't though about that yet.

I don't necessarily want the user to build a cache in the sense of a HashMap. But I want the user be able to reuse jsonpathinst for multiple json data easily. That has a similary effect on speed (and technically could also be called a cache) because it doesn't need to be reevaluated on each .find()

Mentioning different Json-Paths but with the same regex inside is truely something we should consider. Such a scenario would definitely benefit from a cache. The question now is how common is that for users of this crate. Because while a cache helps if you have many matches, it also adds costs for all other queries that are not different-json-path-containing-same-regex.

It now gets a bit tricky, because we have to think how a user of this crates uses it and find a good tradeoff.

IMO I would prob see a user have some different-json-path-containing-same-regex like paths, but I don't see them scaling. what I want to say is, maybe they have 10 of this queries, but it would rather be uncommon that they have thousands.

I would probably expect most use cases are either: The application has a fixed number of paths and many scaling data, or the application has a fixed number of data and is checking many paths. In both cases, beeing able to reuse via reference would provide a huge benefit (see above -98,9%, that 100x faster).

That would allow, even if there are hundreds of jsonpath where the same regex is used, we would still be faster in the end.

The only exception is probably a application where every jsonpath, without exception is different from each other but share the same regex

xMAC94x avatar Jun 04 '24 08:06 xMAC94x

Very good point. I don't know why but the idea of caching the whole jspath slipped through my fingers. I will think about it in the evening a bit, searching for pitfalls.

besok avatar Jun 04 '24 08:06 besok

I have taken a look to the benches and truly it seems the benefit of having a regex cache is leveled by the alternative of caching jspathinst itself.

Thanks for the great PR! The solution is definitely better and cleaner than the one with the regex cache.

besok avatar Jun 04 '24 21:06 besok