Add fast-path for simple filters
I have a ~600MB NDJSON file with ~1M lines (happy to share if it's helpful).
Using jq to search for an individual entry:
$ time jq 'select(.id==3260839)' file.ndjson
# ~9 seconds
Using fx -s to do the same is ~4x slower:
$ time fx -s file.ndjson '?.id == 3260839'
# ~33s
Using skip (ala #292) without -s is considerably faster but still slower than jq:
$ time fx file.ndjson 'x.id == 3260839 ? x : skip'
# ~18s
Of course, I can grep first to speed things up, but it would be nice if this weren't necessary:
$ time grep 3260839 file.ndjson | fx -s '?.id == 3260839'
# ~7s
Since there are only ~1M rows in this ndjson file, I doubt that running x => x.id == 3260839 is the bottleneck. Maybe it's just the JSON parsing?
> happy to share if it's helpful

Please, it would be useful :)
> Using fx -s to do the same is ~4x slower

Yes! For -s slurp, fx has to create an array of all 1M lines in memory. No surprise that it is slow. I think there is not much we can do here.
> Using skip (ala #292) without -s

I need to profile what is slower: the parsing or the JS engine. I suspect it is the JS engine.
> Since there's only ~1M rows in this ndjson file, I doubt that running x => x.id == 3260839 is the bottleneck. Maybe it's just the JSON parsing?

I made JSON parsing pretty fast =) During parsing, no allocations are made; no objects or arrays are constructed. But as soon as some JS code needs to be executed, objects/arrays have to be constructed from the JSON text and passed to the JS engine.

I will think of a solution for simple filtering cases like fx -s '?.id == 3260839'. Maybe it will be possible to do some filtering without constructing objects or running the JS engine at all.
>> happy to share if it's helpful
>
> Please, it will be useful)
I put a copy at https://storage.googleapis.com/danvk-boggle/results/4x4-full.ndjson.zip (it's ~170MB).
>> Using fx -s to do the same is ~4x slower
>
> Yes! For -s slurp, fx has to create an array of all 1M lines in memory. No surprise that it is slow. I think there is not much we can do here.
I suppose if the first operation is a filter or a map, you could still operate one item at a time, without having to create the full array in memory.
> I will think of a solution for simple filtering cases like fx -s '?.id == 3260839'. Maybe it will be possible to do some filtering without constructing objects or running the JS engine at all.
Yes, that would be helpful. Though remember that == has weird semantics in JS :)
> "123" == 123
true
> "123" === 123
false
> I suppose if the first operation is a filter or a map, you could still operate one item at a time, without having to create the full array in memory.
Yes! This could be done.
> Yes, that would be helpful. Though remember that == has weird semantics in JS :)
I'm thinking of maybe allowing faster iteration if the special syntax with ?.xxx is used.
I tested fx with only JSON parsing and no JS engine. As expected, this is very fast.
Let's add a "fast path" to fx: for special cases like ?.xx == y, bypass the JS engine.
Also, filtering without JS will work only for simple cases:
- ?.var == 42
- ?.var == "str"
- ?.var == 'str'
- ?.value > 0
- etc.
Other cases may be added as needed.
That would do the trick. I'm just thinking that your fast path will need to replicate JavaScript's implicit coercions to preserve existing behavior, e.g.
$ echo '{"x": 1}\n{"x": "1"}' | fx -s '?.x == 1'
[
{ "x": 1 },
{ "x": "1" }
]
$ echo '{"x": null}\n{"x": "1"}' | fx -s '?.x <= 1'
[
{ "x": null },
{ "x": "1" }
]
Hmmm, I guess we can start with just ==. Maybe add .includes() as well.