fx icon indicating copy to clipboard operation
fx copied to clipboard

Add fast-path for simple filters

Open danvk opened this issue 3 months ago • 7 comments

I have a ~600MB NDJSON file with ~1M lines (happy to share if it's helpful).

Using jq to search for an individual entry:

$ time jq 'select(.id==3260839)' file.ndjson
# ~9 seconds

Using fx -s to do the same is ~4x slower:

$ time fx -s file.ndjson '?.id == 3260839'
# ~33s

Using skip (ala #292) without -s is considerably faster but still slower than jq:

$ time fx file.ndjson 'x.id == 3260839 ? x : skip'
# ~18s

Of course, I can grep first to speed things up, but it would be nice if this weren't necessary:

$ time grep 3260839 file.ndjson | fx -s '?.id == 3260839'
# ~7s

Since there's only ~1M rows in this ndjson file, I doubt that running x => x.id == 3260839 is the bottleneck. Maybe it's just the JSON parsing?

danvk avatar Sep 11 '25 15:09 danvk

happy to share if it's helpful

Please, it will be useful)

Using fx -s to do the same is ~4x slower

Yes! For -s slurp, fx has to create an array of all 1M line in memory. No surprises what it is slow. I think there is not much we can do here.

Using skip without -s

I need to profile what is slower: parsing or JS engine. I suspect it is the JS engine.

Since there's only ~1M rows in this ndjson file, I doubt that running x => x.id == 3260839 is the bottleneck. Maybe it's just the JSON parsing?

I did JSON parsing pretty fast =) During parsing no allocations are made, no objects nor arrays are constructed. But as soon some JS code needs to be executed, objects/arrays have to be constructed from JSON text and passed to JS engine.

I will think of a solution for simple filtering cases like fx -s '?.id == 3260839'. Maybe it will be possible to do some filtering without constructing object and running JS engine at all.

antonmedv avatar Sep 12 '25 09:09 antonmedv

happy to share if it's helpful

Please, it will be useful)

I put a copy at https://storage.googleapis.com/danvk-boggle/results/4x4-full.ndjson.zip (it's ~170MB).

Using fx -s to do the same is ~4x slower

Yes! For -s slurp, fx has to create an array of all 1M line in memory. No surprises what it is slow. I think there is not much we can do here.

I suppose if the first operation is a filter or a map, you could still operate one item at a time, without having to create the full array in memory.

I will think of a solution for simple filtering cases like fx -s '?.id == 3260839'. Maybe it will be possible to do some filtering without constructing object and running JS engine at all.

Yes, that would be helpful. Though remember that == has weird semantics in JS :)

> "123" == 123
true
> "123" === 123
false

danvk avatar Sep 12 '25 12:09 danvk

I suppose if the first operation is a filter or a map, you could still operate one item at a time, without having to create the full array in memory.

Yes! This could be done.

Yes, that would be helpful. Though remember that == has weird semantics in JS :)

I'm thinking of maybe allow for faster iteration if the special syntax with ?.xxx is used.

antonmedv avatar Sep 12 '25 12:09 antonmedv

I tested fx with only JSON parsing and no JS engine. As expected, this is very fast.

Let's add the ability to fx to do a "fast path" and for special case like ?.xx == y bypass JS engine.

antonmedv avatar Sep 12 '25 12:09 antonmedv

Also the filtering without JS will work only for simple cases:

  • ?.var == 42
  • ?.var == "str"
  • ?.var == 'str'
  • ?.value > 0
  • etc.

Other cases may be added as needed.

antonmedv avatar Sep 12 '25 12:09 antonmedv

That would do the trick. I'm just thinking that your fast path will need to replicate JavaScript's implicit coercions to preserve existing behavior, e.g.

$ echo '{"x": 1}\n{"x": "1"}' | fx -s '?.x == 1'
[
  { "x": 1 },
  { "x": "1" }
]
$ echo '{"x": null}\n{"x": "1"}' | fx -s '?.x <= 1'
[
  { "x": null },
  { "x": "1" }
]

danvk avatar Sep 12 '25 18:09 danvk

Hmmm, I guess we can start with just ==. Maybe add .includes() as well.

antonmedv avatar Sep 12 '25 18:09 antonmedv