Quadratic complexity for adding binary data
What version are you using (fq -v)?
$ fq -v
0.15.1 (linux amd64 go1.24.5)
How was fq installed?
From Arch Linux repository:
pacman -S fq
What did you do?
I tried to find out the complexity of the addition operation for binary data in fq.
What result did you expect?
I made a little jq prototype myself that has binary data support. Here, adding binary data takes linear time:
$ hyperfine -M 2 -L n 10000,100000,1000000 "cargo run --release -p jaq -- -n 'add(limit({n}; repeat(0 | tobytes))) | empty'"
Benchmark 1: cargo run --release -p jaq -- -n 'add(limit(10000; repeat(0 | tobytes))) | empty'
Time (mean ± σ): 193.1 ms ± 155.1 ms [User: 65.7 ms, System: 25.0 ms]
Range (min … max): 83.4 ms … 302.8 ms 2 runs
Benchmark 2: cargo run --release -p jaq -- -n 'add(limit(100000; repeat(0 | tobytes))) | empty'
Time (mean ± σ): 173.6 ms ± 54.4 ms [User: 108.2 ms, System: 31.1 ms]
Range (min … max): 135.1 ms … 212.0 ms 2 runs
Benchmark 3: cargo run --release -p jaq -- -n 'add(limit(1000000; repeat(0 | tobytes))) | empty'
Time (mean ± σ): 661.6 ms ± 0.0 ms [User: 627.4 ms, System: 32.7 ms]
Range (min … max): 661.5 ms … 661.6 ms 2 runs
What did you see instead?
In fq, the complexity seems to be at least quadratic. This can be seen between the second and third benchmarks: multiplying the number of input elements by 10 multiplies the runtime by more than 40. On the last benchmark, fq is also more than 100 times slower than my prototype.
$ hyperfine -M 2 -L n 10000,100000,1000000 "fq -n 'add(limit({n}; repeat(0 | tobytes))) | empty'"
Benchmark 1: fq -n 'add(limit(10000; repeat(0 | tobytes))) | empty'
Time (mean ± σ): 146.1 ms ± 0.9 ms [User: 298.1 ms, System: 19.9 ms]
Range (min … max): 145.4 ms … 146.7 ms 2 runs
Benchmark 2: fq -n 'add(limit(100000; repeat(0 | tobytes))) | empty'
Time (mean ± σ): 1.972 s ± 0.001 s [User: 7.597 s, System: 0.162 s]
Range (min … max): 1.971 s … 1.973 s 2 runs
Benchmark 3: fq -n 'add(limit(1000000; repeat(0 | tobytes))) | empty'
Time (mean ± σ): 86.166 s ± 0.195 s [User: 520.231 s, System: 2.294 s]
Range (min … max): 86.028 s … 86.304 s 2 runs
My guess is that when fq evaluates l + r for binary data l, r, it creates a new copy l' = l in memory distinct from l, so folding n values with add copies a quadratic number of bytes in total.
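To illustrate the guess, here is a minimal Rust sketch (not fq's actual code; both function names are made up): if every addition copies its left operand into a fresh buffer, n one-byte additions copy 1 + 2 + … + (n - 1) bytes, i.e. O(n²); appending in place when the left operand is exclusively owned is amortized O(n) overall.

```rust
// Hypothetical sketch: copy-on-concat vs. in-place append.

// Allocates a fresh buffer and copies both operands. Folding n
// one-byte values this way copies O(n^2) bytes in total.
fn add_copy(l: &[u8], r: &[u8]) -> Vec<u8> {
    let mut out = Vec::with_capacity(l.len() + r.len());
    out.extend_from_slice(l);
    out.extend_from_slice(r);
    out
}

// Reuses the left buffer when the caller owns it exclusively.
// Vec grows its backing storage geometrically, so folding n
// one-byte values takes amortized O(n) in total.
fn add_in_place(mut l: Vec<u8>, r: &[u8]) -> Vec<u8> {
    l.extend_from_slice(r);
    l
}

fn main() {
    assert_eq!(add_copy(b"ab", b"c"), b"abc".to_vec());
    assert_eq!(add_in_place(b"ab".to_vec(), b"c"), b"abc".to_vec());
}
```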
jq avoids this, e.g. for strings, by reusing l when the current position is the only place where l is referenced:
$ hyperfine -M 2 -L n 10000,100000,1000000 "jq -n 'add(limit({n}; repeat([0] | implode))) | empty'"
Benchmark 1: jq -n 'add(limit(10000; repeat([0] | implode))) | empty'
Time (mean ± σ): 10.1 ms ± 0.3 ms [User: 9.7 ms, System: 0.2 ms]
Range (min … max): 9.9 ms … 10.3 ms 2 runs
Benchmark 2: jq -n 'add(limit(100000; repeat([0] | implode))) | empty'
Time (mean ± σ): 83.2 ms ± 1.1 ms [User: 82.0 ms, System: 1.1 ms]
Range (min … max): 82.5 ms … 84.0 ms 2 runs
Benchmark 3: jq -n 'add(limit(1000000; repeat([0] | implode))) | empty'
Time (mean ± σ): 793.8 ms ± 3.3 ms [User: 786.3 ms, System: 4.2 ms]
Range (min … max): 791.5 ms … 796.2 ms 2 runs
Summary
jq -n 'add(limit(10000; repeat([0] | implode))) | empty' ran
8.22 ± 0.26 times faster than jq -n 'add(limit(100000; repeat([0] | implode))) | empty'
78.44 ± 2.27 times faster than jq -n 'add(limit(1000000; repeat([0] | implode))) | empty'
Here, we see linear runtime as well. The same string benchmark also runs in linear time in fq, so perhaps this technique could be applied to binary data in fq as well.
Thanks for the report! Yeah, I think the analysis is correct. I had a quick look some weeks ago but forgot to comment. I think the conclusion was that this would probably require some changes to gojq to be more efficient. I'm also not sure how it would work in gojq, as there is no explicit reference count. I guess in jaq you check that the count is 1? But maybe there are other tricks.
i guess in jaq you check that the count is 1 etc?
That happens under the hood, yes: the code in question calls Rc::make_mut, which returns a mutable reference to the underlying object if we hold the only reference to it, and otherwise clones the object first. Unfortunately, I also do not know whether this kind of optimisation is possible in Go ...
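For reference, the Rc::make_mut behaviour described above can be seen in a minimal standalone example (not jaq's actual code): with a unique handle, the underlying value is mutated in place; once a second Rc handle exists, make_mut clones first, so the other handle is unaffected.

```rust
use std::rc::Rc;

fn main() {
    let mut a = Rc::new(vec![0u8; 4]);

    // Unique reference: make_mut hands out the existing Vec, no copy.
    Rc::make_mut(&mut a).push(1);
    assert_eq!(a.len(), 5);

    // Shared reference: make_mut clones the Vec for `a` first,
    // leaving `b` pointing at the old, unmodified allocation.
    let b = Rc::clone(&a);
    Rc::make_mut(&mut a).push(2);
    assert_eq!(a.len(), 6);
    assert_eq!(b.len(), 5);
}
```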