AoC: Repeated list concatenation `l ++ [x]` has performance O(n²)
I ran into this during part 1 of the dec 9 Advent of Code.
This is a well known problem in Haskell Avoiding ++ in Haskell or Why are difference lists more efficient than regular concatenation in Haskell?.
Here is a benchmark that compares the performance of appending single elements to the beginning vs. to the end of a list
_ := for sz := u64 1024, sz |> *2 do
test(op (Sequence i32)->(Sequence i32)) =>
start := time.nano.read
build(s Sequence i32, n u64) Sequence i32 => n = 0 ? s : op (build s n-1)
l := build [] sz
cnt := l.count
delta := time.nano.read-start
(cnt, delta)
cnt1, delta1 := test ([1]++)
cnt2, delta2 := test (++[1])
check
sz.as_i32 = cnt1 = cnt2
out(d) => "{d} {time.duration.ns d.nanos/sz}/it {time.duration.ns d.nanos/sz/sz}/it²"
say "$sz:\t [1]++l: {out delta1} | l++[1]: {out delta2} factor: {delta2.nanos/delta1.nanos}"
until (max delta1 delta2 >= time.duration.seconds 5)
The performance of the former is linear, that of the latter quadratic:
> ~/fuzion/work/build/bin/fz test_perf.fz
1024: [1]++l: 24ms 24µs/it 23ns/it² | l++[1]: 46ms 45µs/it 44ns/it² factor: 1
2048: [1]++l: 2081µs 1016ns/it 0ns/it² | l++[1]: 77ms 38µs/it 18ns/it² factor: 37
4096: [1]++l: 4297µs 1049ns/it 0ns/it² | l++[1]: 184ms 45µs/it 11ns/it² factor: 42
8192: [1]++l: 3284µs 400ns/it 0ns/it² | l++[1]: 821ms 100µs/it 12ns/it² factor: 250
16384: [1]++l: 1781µs 108ns/it 0ns/it² | l++[1]: 2246ms 137µs/it 8ns/it² factor: 1260
32768: [1]++l: 3058µs 93ns/it 0ns/it² | l++[1]: 8150ms 248µs/it 7ns/it² factor: 2664
There is probably no easy solution to this, but it should be ensured that this problem is well documented and that alternatives are explained. What are the standard alternatives suggested in other languages like Haskell? One solution would be to replace appending at the end by appending at the beginning followed by a call to Sequence.reverse.
This problem exists in Java as well: building a string as follows
var s := ""
for(i=0, i<1000, i++)
s := s + i + " "
also has performance in O(n²). The Java solution is StringBuilder and, for lists, ArrayList. So one solution for Fuzion would be to offer some string_builder and sequence_builder features. The tricky part is that this will require either a local mutate instance or some dedicated global builder mutate that is used only in builders. It would be nice to implement it in a way that handling of this mutate instance is transparent, e.g. (with implicit sequence as result)
l := (Sequence i32).build b->
build(n u64) => if n > 0 then b.yield 1; build s n-1
build sz
or (with explicit result created via b.finish
l := (Sequence i32).build b->
build(n u64) => n=0 ? b.finish : {b.yield 1; build s n-1}
build sz
It is important that these builders can be nested without interfering with one another as in
(even,odd) :=
(Sequence u64).build e->
(Sequence u64).build o->
build(n u64) => n=0 ? (e.finish,o.finish)
: {(n%%2 ? e : o).yield n; build s n-1}
build sz
Personally I would prefer to not just document this. I have a hard time imagining that people will not stumble over this again and again.
Random thoughts:
- would be cool to give a guarantee regarding run-time complexity of base.fum to never be O(n²) or worse.
- or to require some effect to enable features with worse runtime complexity
Just came across this post mentioning immutable dequeue. This might provide a solution if a ++ b would, for finite sequences, make sure to first convert the longer sequence into into an immutable dequeue and then push the elements of the shorter sequence to that one. Tricky part might be how to save the elements efficiently, looks as if my ps_array could help here...
This is not fixed yet. #4651 provides some infrastructure that maybe provides a solution, but I would like the example to run fast or the user to be guided such that this performance issue is best avoided.
I get a stackoverflow when running this.
make: Nothing to be done for 'no-java'.
1024: [1]++l: 118ms 115µs/it 112ns/it² | l++[1]: 40ms 39µs/it 38ns/it² factor: 0
2048: [1]++l: 3963µs 1935ns/it 0ns/it² | l++[1]: 13ms 6495ns/it 3ns/it² factor: 3
4096: [1]++l: 5114µs 1248ns/it 0ns/it² | l++[1]: 17ms 4254ns/it 1ns/it² factor: 3
8192: [1]++l: 10ms 1288ns/it 0ns/it² | l++[1]: 32ms 4026ns/it 0ns/it² factor: 3
16384: [1]++l: 59ms 3655ns/it 0ns/it² | l++[1]: 48ms 2946ns/it 0ns/it² factor: 0
error 1: java.lang.StackOverflowError
at fzC_1INTERN_loop0__1test__2build.fzRoutine(/home/sam/playground/test.fz:73)
at fzC_1INTERN_loop0__1test__2build.fzRoutine(/home/sam/playground/test.fz:73)
...
works when using loop instead of recursion in the build feature:
_ := for sz := u64 1024, sz |> *2 do
test(op (Sequence i32)->(Sequence i32)) =>
start := time.nano.read
build(s Sequence i32, n u64) Sequence i32 =>
for cn := n, cn-1
res := s, op res
while cn != 0
else
res
l := build [] sz
cnt := l.count
delta := time.nano.read-start
(cnt, delta)
cnt1, delta1 := test ([1]++)
cnt2, delta2 := test (++[1])
check
sz.as_i32 = cnt1 = cnt2
out(d) => "{d} {time.duration.ns d.nanos/sz}/it {time.duration.ns d.nanos/sz/sz}/it²"
say "$sz:\t [1]++l: {out delta1} | l++[1]: {out delta2} factor: {delta2.nanos/delta1.nanos}"
until (max delta1 delta2 >= time.duration.seconds 5)
``