bug icon indicating copy to clipboard operation
bug copied to clipboard

optimize Vector concatenation (Vector ++ Vector)

Open scabug opened this issue 14 years ago • 17 comments

currently scala.collection.immutable.Vector.++ simply inherits its implementation from a supertrait (through the stubs added in r24227), so if you concatenate two vectors, it doesn't take advantage that that both are the same structure, but just treats the second vector like any traversable.

in theory this could be O(log n), not O(n), yes? Tiark seems to confirm it in this comment on #3724: "Unfortunately, for implementing a fast concat operation (we hope to do that for 2.8.1 or 2.9) heterogenous Arrays seem to be necessary (we'll be storing Int arrays next to the sub nodes). We might rethink this however, and try to stick to homogeneous Arrays."

I just thought there should be enhancement ticket on this as it would still be nice to have someday.

scabug avatar Apr 04 '11 19:04 scabug

Imported From: https://issues.scala-lang.org/browse/SI-4442?orig=1 Reporter: @SethTisue See #3724

scabug avatar Apr 04 '11 19:04 scabug

@TiarkRompf said: I've been having this on my todo list for quite a while now. The problem is that it requires pervasive changes to the current internal structure and those are hard to get right without slowing down other uses of Vector. Performance will most likely be amortized log n, not worst case. It's good to have a ticket for this, though.

scabug avatar Apr 04 '11 20:04 scabug

@SethTisue said: Is Phil working on this? (the latest Scala meeting notes seem to say so)

scabug avatar Sep 20 '11 21:09 scabug

@TiarkRompf said (edited on Mar 2, 2012 9:10:30 PM UTC): Here's a link to the paper: http://infoscience.epfl.ch/record/169879/files/RMTrees.pdf?version=1

Still need to resolve integration with the current Vector implementation. Maybe it's better to have a separate CatenableVector as a first step.

scabug avatar Mar 02 '12 21:03 scabug

@pchiusano said: Related to this, (1 to N).foldLeft(Vector[Int]())(_ :+ _) runs in linear time, but (1 to N).foldLeft(Vector[Int]())((acc,a) => acc ++ List(a)) is quadratic. This is extremely surprising (I spent several hours tracking down a performance bug in scalaz-stream that ended up being caused by this). Since Vector has a constant time snoc operation, it seems like the default implementation of ++ should be repeated calls to :+, which doesn't require any internal changes to Vector.

scabug avatar Jul 29 '13 00:07 scabug

@SethTisue said: Paul: I'm extremely glad you noticed this. I don't think anyone realized the situation was that bad! I opened a separate ticket on it at #7725.

scabug avatar Aug 07 '13 17:08 scabug

@adriaanm said: was this fixed along with SI-7725?

scabug avatar Feb 10 '14 00:02 scabug

@Ichoran said: This was not fixed--this is about preserving structural sharing across the entire tree, while the other was to at least start from the bigger of two collections to get at least some sharing.

scabug avatar Feb 10 '14 01:02 scabug

@SethTisue said: reassigning to "Backlog" on the grounds that the research on how it would even work hasn't been done yet

scabug avatar Feb 13 '14 21:02 scabug

@Blaisorblade said: Shouldn't RRB-Vector be a potential fix for this, according to the ICFP'15 paper describing it?

https://github.com/nicolasstucki/scala-rrb-vector https://infoscience.epfl.ch/record/213452/files/rrbvector.pdf http://dl.acm.org/citation.cfm?id=2784739

scabug avatar Oct 01 '16 12:10 scabug

These are very relevant! paper describing a potentially new data structure for Vectors: https://infoscience.epfl.ch/record/213452/files/rrbvector.pdf implementation: https://github.com/nicolasstucki/scala-rrb-vector

joshlemer avatar Sep 27 '18 17:09 joshlemer

Is this addressed by any chance in the new collections in 2.13? 🙂

V-Lamp avatar Feb 27 '19 17:02 V-Lamp

@V-Lamp the Vectors in 2.13 will not be the new data structure (RRB Vectors), but their concatenation may be improved enormously by https://github.com/scala/scala/pull/7588 which ensures that during concatenation of vectors, the structure of the left Vector is used rather than copying it into a new Vector. This essentially means that Vector concatenation will be constant-time on the size of the left vector (but still O(n) where n = size of right vector).

joshlemer avatar Feb 27 '19 17:02 joshlemer

https://github.com/scala/scala/pull/7588 was reverted due to https://github.com/scala/bug/issues/11636 but the new Vector implementation that I am working on for 2.14 also has this optimization (appending n elements to a Vector of size m is O(n + log(m+n))). I don't know if I'll find the time to implement it but the same is possible for prepending. The combination of the two would allow concatenation depending on the size of the shorter slice.

szeiger avatar Nov 04 '19 17:11 szeiger

Note that Vector was completely redone for 2.13.2: https://github.com/scala/scala/pull/8534

Has anyone studied what the current situation is w/r/t to this ticket?

SethTisue avatar Apr 26 '21 16:04 SethTisue

I see that on https://github.com/scala/scala/pull/8534 @szeiger wrote,

I did not get around to implementing an optimized concatenation where the right-hand side vector determines the alignment. Like most other algorithms for this data structure it is quite easy to visualize how it works and assure yourself that it is possible but actually implementing it correctly is much harder. This would allow concatenation of two vectors of sizes m and n in O(min(n, m) + log(n+m)) time.

So that sounds like we're still where we before: doing concatenation as efficiently as possible remains future work.

SethTisue avatar Apr 26 '21 16:04 SethTisue

For the record: I've implemented concatenation in O(min(n, m) + log(n+m)) time. (If you're really lucky and the Vector structures are perfectly aligned, it's even only O(log(n+m)).) It's not yet PR-ready, but I hope I'll be able to polish it in a month or so when I'm back from vacation.

ansvonwa avatar Aug 19 '22 09:08 ansvonwa