isolation-forest icon indicating copy to clipboard operation
isolation-forest copied to clipboard

Improve performance for SparseVector

Open eisber opened this issue 4 years ago • 2 comments

eisber avatar Feb 14 '20 13:02 eisber

@eisber: Thanks for creating this PR!

We had a similar discussion internally (DataPoint case class vs. Array[Array[Float]]) back when I was creating the library. We decided on the DataPoint case class as it is more readable and the difference in memory usage was negligible (for our use cases DataPoint case class used ~6% more memory than Array[Array[Float]]).

Are there any data / benchmarks to indicate that there is a major performance improvement using Vector instead of the DataPoint case class?

jverbus avatar Feb 19 '20 07:02 jverbus

I'll try to setup a benchmark. It's not about DataPoint vs Array[Float], it's really about the expansion of SparseVector to dense Array. Imagine you have a SparseVector with 27 features, but only 4 active feature. Since everything is converted to a dense array, allocation is happening for every row processed (apart from the wasted space). Obviously there is a trade-off, as array access is more expensive as SparseVector performs binary search.

eisber avatar Feb 19 '20 07:02 eisber