`RollingQuantile` performance with large windows
Versions
river version: 0.10.1
Python version: 3.10.2
Describe your task
I'm keeping track of a rolling quantile over a large window. Once the RollingQuantile reaches the window size, it removes the oldest value and appends the new one. For 'small' windows (~ < 100k) this is no problem and the code runs really fast, but with larger window sizes (say, 500k), once the window is full there is a significant performance drop when updating the RollingQuantile. I think the issue lies primarily in the underlying utils.SortedWindow.
Furthermore, although RollingQuantile is indeed a streaming algorithm, there are downsides in memory and computational complexity when handling large windows (O(n)). Because the entire window is kept in memory and used to calculate the quantile, it's really accurate. However, maybe we could trade some of this accuracy for performance, e.g. by estimating the quantile just like the regular Quantile class, but windowed (although I currently don't know of a specific algorithm that solves this, there are a few papers out there).
What kind of performance are you expecting?
Until the window is full: 347195 iterations/s
Once the window is full: 279 iterations/s
Steps/code to reproduce
from river.stats import RollingQuantile
from tqdm import tqdm

q = 0.997
rolling_q = RollingQuantile(window_size=500_000, q=q)

# Fast while the window is filling up
for x in tqdm(range(500_000)):
    rolling_q.update(x)

# Slow once the window is full
for x in tqdm(range(500_000)):
    rolling_q.update(x)
Hi @fox-ds, this is indeed problematic from a computational point of view. However, there is not much we can do currently.
IMHO, the main problem is that although multiple quantile estimators enable adding new points, removing old data is problematic. For instance, the P2 algorithm, implemented under stats.Quantile, does not allow "reverting". I am not sure whether the algorithm can be modified to enable that, nor whether other more robust solutions (e.g., t-digest) support such an operation.
In the end, SortedWindow does what the name implies, so every element addition/removal implies keeping the window elements sorted. I would love to see a "reversible" incremental estimator of quantiles, though I have to admit I am a total noob in that matter. Maybe this solution is out there and it is simple.
Edit: as @AdilZouitine pointed out, the major bottleneck comes from handling updates in the sorted list. Even though the element to delete can be found efficiently using binary search (and that already gives us a huge boost in time), removing an element from a list is O(n).
Hi @fox-ds, thank you for pointing out this issue. I think I have a satisfying, though not perfect, solution.
Context:
from river.stats import RollingQuantile
from tqdm import tqdm
q = 0.997
rolling_q = RollingQuantile(window_size=500_000, q=q)
Under the hood, when the sliding window is not full, there is no performance problem.
for x in tqdm(range(500_000)):
    rolling_q.update(x)
23494995 function calls (22994995 primitive calls) in 10.391 seconds
Ordered by: internal time
ncalls tottime percall cumtime percall filename:lineno(function)
8475732 3.845 0.000 4.813 0.000 __init__.py:1094(__getitem__)
500000 2.697 0.000 8.113 0.000 {built-in method _bisect.insort_left}
8476043 0.969 0.000 0.969 0.000 {built-in method builtins.isinstance}
500000 0.617 0.000 9.468 0.000 sorted_window.py:46(append)
1000000 0.462 0.000 0.573 0.000 __init__.py:1093(__len__)
1500112/1000112 0.305 0.000 0.537 0.000 {built-in method builtins.len}
1 0.278 0.278 10.391 10.391 <string>:2(<module>)
500000 0.239 0.000 0.319 0.000 __init__.py:1134(insert)
500000 0.207 0.000 9.676 0.000 quantile.py:236(update)
500000 0.199 0.000 0.199 0.000 sorted_window.py:42(size)
500001 0.194 0.000 0.430 0.000 std.py:1174(__iter__)
835 0.098 0.000 0.098 0.000 {method 'acquire' of '_thread.lock' objects}
415 0.095 0.000 0.095 0.000 socket.py:480(send)
500000 0.081 0.000 0.081 0.000 {method 'insert' of 'list' objects}
500494 0.056 0.000 0.056 0.000 {method 'append' of 'collections.deque' objects}
....
However, when the sliding window is full, the time to remove an item from the sorted list is high. Curious, isn't it?
for x in tqdm(range(100_000)):
    rolling_q.update(x)
5613486 function calls (5513486 primitive calls) in 133.562 seconds
Ordered by: internal time
ncalls tottime percall cumtime percall filename:lineno(function)
100000 97.451 0.001 97.451 0.001 {method 'remove' of 'list' objects}
100000 26.461 0.000 26.461 0.000 {method 'insert' of 'list' objects}
100000 1.731 0.000 99.182 0.001 __init__.py:1136(remove)
5067 1.307 0.000 1.307 0.000 socket.py:480(send)
9906 1.174 0.000 1.174 0.000 {method 'acquire' of '_thread.lock' objects}
100000 1.117 0.000 29.712 0.000 {built-in method _bisect.insort_left}
1895145 1.113 0.000 1.403 0.000 __init__.py:1094(__getitem__)
100000 0.673 0.000 129.876 0.001 sorted_window.py:46(append)
100000 0.506 0.000 26.968 0.000 __init__.py:1134(insert)
1898943 0.291 0.000 0.291 0.000 {built-in method builtins.isinstance}
100001 0.256 0.000 3.318 0.000 std.py:1174(__iter__)
200000 0.255 0.000 0.319 0.000 __init__.py:1093(__len__)
1 0.235 0.235 133.562 133.562 <string>:2(<module>)
...
One of the culprits is there!
https://github.com/online-ml/river/blob/c0c88d38b96c99de3e3646005c3ded1bfee47e47/river/utils/sorted_window.py#L49
self is a list, so the average complexity of deleting an element is O(n). The algorithm is quite naive: loop over the list until you find the value to delete (average complexity O(n)), delete it, and then shift all subsequent elements one place to the left (complexity O(n)). However, we know that self is a sorted list, so we can exploit this a priori and search for the element to delete by dichotomy, reducing the search cost from O(n) to O(log₂(n)).
So I coded a fix that replaces the linear search with a binary search, and below you will find some benchmarks. The next step will be to replace this sorted list with a sorted binary tree.
import bisect
import collections
import math


class SortedWindowFix(collections.UserList):

    def __init__(self, size: int):
        super().__init__()
        self.unsorted_window = collections.deque(maxlen=size)

    @property
    def size(self):
        return self.unsorted_window.maxlen

    def append(self, x):
        if len(self) >= self.size:
            # Locate the oldest value with a binary search (O(log n))
            # instead of list.remove's linear scan
            start_deque = bisect.bisect_left(self, self.unsorted_window[0])
            del self[start_deque]
        bisect.insort_left(self, x)
        self.unsorted_window.append(x)
        return self


class RollingQuantileFix(RollingQuantile):

    def __init__(self, q, window_size):
        self.window = SortedWindowFix(size=window_size)
        self.q = q
        idx = self.q * (self.window.size - 1)
        self._lower = int(math.floor(idx))
        self._higher = self._lower + 1
        if self._higher > self.window.size - 1:
            self._higher = self.window.size - 1
        self._frac = idx - self._lower
rqf = RollingQuantileFix(window_size=500_000, q=q)
for x in tqdm(range(500_000)):
    rqf.update(x)
23500470 function calls (23000470 primitive calls) in 10.968 seconds
Ordered by: internal time
ncalls tottime percall cumtime percall filename:lineno(function)
8475732 4.176 0.000 5.194 0.000 __init__.py:1094(__getitem__)
500000 2.771 0.000 8.598 0.000 {built-in method _bisect.insort_left}
8476047 1.019 0.000 1.019 0.000 {built-in method builtins.isinstance}
500000 0.665 0.000 10.044 0.000 <ipython-input-14-4ef363193f8b>:39(append)
1000000 0.488 0.000 0.609 0.000 __init__.py:1093(__len__)
1500106/1000106 0.335 0.000 0.579 0.000 {built-in method builtins.len}
1 0.254 0.254 10.968 10.968 <string>:2(<module>)
500000 0.245 0.000 0.327 0.000 __init__.py:1134(insert)
500001 0.218 0.000 0.454 0.000 std.py:1174(__iter__)
...
for x in tqdm(range(100_000)):
    rqf.update(x)
9394169 function calls (9294169 primitive calls) in 44.578 seconds
Ordered by: internal time
ncalls tottime percall cumtime percall filename:lineno(function)
100000 23.593 0.000 23.593 0.000 {method 'insert' of 'list' objects}
100000 14.345 0.000 14.345 0.000 __init__.py:1100(__delitem__)
3790291 1.913 0.000 2.382 0.000 __init__.py:1094(__getitem__)
100000 0.839 0.000 26.193 0.000 {built-in method _bisect.insort_left}
100000 0.633 0.000 1.867 0.000 {built-in method _bisect.bisect_left}
100000 0.537 0.000 43.203 0.000 <ipython-input-14-4ef363193f8b>:39(append)
1767 0.476 0.000 0.476 0.000 socket.py:480(send)
3791614 0.469 0.000 0.469 0.000 {built-in method builtins.isinstance}
100000 0.413 0.000 24.007 0.000 __init__.py:1134(insert)
3325 0.358 0.000 0.358 0.000 {method 'acquire' of '_thread.lock' objects}
300000 0.229 0.000 0.284 0.000 __init__.py:1093(__len__)
..
Tadaaaa, a 3x speed boost!
If the sorted binary tree turns out to be better, I will make another comment.
Hey @AdilZouitine, would it be possible to keep a full-size binary tree in memory, even if the window is not full? Then instead of removing elements, we could mark them for removal. The tree would need to be rebalanced/rebuilt from time to time, which might become a problem.
I am not so sure about the rebuild/rebalancing frequency and whether or not this would become an extra parameter (which is not a good thing, IMHO), but we could think about something like that for the future.
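For what it's worth, the "mark for removal" idea can be sketched without an actual tree, using a heap with lazy deletion (a generic pattern, not river code; all names here are hypothetical): removed values are only recorded in a counter and are actually discarded when they surface at the top of the heap, so no eager O(n) removal is ever needed.

```python
# Sketch of the "mark for removal" idea with a lazily-purged heap
# (hypothetical, not river code). Deletions are O(1) amortized because
# stale entries are only popped when they reach the top.
import collections
import heapq


class LazyMinHeap:
    def __init__(self):
        self._heap = []
        self._stale = collections.Counter()

    def push(self, x):
        heapq.heappush(self._heap, x)

    def mark_removed(self, x):
        # Don't touch the heap now; remember that x must be skipped later.
        self._stale[x] += 1

    def min(self):
        # Purge stale entries lazily when they surface at the top.
        while self._heap and self._stale[self._heap[0]] > 0:
            self._stale[heapq.heappop(self._heap)] -= 1
        return self._heap[0]


h = LazyMinHeap()
for x in [5, 1, 3, 2]:
    h.push(x)
h.mark_removed(1)
print(h.min())  # → 2
```

The rebuild frequency question remains: the heap (or tree) can grow with the number of pending deletions, so a periodic compaction would still be needed to bound memory.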
Thanks for the suggestions all, the 3x speed boost already helps a lot! @smastelini @AdilZouitine I think that keeping a binary tree in memory would indeed be the ideal solution; perhaps we could use the heapq library to design such a class. I found this blog post that tries to achieve something similar (albeit for the median only), but maybe we could use it for inspiration. I've also started looking into this issue myself and may come up with a solution later. Again, thanks a lot for the help!
I found this blog post which is similar -- it's based on heaps. But it looks complex, so someone needs to dive into it.
Thanks for the link @MaxHalford. This past week I worked on what at first glance looks like a similar solution: using two heaps that are 'balanced' around the quantile you want to estimate. I managed to get it running really efficiently with window_size=500_000, but (in my implementation) the memory usage was still quite high. So I'll have a look at this blog post as well.
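For readers following along, the two-heap balancing idea can be sketched roughly as below, here for the median (q=0.5) and without window eviction, which would additionally need something like lazy deletion. This is an illustration of the invariant, not @fox-ds's actual implementation:

```python
# Two-heap streaming median sketch (illustrative, not @fox-ds's code):
# a max-heap holds the smaller half of the stream, a min-heap the larger
# half, and rebalancing keeps their sizes within one of each other.
import heapq


class StreamingMedian:
    def __init__(self):
        self._lower = []  # max-heap (values negated) for the smaller half
        self._upper = []  # min-heap for the larger half

    def update(self, x):
        # Route x to the correct half...
        if self._lower and x > -self._lower[0]:
            heapq.heappush(self._upper, x)
        else:
            heapq.heappush(self._lower, -x)
        # ...then rebalance so the halves differ in size by at most one.
        if len(self._lower) > len(self._upper) + 1:
            heapq.heappush(self._upper, -heapq.heappop(self._lower))
        elif len(self._upper) > len(self._lower):
            heapq.heappush(self._lower, -heapq.heappop(self._upper))

    def get(self):
        if len(self._lower) > len(self._upper):
            return -self._lower[0]
        return (-self._lower[0] + self._upper[0]) / 2


m = StreamingMedian()
for x in [5, 1, 9, 3, 7]:
    m.update(x)
print(m.get())  # → 5
```

Generalizing to an arbitrary quantile q means keeping the split point at roughly a q:(1-q) size ratio instead of 50:50, which is presumably where the memory overhead @fox-ds mentions comes in once a full window has to be retained for eviction.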
Interesting @fox-ds! Do you have some comparison figures to share?
Not yet, but I want to continue with it this week and probably share some code/figures once I'm satisfied with the results.
Hey @fox-ds and @MaxHalford, I've read both blog posts and I really like those ideas! The more I learn, the more I realize I know nothing about anything.
The reference C# implementation @MaxHalford sent also generalizes to arbitrary quantiles :D
I did continue working on this a few weeks ago, but I still need some more time to finalize the results.
We've just released version 0.13 which includes faster implementations of quantile algorithms, thanks to @AdilZouitine. I think we can now close this issue.
Nice!!!