`RollingQuantile` performance with large windows
Versions
river version: 0.10.1
Python version: 3.10.2
Describe your task
I'm keeping track of a rolling quantile over a large window. Once the RollingQuantile reaches the window size, it removes the oldest value and appends the new one. For 'small' windows (~ < 100k) this is no problem and the code runs really fast, but with larger window sizes (say, 500k), once the window is full there is a significant performance drop when updating the RollingQuantile. I think the issue lies primarily in the underlying utils.SortedWindow.
Furthermore, although RollingQuantile is indeed a streaming algorithm, there are downsides in memory and computational complexity when handling large windows (O(n)). Because the entire window is kept in memory and used to calculate the quantile, it's really accurate. However, maybe we could trade some of this accuracy for performance, e.g. by estimating the quantile just like the regular Quantile class, but windowed (although I currently don't know of a specific algorithm that solves this, there are a few papers out there).
What kind of performance are you expecting?
Until the window is full: 347195 iterations/s
Once the window is full: 279 iterations/s
Steps/code to reproduce
from river.stats import RollingQuantile
from tqdm import tqdm

q = 0.997
rolling_q = RollingQuantile(window_size=500_000, q=q)

# Fast while the window is filling up
for x in tqdm(range(500_000)):
    rolling_q.update(x)

# Slow once the window is full
for x in tqdm(range(500_000)):
    rolling_q.update(x)
Hi @fox-ds, this is indeed problematic from a computational point of view. However, there is not much we can do currently.
IMHO, the main problem is that although multiple quantile estimators enable adding new points, removing old data is problematic. For instance, the P2 algorithm, implemented under stats.Quantile, does not allow "reverting". I am not sure whether the algorithm can be modified to enable that, nor whether other more robust solutions (e.g., t-digest) support such an operation.
In the end, SortedWindow does what the name implies, so every element addition/removal implies keeping the window elements sorted. I would love to see a "reversible" incremental estimator of quantiles, though I have to admit I am a total noob in that matter. Maybe this solution is out there and it is simple.
Edit: as @AdilZouitine pointed out, the major bottleneck comes from handling updates in the sorted list. Even though the element to delete can be found efficiently using binary search (and that already gives us a huge boost in time), removing an element from a list is O(n).
Hi @fox-ds, thank you for pointing out this issue. I think I have a satisfying, though not perfect, solution.
Context:
from river.stats import RollingQuantile
from tqdm import tqdm
q = 0.997
rolling_q = RollingQuantile(window_size=500_000, q=q)
Under the hood, when the sliding window is not full, there is no performance problem.
for x in tqdm(range(500_000)):
    rolling_q.update(x)
23494995 function calls (22994995 primitive calls) in 10.391 seconds
Ordered by: internal time
ncalls tottime percall cumtime percall filename:lineno(function)
8475732 3.845 0.000 4.813 0.000 __init__.py:1094(__getitem__)
500000 2.697 0.000 8.113 0.000 {built-in method _bisect.insort_left}
8476043 0.969 0.000 0.969 0.000 {built-in method builtins.isinstance}
500000 0.617 0.000 9.468 0.000 sorted_window.py:46(append)
1000000 0.462 0.000 0.573 0.000 __init__.py:1093(__len__)
1500112/1000112 0.305 0.000 0.537 0.000 {built-in method builtins.len}
1 0.278 0.278 10.391 10.391 <string>:2(<module>)
500000 0.239 0.000 0.319 0.000 __init__.py:1134(insert)
500000 0.207 0.000 9.676 0.000 quantile.py:236(update)
500000 0.199 0.000 0.199 0.000 sorted_window.py:42(size)
500001 0.194 0.000 0.430 0.000 std.py:1174(__iter__)
835 0.098 0.000 0.098 0.000 {method 'acquire' of '_thread.lock' objects}
415 0.095 0.000 0.095 0.000 socket.py:480(send)
500000 0.081 0.000 0.081 0.000 {method 'insert' of 'list' objects}
500494 0.056 0.000 0.056 0.000 {method 'append' of 'collections.deque' objects}
....
However, when the sliding window is full, the time to remove an item from the sorted list is high. Curious, isn't it?
for x in tqdm(range(100_000)):
    rolling_q.update(x)
5613486 function calls (5513486 primitive calls) in 133.562 seconds
Ordered by: internal time
ncalls tottime percall cumtime percall filename:lineno(function)
100000 97.451 0.001 97.451 0.001 {method 'remove' of 'list' objects}
100000 26.461 0.000 26.461 0.000 {method 'insert' of 'list' objects}
100000 1.731 0.000 99.182 0.001 __init__.py:1136(remove)
5067 1.307 0.000 1.307 0.000 socket.py:480(send)
9906 1.174 0.000 1.174 0.000 {method 'acquire' of '_thread.lock' objects}
100000 1.117 0.000 29.712 0.000 {built-in method _bisect.insort_left}
1895145 1.113 0.000 1.403 0.000 __init__.py:1094(__getitem__)
100000 0.673 0.000 129.876 0.001 sorted_window.py:46(append)
100000 0.506 0.000 26.968 0.000 __init__.py:1134(insert)
1898943 0.291 0.000 0.291 0.000 {built-in method builtins.isinstance}
100001 0.256 0.000 3.318 0.000 std.py:1174(__iter__)
200000 0.255 0.000 0.319 0.000 __init__.py:1093(__len__)
1 0.235 0.235 133.562 133.562 <string>:2(<module>)
...
One of the culprits is there!
https://github.com/online-ml/river/blob/c0c88d38b96c99de3e3646005c3ded1bfee47e47/river/utils/sorted_window.py#L49
self is a list, so the average complexity of deleting an element is O(n). The algorithm is quite naive: loop over the list until you find the value to delete (average complexity O(n)), delete it, and then shift all subsequent elements one place to the left (complexity O(n)). However, we know that self is a sorted list, so we can exploit this a priori and search for the element to delete by dichotomy, reducing the search cost from O(n) to O(log₂(n)).
So I coded a fix that replaces the linear search with a binary search, and below you will find some benchmarks. The next step will be to replace this sorted list with a sorted binary tree.
import bisect
import collections
import math


class SortedWindowFix(collections.UserList):

    def __init__(self, size: int):
        super().__init__()
        self.unsorted_window = collections.deque(maxlen=size)

    @property
    def size(self):
        return self.unsorted_window.maxlen

    def append(self, x):
        if len(self) >= self.size:
            # Locate the oldest value with a binary search (O(log n))
            # instead of list.remove's linear scan
            start_deque = bisect.bisect_left(self, self.unsorted_window[0])
            del self[start_deque]
        bisect.insort_left(self, x)
        self.unsorted_window.append(x)
        return self


class RollingQuantileFix(RollingQuantile):

    def __init__(self, q, window_size):
        self.window = SortedWindowFix(size=window_size)
        self.q = q
        idx = self.q * (self.window.size - 1)
        self._lower = int(math.floor(idx))
        self._higher = self._lower + 1
        if self._higher > self.window.size - 1:
            self._higher = self.window.size - 1
        self._frac = idx - self._lower
rqf = RollingQuantileFix(window_size=500_000, q=q)
for x in tqdm(range(500_000)):
    rqf.update(x)
23500470 function calls (23000470 primitive calls) in 10.968 seconds
Ordered by: internal time
ncalls tottime percall cumtime percall filename:lineno(function)
8475732 4.176 0.000 5.194 0.000 __init__.py:1094(__getitem__)
500000 2.771 0.000 8.598 0.000 {built-in method _bisect.insort_left}
8476047 1.019 0.000 1.019 0.000 {built-in method builtins.isinstance}
500000 0.665 0.000 10.044 0.000 <ipython-input-14-4ef363193f8b>:39(append)
1000000 0.488 0.000 0.609 0.000 __init__.py:1093(__len__)
1500106/1000106 0.335 0.000 0.579 0.000 {built-in method builtins.len}
1 0.254 0.254 10.968 10.968 <string>:2(<module>)
500000 0.245 0.000 0.327 0.000 __init__.py:1134(insert)
500001 0.218 0.000 0.454 0.000 std.py:1174(__iter__)
...
for x in tqdm(range(100_000)):
    rqf.update(x)
9394169 function calls (9294169 primitive calls) in 44.578 seconds
Ordered by: internal time
ncalls tottime percall cumtime percall filename:lineno(function)
100000 23.593 0.000 23.593 0.000 {method 'insert' of 'list' objects}
100000 14.345 0.000 14.345 0.000 __init__.py:1100(__delitem__)
3790291 1.913 0.000 2.382 0.000 __init__.py:1094(__getitem__)
100000 0.839 0.000 26.193 0.000 {built-in method _bisect.insort_left}
100000 0.633 0.000 1.867 0.000 {built-in method _bisect.bisect_left}
100000 0.537 0.000 43.203 0.000 <ipython-input-14-4ef363193f8b>:39(append)
1767 0.476 0.000 0.476 0.000 socket.py:480(send)
3791614 0.469 0.000 0.469 0.000 {built-in method builtins.isinstance}
100000 0.413 0.000 24.007 0.000 __init__.py:1134(insert)
3325 0.358 0.000 0.358 0.000 {method 'acquire' of '_thread.lock' objects}
300000 0.229 0.000 0.284 0.000 __init__.py:1093(__len__)
..
Tadaaaa, a 3x speed boost!
If the sorted binary tree turns out to be better, I will make another comment.
Hey @AdilZouitine, would it be possible to keep a full-size binary tree in memory, even if the window is not full? Then instead of removing elements, we could mark them for removal. The tree would need to be rebalanced/rebuilt from time to time, which might become a problem.
I am not so sure about the rebuild/rebalancing frequency and whether or not this would become an extra parameter (which is not a good thing, IMHO), but we could think about something like that for the future.
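For what it's worth, the "mark for removal" idea can be sketched without an actual tree, using a heap with lazy deletion (a generic pattern, not river code; all names here are hypothetical): removed values are only recorded in a counter and are actually discarded when they surface at the top of the heap, so no eager O(n) removal is ever needed.

```python
# Sketch of the "mark for removal" idea with a lazily-purged heap
# (hypothetical, not river code). Deletions are O(1) amortized because
# stale entries are only popped when they reach the top.
import collections
import heapq


class LazyMinHeap:
    def __init__(self):
        self._heap = []
        self._stale = collections.Counter()

    def push(self, x):
        heapq.heappush(self._heap, x)

    def mark_removed(self, x):
        # Don't touch the heap now; remember that x must be skipped later.
        self._stale[x] += 1

    def min(self):
        # Purge stale entries lazily when they surface at the top.
        while self._heap and self._stale[self._heap[0]] > 0:
            self._stale[heapq.heappop(self._heap)] -= 1
        return self._heap[0]


h = LazyMinHeap()
for x in [5, 1, 3, 2]:
    h.push(x)
h.mark_removed(1)
print(h.min())  # → 2
```

The rebuild frequency question remains: the heap (or tree) can grow with the number of pending deletions, so a periodic compaction would still be needed to bound memory.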
Thanks for the suggestions all, the 3x speed boost already helps a lot! @smastelini @AdilZouitine I think that keeping a binary tree in memory would indeed be the ideal solution; perhaps we could use the heapq library to design such a class. I found this blog post that tries to achieve something similar (albeit for the median only), but maybe we could use it for inspiration. I've also started looking into this issue myself and may come up with a solution later. Again, thanks a lot for the help!
I found this blog post which is similar -- it's based on heaps. But it looks complex, so someone needs to dive into it.
Thanks for the link @MaxHalford. This past week I worked on what at first glance looks like a similar solution: using two heaps that are 'balanced' around the quantile you want to estimate. I managed to get it running really efficiently with window_size=500_000, but (in my implementation) the memory usage was still quite high. So I'll have a look at this blog post as well.
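For readers following along, the two-heap balancing idea can be sketched roughly as below, here for the median (q=0.5) and without window eviction, which would additionally need something like lazy deletion. This is an illustration of the invariant, not @fox-ds's actual implementation:

```python
# Two-heap streaming median sketch (illustrative, not @fox-ds's code):
# a max-heap holds the smaller half of the stream, a min-heap the larger
# half, and rebalancing keeps their sizes within one of each other.
import heapq


class StreamingMedian:
    def __init__(self):
        self._lower = []  # max-heap (values negated) for the smaller half
        self._upper = []  # min-heap for the larger half

    def update(self, x):
        # Route x to the correct half...
        if self._lower and x > -self._lower[0]:
            heapq.heappush(self._upper, x)
        else:
            heapq.heappush(self._lower, -x)
        # ...then rebalance so the halves differ in size by at most one.
        if len(self._lower) > len(self._upper) + 1:
            heapq.heappush(self._upper, -heapq.heappop(self._lower))
        elif len(self._upper) > len(self._lower):
            heapq.heappush(self._lower, -heapq.heappop(self._upper))

    def get(self):
        if len(self._lower) > len(self._upper):
            return -self._lower[0]
        return (-self._lower[0] + self._upper[0]) / 2


m = StreamingMedian()
for x in [5, 1, 9, 3, 7]:
    m.update(x)
print(m.get())  # → 5
```

Generalizing to an arbitrary quantile q means keeping the split point at roughly a q:(1-q) size ratio instead of 50:50, which is presumably where the memory overhead @fox-ds mentions comes in once a full window has to be retained for eviction.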
Interesting @fox-ds! Do you have some comparison figures to share?
Not yet, but I want to continue with it this week and probably share some code/figures once I'm satisfied with the results.
Hey @fox-ds and @MaxHalford, I've read both blog posts and I really like those ideas! The more I learn, the more I realize I know nothing about anything.
The reference C# implementation @MaxHalford sent also generalizes to arbitrary quantiles :D
I did continue working on this a few weeks ago, but I still need some more time to finalize the results.
We've just released version 0.13 which includes faster implementations of quantile algorithms, thanks to @AdilZouitine. I think we can now close this issue.
Nice!!!