python-sortedcontainers icon indicating copy to clipboard operation
python-sortedcontainers copied to clipboard

unpickling very slow

Open ChrisJefferson opened this issue 1 year ago • 5 comments

Hi,

I have some code where unpickling SortedLists and SortedSets takes a significant amount of the runtime of my program (I have a lot of them!)

The problem is, I think, that as __reduce__ just says that set and key should be passed to init, then the iterable values then get fed into self._update one at a time.

Would it be possible to speed things up, as the iterable is already known to be sorted? I wasn't sure of the best way to do this, one option would be to add a constructor method issorted, but then that might get used by other people (and maybe incorrectly), and you might not want that?

ChrisJefferson avatar Sep 01 '22 11:09 ChrisJefferson

One option (based on my limited Python experience) would be to switch to using setstate and getstate, and then set up the special methods in setstate?

ChrisJefferson avatar Sep 01 '22 12:09 ChrisJefferson

The update method has a fast path for when it’s initially empty.

Please provide a simple benchmark to reproduce the issue.

grantjenks avatar Sep 01 '22 15:09 grantjenks

Here is a simple benchmark (I put a class in just to check if classes aren't always slow). The results are get are:

1.4263453999956255 0.07709820000309264 0.10215669999888632

So it is 20 times slower to unpickle a SortedSet than a set, and 14 times slow to unpickle a set with a single member. However, that might be considered a reasonable overhead -- in my case it is annoying as I'm using pickle to store a very large datastructure, and my program is "unpickle data structure, do some AI with it, close program".

Code:

from sortedcontainers import SortedSet 
import pickle
import timeit

class Stupid:
    def __init__(self,l):
        self.a = list(l)

s1 = list([SortedSet(range(i, i+10)) for i in range(10000)])
s2 = list([set(range(i, i+10)) for i in range(10000)])
s3 = list([Stupid(range(i, i+10)) for i in range(10000)])

p1 = pickle.dumps(s1)
p2 = pickle.dumps(s2)
p3 = pickle.dumps(s3)

def func1():
    pickle.loads(p1)

def func2():
    pickle.loads(p2)

def func3():
    pickle.loads(p3)

time1 = timeit.Timer(func1).timeit(number=10)
time2 = timeit.Timer(func2).timeit(number=10)
time3 = timeit.Timer(func3).timeit(number=10)

print(f"{time1} {time2} {time3}")

ChrisJefferson avatar Sep 01 '22 17:09 ChrisJefferson

Hi,

Thanks a lot for this project!

I face a similar issue when loading a pickled SortedSet. I agree with Chris that the issue is that during loading, it recreates the SortedSet, which is an expensive operation.

Are you open to a PR?

kPsarakis avatar Mar 03 '23 14:03 kPsarakis

I can’t promise it’ll be merged but I’m certainly happy to review a solution.

grantjenks avatar Mar 03 '23 15:03 grantjenks