python-sortedcontainers
python-sortedcontainers copied to clipboard
unpickling very slow
Hi,
I have some code where unpickling SortedLists and SortedSets takes a significant amount of the runtime of my program (I have a lot of them!)
The problem is, I think, that as __reduce__
just says that set
and key
should be passed to init
, then the iterable values then get fed into self._update
one at a time.
Would it be possible to speed things up, as the iterable is already known to be sorted? I wasn't sure of the best way to do this, one option would be to add a constructor method issorted
, but then that might get used by other people (and maybe incorrectly), and you might not want that?
One option (based on my limited Python experience) would be to switch to using setstate and getstate, and then set up the special methods in setstate?
The update method has a fast path for when it’s initially empty.
Please provide a simple benchmark to reproduce the issue.
Here is a simple benchmark (I put a class in just to check if classes aren't always slow). The results are get are:
1.4263453999956255 0.07709820000309264 0.10215669999888632
So it is 20 times slower to unpickle a SortedSet than a set, and 14 times slow to unpickle a set with a single member. However, that might be considered a reasonable overhead -- in my case it is annoying as I'm using pickle to store a very large datastructure, and my program is "unpickle data structure, do some AI with it, close program".
Code:
from sortedcontainers import SortedSet
import pickle
import timeit
class Stupid:
def __init__(self,l):
self.a = list(l)
s1 = list([SortedSet(range(i, i+10)) for i in range(10000)])
s2 = list([set(range(i, i+10)) for i in range(10000)])
s3 = list([Stupid(range(i, i+10)) for i in range(10000)])
p1 = pickle.dumps(s1)
p2 = pickle.dumps(s2)
p3 = pickle.dumps(s3)
def func1():
pickle.loads(p1)
def func2():
pickle.loads(p2)
def func3():
pickle.loads(p3)
time1 = timeit.Timer(func1).timeit(number=10)
time2 = timeit.Timer(func2).timeit(number=10)
time3 = timeit.Timer(func3).timeit(number=10)
print(f"{time1} {time2} {time3}")
Hi,
Thanks a lot for this project!
I face a similar issue when loading a pickled SortedSet
. I agree with Chris that the issue is that during loading, it recreates the SortedSet, which is an expensive operation.
Are you open to a PR?
I can’t promise it’ll be merged but I’m certainly happy to review a solution.