redbaron
Inserting a large block is very slow
Inserting a large block into a ProxyList can be very slow because the _synchronise() method is called after every single insertion.
I propose an insert_multiple() method that performs several insertions at once: insert_multiple.patch.txt
This patch can easily give an order-of-magnitude speedup for use cases, such as loop unrolling, that involve a large number of insert operations.
Wassup @sdh4, does it take input as a string to convert to nodes?
PS: please post the patch inline with code formatting.
--- base_nodes.py.orig 2017-01-02 15:59:58.000000000 -0600
+++ base_nodes.py 2017-11-14 14:16:19.626943779 -0600
@@ -1375,6 +1375,16 @@
         self.data.insert(index, [value, None])
         self._synchronise()
+    # sdh4 11/14/17 add insert_multiple
+    def insert_multiple(self, index, values):
+        """Insert multiple values from an iterable into the ProxyList
+        at the specified index, without resynchronising after each one."""
+        for value in values:
+            value = self._convert_input_to_node_object(value, parent=self.node_list, on_attribute=self.on_attribute)
+            self.data.insert(index, [value, None])
+            index += 1
+        self._synchronise()
+
     def append(self, value):
         self.insert(len(self), value)
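To illustrate the intended API shape, here is a minimal sketch against a toy stand-in rather than RedBaron's real classes (`ToyProxyList` and its `sync_count` counter are hypothetical; the real `_synchronise()` rebuilds the inner and node lists):

```python
class ToyProxyList:
    """Minimal stand-in for RedBaron's ProxyList, for illustration only."""

    def __init__(self):
        self.data = []
        self.sync_count = 0  # how many times _synchronise() has run

    def _synchronise(self):
        # The real ProxyList rebuilds its inner/node lists here; we just count.
        self.sync_count += 1

    def insert(self, index, value):
        self.data.insert(index, value)
        self._synchronise()

    def insert_multiple(self, index, values):
        # Insert every value first, then resynchronise exactly once.
        for value in values:
            self.data.insert(index, value)
            index += 1
        self._synchronise()


plist = ToyProxyList()
plist.insert_multiple(0, ["a", "b", "c"])
print(plist.data)        # ['a', 'b', 'c']
print(plist.sync_count)  # 1 (three insert() calls would give 3)
```

The point of the batched method is visible in `sync_count`: one resynchronisation for the whole batch instead of one per element.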
What about conversion if we want to do multiple inserts from one RedBaron tree to another?
Here's an improvement that can handle a broader range of sources (such as FSTs) because it uses _convert_input_to_node_object_list().
I'm not sure how best to do multiple inserts in different places with no synchronisation in between. I guess a data structure holding an array of indices and node lists could be used, but leaving the tree unsync'd seems like it could be a bad idea.
Separately, this seems to trigger an indentation bug, which I will report separately.
--- base_nodes.py.orig 2017-01-02 15:59:58.000000000 -0600
+++ base_nodes.py 2017-11-15 09:42:22.360961636 -0600
@@ -1375,6 +1375,16 @@
         self.data.insert(index, [value, None])
         self._synchronise()
+    # sdh4 11/14/17 add insert_multiple
+    def insert_multiple(self, index, values):
+        """Insert multiple values from an iterable into the ProxyList
+        at the specified index, without resynchronising after each one."""
+        values = self._convert_input_to_node_object_list(values, parent=self.node_list, on_attribute=self.on_attribute)
+        for value in values:
+            self.data.insert(index, [value, None])
+            index += 1
+        self._synchronise()
+
     def append(self, value):
         self.insert(len(self), value)
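A design note on the loop itself: once the values have been converted up front, Python's slice assignment can splice the whole batch into the underlying list in one step, instead of calling `list.insert` once per element. A sketch of that alternative (the standalone `splice` helper is hypothetical, not RedBaron code):

```python
def splice(data, index, values):
    """Insert all `values` into `data` at `index` in a single slice
    assignment, rather than one list.insert call per element."""
    data[index:index] = values
    return data


print(splice([1, 4], 1, [2, 3]))  # [1, 2, 3, 4]
```

For large batches this avoids repeatedly shifting the tail of the list, though in this patch the dominant cost is the resynchronisation, not the Python-level inserts.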
I haven't looked at the LineProxyList algorithm, but I took a pretty close look at the CommaProxyList algorithm a few months ago.
CommaProxyList was super slow for big files while computing the inter-element indentation. (node.indentation is shockingly slow.) I suspect that could be what's happening here. #146 has my changes. A similar change to LineProxyList might fix this and #150.
The speed problem here seems to come from the self._synchronise() call completely rebuilding the inner list and then the node list from the inner list. Doing that after every insertion makes the whole process O(n^2), which blows up quickly for large n.
The patch above addresses the problem and does not seem to generate any other problems (#150 is a separate and independent problem). Shall I make a pull request?