Handling very long arrays
I have recently been experimenting with different ways of dealing with very long arrays. The summary below is partly for the benefit of others, but also to see if there are any other ways to speed things up.
The background is that when transferring very long arrays between Excel and Python, everything works much quicker with a vector array or a single row array than with a 2D array with many rows, even if it is a single column. One way around this is to use the VBA Transpose function to convert a single column into a single row, but this doesn't work if there are more than 2^16 rows.
The options I have looked at are listed below. In Python I have two functions:

@xlfunc
@xlarg("x", "nparray", dims=1)
@xlarg("y", "nparray", dims=1)
def AddVect(x, y):
    # x and y are already converted to 1D numpy arrays by ExcelPython
    x = x + y
    return x

@xlfunc
def AddVect2(x, y):
    # x and y arrive as plain tuples; convert to numpy arrays here
    x = np.array(x)
    y = np.array(y)
    x = x + y
    return x
Using these I have written six VBA UDFs that each transfer two complete columns (i.e. 1,048,576 rows) to Python, read back the result of the Python function, and record the time taken for each step (a Python-side sketch of the row/column difference follows the list):
1: VBA code generated by ExcelPython, plus create a single column array and transfer the contents of the Python return array with a loop.
2: Transfer the Excel arrays as single column variant arrays and convert to numpy arrays in Python with AddVect2. This returns a single column array, which does not need to be transposed.
3: Convert to vector arrays of doubles with a loop in VBA, then convert to numpy arrays in VBA. Convert the results to a column array with a loop.
4: As 3, but transfer as double arrays and convert to numpy arrays in Python.
5: As 4, but convert the column arrays to single row arrays in VBA.
6: As 5, but using variant arrays rather than doubles.
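For anyone who wants to reproduce the row-versus-column effect on the Python side alone, here is a minimal sketch (this is not the VBA test harness; the array size matches the benchmark, but everything else is illustrative) comparing the two shapes that COM delivers:

import time
import numpy as np

N = 1048576  # one full Excel column

# Shapes as delivered over COM: a flat tuple for a vector/row,
# a tuple of 1-tuples for a single column
row_data = tuple(float(i) for i in range(N))
col_data = tuple((float(i),) for i in range(N))

t0 = time.perf_counter()
x_row = np.array(row_data)          # 1D, contiguous
t1 = time.perf_counter()
x_col = np.array(col_data).ravel()  # 2D single column, then flattened
t2 = time.perf_counter()

print("row tuple -> numpy:    %.3f s" % (t1 - t0))
print("column tuple -> numpy: %.3f s" % (t2 - t1))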
Total times for each function were:
1: 1.34 sec
2: 2.28 sec
3: 1.95 sec
4: 0.79 sec
5: 0.82 sec
6: 1.36 sec
So the conclusions are:
- Passing long arrays as vectors, or single row arrays, is much faster than passing single column arrays.
- Function 3 was presumably doubling up some of the operations done in 1, hence the slower time.
- Creating vector arrays of doubles (functions 4 and 5) was significantly faster than converting variant arrays to numpy arrays in VBA, in spite of the need to loop through 1 million+ rows at each end of the operation.
- Looping with variant arrays (Function 6) was very similar in time to the ExcelPython code.
- Even with functions 4 and 5, the data transpose and transfer operations were hugely slower than the addition of the two vectors in Python, which took just a few milliseconds (see the quick check below).
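On that last point, a quick check of the computation alone (just a sketch, with the same array size as the benchmark) confirms the addition is negligible:

import time
import numpy as np

x = np.arange(1048576, dtype=np.float64)
y = np.arange(1048576, dtype=np.float64)

t0 = time.perf_counter()
z = x + y  # the entire Python-side computation
t1 = time.perf_counter()
print("vector add: %.1f ms" % ((t1 - t0) * 1000.0))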
So the message is: if you have long column arrays, transpose them into double vector arrays before transferring to Python, then convert to numpy arrays in Python.
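On the Python side that pattern might look like the sketch below (assuming the vectors arrive as flat tuples of doubles and using the same @xlfunc decorator as above; the function name is just illustrative):

import numpy as np

@xlfunc
def AddVectFast(x, y):
    # x and y arrive as flat tuples of doubles (one element per row)
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    # the computation itself is trivial next to the transfer cost
    return (x + y).tolist()  # hand back a flat row for the VBA side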
And the question is, is there any way to further reduce the overhead of the data transfer?
Uff, tough GitHub issue! I will need to try to get my head around everything you wrote; in fact, sorry for the slow reply.
So I will try to investigate further when I have some time. Generally speaking, though, data transfer is not as fast as it could be because it all passes through COM/PyWin32. Just consider what this means: VBA arrays, which are stored in memory as real contiguous raw data arrays just like numpy arrays, can only be sent to Python as tuples of objects (1D) or tuples of tuples of objects (2D).
But the only way to get around this limitation is to rewrite the way that Excel and Python communicate, i.e. by substituting COM and PyWin32 with some other customized interprocess protocol. This is an idea I have had in the back of my mind for quite a while, but it's a lot of work, and not getting it 100% right would make the library itself less stable.
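To make the contrast concrete, here is a purely illustrative sketch (none of this is ExcelPython code): the shapes pywin32 actually delivers, versus the kind of zero-copy rebuild a raw-memory protocol would allow:

import numpy as np

# What COM marshalling delivers: every element boxed as a Python object
column = ((1.0,), (2.0,), (3.0,))  # 2D range: tuple of 1-tuples
row = (1.0, 2.0, 3.0)              # 1D range: flat tuple

# What a custom protocol could do instead: ship the raw bytes and
# rebuild the array with no per-element boxing
raw = np.arange(1048576, dtype=np.float64).tobytes()  # stand-in for the VBA array's memory
x = np.frombuffer(raw, dtype=np.float64)              # zero-copy view onto those bytes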
Thanks for the reply, Eric.
I can certainly relate to having projects I would like to do, if only I had the time!
To put this in perspective, transferring over 1 million rows is certainly an extreme case. For the applications I have in mind, even very large data sets would more likely be in the 10,000 to 100,000 row range, and the processing in Python/SciPy would involve a lot more than adding two vectors together, so the data transfer time is not all that significant.
That said, if there is a way to speed things up without losing stability, that would be appreciated!