
Efficient column construction from tuple

Open wenleix opened this issue 3 years ago • 1 comments

Column construction from list is optimized with native C++ code (for scalar types), e.g.

import torcharrow as ta
a = ta.Column([1, 2, 3])

This optimization is not applied to tuples, so construction from a tuple still has O(n^2) behavior:

import torcharrow as ta
a = ta.Column((1, 2, 3))

Both Pandas and PyArrow support construction from a tuple, so this is a feature we do want to keep:

>>> import pandas as pd
>>> a = pd.Series((1, 2, 3))
>>> a
0    1
1    2
2    3
dtype: int64

This is actually quite useful, since users sometimes create the data from a list of tuples using zip, e.g.

>>> a = [("a", 1), ("b", 2), ("c", 3)]
>>> list(zip(*a))
[('a', 'b', 'c'), (1, 2, 3)]

I guess the easiest way would be to convert the tuple to a list in Python. I'm not sure how the performance compares with handling the tuple in C++ directly.
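For illustration, the Python-side conversion could look roughly like the sketch below. The helper name `normalize_column_data` is hypothetical and not part of torcharrow; the assumption is that its result would then be handed to the existing list-based construction path.

```python
from typing import Any


def normalize_column_data(data: Any) -> Any:
    """Hypothetical helper: return a list if `data` is a tuple, else `data` unchanged."""
    # list(some_tuple) makes one shallow copy of the elements, which is O(n).
    if isinstance(data, tuple):
        return list(data)
    return data


# Transposing rows with zip yields tuples, which is the case described above.
rows = [("a", 1), ("b", 2), ("c", 3)]
columns = [normalize_column_data(col) for col in zip(*rows)]
print(columns)  # [['a', 'b', 'c'], [1, 2, 3]]
```

This costs one extra copy per column, so whether it beats handling the tuple directly in C++ would need to be measured.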

wenleix avatar Jan 27 '22 05:01 wenleix

pybind11 exposes a py::tuple type on the C++ side, so this should probably be trivial for us to support in the same way we do for lists. I'll investigate.

scotts avatar Jan 28 '22 21:01 scotts