Natively support creating a TorchArrow column from a numpy array

Open scotts opened this issue 3 years ago • 1 comments

If users create a column from a Python list, we actually dispatch that directly to C++. For example,

vals = [1, 2, 3, 4, 5]
col = ta.Column(vals, device="cpu")

That call goes straight to C++ through pybind11: https://github.com/facebookresearch/torcharrow/blob/d680bfdc0f6a6bb6c3a29c2a67d62006782d6558/csrc/velox/lib.cpp#L135-L141

However, if a user creates a column from a numpy array, we currently have to handle it (slowly) in Python. For example,

vals = [1, 2, 3, 4, 5]
arr = numpy.array(vals)
col = ta.Column(arr, device="cpu")

That will be handled only on the Python side: https://github.com/facebookresearch/torcharrow/blob/d680bfdc0f6a6bb6c3a29c2a67d62006782d6558/torcharrow/scope.py#L226-L233

We should be able to handle numpy arrays natively in C++; pybind11 already exposes a numpy array type.
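To illustrate why a native path could avoid per-element Python work, here is a minimal sketch of the buffer metadata that a typed, contiguous array exposes through Python's buffer protocol, which numpy arrays also implement. This is an assumption-laden illustration, not TorchArrow code: the stdlib `array.array` stands in for a numpy array so the snippet has no dependencies, and `describe_buffer` is a hypothetical helper name.

```python
from array import array


def describe_buffer(obj):
    """Return the buffer metadata a native binding could read
    without converting each element to a Python object."""
    with memoryview(obj) as view:
        return {
            "format": view.format,          # struct-style element type code
            "itemsize": view.itemsize,      # bytes per element
            "shape": view.shape,            # dimensions of the buffer
            "contiguous": view.c_contiguous,
            "nbytes": view.nbytes,          # total size in bytes
        }


# Stand-in for numpy.array(vals, dtype="int64"); "q" is a signed 64-bit integer.
vals = array("q", [1, 2, 3, 4, 5])
info = describe_buffer(vals)
print(info)
```

With the element type, item size, and length known up front, a C++ binding could in principle copy or wrap the underlying memory in one pass instead of iterating over Python objects.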

scotts avatar Feb 04 '22 23:02 scotts

Here is the original from_numpy API prototype: https://github.com/facebookresearch/torcharrow/blob/95daa1fabd5a3098be112d487e085e13f5447786/torcharrow/_interop.py#L88-L100

But I don't think we have ever supported it natively in the CPU backend (only in the "demo" backend, where data was stored as a numpy array; that backend was removed in https://github.com/facebookresearch/torcharrow/pull/33).

Some API references:

  • PyTorch's from_numpy: https://pytorch.org/docs/stable/generated/torch.from_numpy.html
  • Interestingly, for pyarrow, I only found to_numpy but not from_numpy: https://arrow.apache.org/docs/python/generated/pyarrow.Array.html#pyarrow.Array.to_numpy

wenleix avatar Feb 04 '22 23:02 wenleix