
[API RFC] TorchArrow CAST support

Open wenleix opened this issue 4 years ago • 2 comments

Use Case

In feature pre-processing we often need to cast feature IDs from int64 to int32. The cast often needs to be done recursively for complex types, such as

  • List<int64> -> List<int32>, or even
  • Struct<List<int64>, Array<float32>> -> Struct<List<int32>, Array<float32>>

This looks like a good API to support in TorchArrow directly.

CAST in PyArrow/SQL

In ANSI SQL, CAST over an array or row type is also defined to apply recursively to each element. This is also the behavior of Presto.

Today PyArrow supports cast for pyarrow.Array. It works for List but not Struct. A PyArrow developer confirmed that Struct support is planned. Here is the JIRA ticket: https://issues.apache.org/jira/browse/ARROW-1888. See also the discussion on the mailing list.

Proposal

TorchArrow should support IColumn.cast, with the cast applied recursively per element for List/Struct, similar to PyArrow/SQL.
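To make the recursive semantics concrete, here is a rough pure-Python sketch. Everything in it is hypothetical: Scalar, ListType, StructType and the cast function are toy stand-ins, not TorchArrow's dtype classes or the IColumn.cast implementation. Raising on integer overflow is also an assumption of this sketch, not a decision made in this RFC.

```python
from dataclasses import dataclass
from typing import Any, Tuple


# Hypothetical toy type descriptors, not TorchArrow's real dtypes.
@dataclass(frozen=True)
class Scalar:
    name: str  # "int32", "int64", "float32" or "string"


@dataclass(frozen=True)
class ListType:
    item: Any  # element type


@dataclass(frozen=True)
class StructType:
    fields: Tuple[Tuple[str, Any], ...]  # (field name, field type) pairs


_INT_BOUNDS = {"int32": (-2**31, 2**31 - 1), "int64": (-2**63, 2**63 - 1)}


def _cast_scalar(value, target):
    if target.name in _INT_BOUNDS:
        result = int(value)
        lo, hi = _INT_BOUNDS[target.name]
        if not lo <= result <= hi:
            # Assumption: out-of-range values raise instead of wrapping.
            raise OverflowError(f"{value!r} does not fit in {target.name}")
        return result
    if target.name == "float32":
        return float(value)  # Python floats are 64-bit; precision loss is not modeled
    if target.name == "string":
        return str(value)
    raise TypeError(f"unsupported target type {target.name}")


def cast(value, source, target):
    """Cast value from source type to target type, recursing into List/Struct."""
    if value is None:
        return None  # nulls pass through unchanged
    if isinstance(source, ListType) and isinstance(target, ListType):
        return [cast(v, source.item, target.item) for v in value]
    if isinstance(source, StructType) and isinstance(target, StructType):
        if len(source.fields) != len(target.fields):
            raise TypeError("struct field count mismatch")
        # Structs are modeled as tuples; fields are matched by position.
        return tuple(
            cast(v, s_type, t_type)
            for v, (_, s_type), (_, t_type) in zip(value, source.fields, target.fields)
        )
    if isinstance(source, Scalar) and isinstance(target, Scalar):
        return _cast_scalar(value, target)
    raise TypeError(f"cannot cast {source} to {target}")


src = StructType((("ids", ListType(Scalar("int64"))), ("weights", ListType(Scalar("float32")))))
dst = StructType((("ids", ListType(Scalar("int32"))), ("weights", ListType(Scalar("float32")))))
row = ([1, 2, 3], [0.5, 1.5])
result = cast(row, src, dst)

# A less trivial conversion: integer elements to strings.
as_strings = cast([1, 2], ListType(Scalar("int64")), ListType(Scalar("string")))
```

The point of the sketch is that the caller only names the target type once, and the element-level conversion follows the shape of the type.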

Current API

We currently have IColumn.astype and the current prototype only supports casts between numeric types: https://github.com/facebookresearch/torcharrow/blob/902f177a8002a71189dcddf931b8484c69a06c6d/torcharrow/icolumn.py#L235-L249

Similar API in other libraries:

  • PyTorch uses Tensor.to : https://pytorch.org/docs/stable/generated/torch.Tensor.to.html
  • Pandas/NumPy use astype: https://pandas.pydata.org/pandas-docs/version/0.23/generated/pandas.Series.astype.html, https://numpy.org/doc/stable/reference/generated/numpy.ndarray.astype.html

cast seems to be a better name, since in TorchArrow the conversion is closer to PyArrow/SQL semantics and may not be trivial (e.g. it may cast from integer to string, or vice versa), and cast suggests a non-trivial conversion. The names to and astype seem to hint at more trivial conversions (e.g. conversion between integer and float).

wenleix, Oct 28 '21 05:10

astype has been renamed to cast in https://github.com/facebookresearch/torcharrow/pull/43

wenleix, Jan 20 '22 21:01

In PyTorch, there is also Tensor.type_as : https://pytorch.org/docs/stable/generated/torch.Tensor.type_as.html#torch.Tensor.type_as

wenleix, Jun 28 '22 18:06