torcharrow
[API RFC] TorchArrow CAST support
Use Case
In feature pre-processing we often need to cast feature IDs from int64 to int32. The cast often needs to be done recursively for complex types, such as
List<int64> -> ARRAY<int32>, or even Struct<list<int64>, array<float32>> -> struct<list<int32>, array<float32>>.
This looks like a good API to support in TorchArrow directly.
CAST in PyArrow/SQL
In ANSI SQL, CAST over array/row types is also defined to apply recursively to each element. This is also the behavior of Presto.
Today PyArrow supports cast for pyarrow.Array. It works for List but not for Struct. A PyArrow developer confirmed that Struct support is planned. Here is the JIRA ticket: https://issues.apache.org/jira/browse/ARROW-1888. See also the discussion on the mailing list.
Proposal
TorchArrow supports IColumn.cast, and the cast will be done recursively per element for List/Struct, similar to PyArrow/SQL.
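The recursive, per-element semantics proposed above can be sketched in plain Python. This is only an illustration of the intended behavior, not TorchArrow's implementation: the `Scalar`, `List_`, and `Struct` descriptors and the `cast` function are hypothetical names invented for this example, and raising `OverflowError` on out-of-range integers is an assumption (the RFC does not specify overflow handling).

```python
from dataclasses import dataclass
from typing import Any, Tuple


# Hypothetical type descriptors for illustration only;
# these are NOT TorchArrow's dtype classes.
@dataclass(frozen=True)
class Scalar:
    name: str  # e.g. "int32", "int64", "float32", "string"


@dataclass(frozen=True)
class List_:
    item: Any  # element type


@dataclass(frozen=True)
class Struct:
    fields: Tuple[Tuple[str, Any], ...]  # (field name, field type) pairs


_INT_BITS = {"int32": 32, "int64": 64}


def _cast_scalar(value: Any, target: Scalar) -> Any:
    """Cast one non-null leaf value to the target scalar type."""
    if target.name in _INT_BITS:
        bits = _INT_BITS[target.name]
        result = int(value)
        lo, hi = -(1 << (bits - 1)), (1 << (bits - 1)) - 1
        # Assumption: out-of-range values are an error rather than wrapped.
        if not lo <= result <= hi:
            raise OverflowError(f"{result} does not fit in {target.name}")
        return result
    if target.name in ("float32", "float64"):
        # Python floats are doubles; float32 rounding is not modeled here.
        return float(value)
    if target.name == "string":
        return str(value)
    raise TypeError(f"unsupported target type: {target.name}")


def cast(value: Any, target: Any) -> Any:
    """Recursively cast a nested Python value to the target type."""
    if value is None:  # nulls pass through at every nesting level
        return None
    if isinstance(target, List_):
        return [cast(v, target.item) for v in value]
    if isinstance(target, Struct):
        return {name: cast(value[name], ftype) for name, ftype in target.fields}
    return _cast_scalar(value, target)


# Struct<list<int64>, list<float32>> -> Struct<list<int32>, list<float32>>
target = Struct((
    ("ids", List_(Scalar("int32"))),
    ("weights", List_(Scalar("float32"))),
))
row = {"ids": [1, 2, None], "weights": [0.5, 1.0]}
converted = cast(row, target)
```

The key design point is that the target type drives the recursion: List and Struct targets only dispatch to their children, and the actual conversion happens at the scalar leaves, which mirrors how CAST over array/row types is described for SQL.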
Current API
We currently have IColumn.astype and the current prototype only supports casts between numeric types: https://github.com/facebookresearch/torcharrow/blob/902f177a8002a71189dcddf931b8484c69a06c6d/torcharrow/icolumn.py#L235-L249
Similar API in other libraries:
- PyTorch uses Tensor.to: https://pytorch.org/docs/stable/generated/torch.Tensor.to.html
- Pandas/NumPy use astype: https://pandas.pydata.org/pandas-docs/version/0.23/generated/pandas.Series.astype.html, https://numpy.org/doc/stable/reference/generated/numpy.ndarray.astype.html
cast seems to be a better name: in TorchArrow the conversion is closer to PyArrow/SQL semantics and may not be trivial (e.g. it may cast from integer to string, or vice versa), and cast suggests a non-trivial conversion. The names to and astype seem to hint at more trivial conversions (e.g. between integer and float).
astype has since been renamed to cast in https://github.com/facebookresearch/torcharrow/pull/43
In PyTorch, there is also Tensor.type_as : https://pytorch.org/docs/stable/generated/torch.Tensor.type_as.html#torch.Tensor.type_as