torcharrow
High performance model preprocessing library on PyTorch
We should remove the `_offset` and `_length` in `BaseColumn`: https://github.com/facebookresearch/torcharrow/blob/d680bfdc0f6a6bb6c3a29c2a67d62006782d6558/csrc/velox/column.h#L223-L224 There are multiple places where we do not properly track them, such as in expression evaluation: https://github.com/facebookresearch/torcharrow/blob/d680bfdc0f6a6bb6c3a29c2a67d62006782d6558/csrc/velox/column.cpp#L236-L238 We should be...
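To make the hazard concrete, here is a minimal Python sketch (the class and method names are hypothetical, not TorchArrow's): when a column view carries an `(offset, length)` window, every operation must re-apply that window, and any code path that forgets to do so reads the underlying buffer from position 0.

```python
# Illustrative sketch of the offset/length tracking hazard.
# ColumnView and its methods are hypothetical stand-ins.

class ColumnView:
    def __init__(self, buffer, offset=0, length=None):
        self._buffer = buffer
        self._offset = offset
        self._length = len(buffer) if length is None else length

    def values(self):
        # correct: honors the (offset, length) window
        return self._buffer[self._offset:self._offset + self._length]

    def values_buggy(self):
        # bug: ignores offset/length, analogous to the linked code path
        return self._buffer

view = ColumnView([10, 20, 30, 40], offset=1, length=2)
assert view.values() == [20, 30]          # windowed read
assert view.values_buggy() != view.values()  # silent correctness bug
```

Removing `_offset`/`_length` (or centralizing where they are applied) eliminates this whole class of bug.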
# Current Status

In TorchArrow, the interface names are `ta.IDataFrame`/`ta.IColumn` while the factory methods are `ta.DataFrame`/`ta.Column`:

```python
import torcharrow as ta

a = ta.Column([1, 2, 3])
assert isinstance(a, ta.IColumn)
assert...
```
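The interface/factory split can be sketched in plain Python as follows; the names `IColumn`/`Column` mirror TorchArrow's, but the bodies here are illustrative only:

```python
# Hedged sketch of the interface/factory pattern described above.
from abc import ABC, abstractmethod

class IColumn(ABC):
    """Abstract interface: users type-check against this."""
    @abstractmethod
    def __len__(self):
        ...

class _NumericalColumnCpu(IColumn):
    """Concrete backend class; users never name it directly."""
    def __init__(self, data):
        self._data = list(data)
    def __len__(self):
        return len(self._data)

def Column(data, device="cpu"):
    """Factory: picks a concrete implementation, returns it as an IColumn."""
    return _NumericalColumnCpu(data)

a = Column([1, 2, 3])
assert isinstance(a, IColumn)
assert len(a) == 3
```

The factory hides which backend class is instantiated, which is why `isinstance` checks go against the `I*` interface rather than a concrete class.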
If users create a column from a Python list, we actually dispatch that directly to C++. For example,

```python
vals = [1, 2, 3, 4, 5]
col = ta.Column(vals, device="cpu")...
```
Column construction from a list is optimized with native C++ code (for scalar types), e.g.

```python
import torcharrow as ta

a = ta.Column([1, 2, 3])
```

This optimization is not done...
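The routing logic can be sketched as follows: a factory that sends lists of scalar types to a (hypothetical) native batch constructor and everything else to a generic per-element path. All names here are illustrative, not TorchArrow's internals.

```python
# Sketch of a scalar-type fast path for list construction.
SCALAR_TYPES = (bool, int, float)

def _from_pylist_native(vals):
    # stand-in for a C++ batch constructor that copies the whole list at once
    return ("native", list(vals))

def _from_pylist_generic(vals):
    # generic fallback: append one element at a time from Python
    out = []
    for v in vals:
        out.append(v)
    return ("generic", out)

def Column(vals):
    if vals and all(isinstance(v, SCALAR_TYPES) for v in vals):
        return _from_pylist_native(vals)
    return _from_pylist_generic(vals)

assert Column([1, 2, 3])[0] == "native"
assert Column([{"a": 1}])[0] == "generic"
```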
# Native kernel binding for cast in CPU backend

## Background: TorchArrow Native Kernel Dispatch

For efficiency, many TorchArrow operations (e.g. `INumerialColumn.abs()`) are dispatched to the Velox C++...
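The dispatch pattern can be sketched like this: a column method first looks for a bound native kernel and falls back to a per-element Python loop only when none exists. The kernel table and class below are illustrative stand-ins, not TorchArrow's actual binding layer.

```python
# Hypothetical sketch of native-kernel dispatch with a Python fallback.

_native_kernels = {
    # stand-in for kernels bound from Velox C++
    "abs": lambda data: [abs(x) for x in data],
}

class NumericalColumn:
    def __init__(self, data):
        self._data = data

    def _dispatch(self, op, py_fallback):
        kernel = _native_kernels.get(op)
        if kernel is not None:
            return NumericalColumn(kernel(self._data))  # fast native path
        # slow path: per-element Python loop
        return NumericalColumn([py_fallback(x) for x in self._data])

    def abs(self):
        return self._dispatch("abs", abs)

    def neg(self):
        # no native kernel bound for "neg": exercises the Python fallback
        return self._dispatch("neg", lambda x: -x)

assert NumericalColumn([-1, 2]).abs()._data == [1, 2]
assert NumericalColumn([-1, 2]).neg()._data == [1, -2]
```

Binding a native kernel for `cast` amounts to adding one more entry on the fast path so that dtype conversion no longer falls through to the Python loop.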
Motivating example (the actual dataset has two struct columns, with 13 and 26 fields respectively):

```python
dtype = dt.Struct(
    [
        dt.Field("labels", dt.int8),
        dt.Field("dense_features", dt.Struct([dt.Field("int_1", dt.int32), dt.Field("int_2", dt.int32)])),
    ]
)...
```
See https://github.com/facebookresearch/torcharrow/pull/100 for details. Another wild idea is to implement `to_pylist` at the C++ `BaseColumn` level, so that the Python object is constructed recursively in C++ code.
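A minimal Python sketch of the recursive `to_pylist` idea: walk the column tree once and build nested Python objects, instead of crossing the Python/C++ boundary per element. The column classes here are illustrative stand-ins, not TorchArrow's.

```python
# Sketch of recursive to_pylist over a nested column structure.

class SimpleColumn:
    def __init__(self, values):
        self._values = values
    def to_pylist(self):
        return list(self._values)

class StructColumn:
    def __init__(self, fields):
        # fields: {field name: child column}
        self._fields = fields
    def to_pylist(self):
        # recurse into children once, then zip rows into dicts
        children = {name: col.to_pylist() for name, col in self._fields.items()}
        n = len(next(iter(children.values())))
        return [{name: vals[i] for name, vals in children.items()} for i in range(n)]

col = StructColumn({"a": SimpleColumn([1, 2]), "b": SimpleColumn(["x", "y"])})
assert col.to_pylist() == [{"a": 1, "b": "x"}, {"a": 2, "b": "y"}]
```

Doing the same recursion in C++ (with the Python C API or pybind11) would amortize the boundary-crossing cost to one call per column tree rather than one per element.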
`IColumn` `sum`/`mean`/`std`/`median`/`quantile`/`mode`/`all`/`any`: https://github.com/facebookresearch/torcharrow/blob/380e1cbaf334b49d52242596c79627d456ef3b0d/torcharrow/icolumn.py#L1206-L1292 Also remove the Python implementation in the cpu backend (if there is one), since once zero-copy interop with Arrow is implemented, it will be more efficient to use Arrow Compute. Eventually we...
For a single column, delegating to the Arrow Array seems to be a good initial approach. Arrow arrays support `fill_null`/`drop_null`, so we can first call `to_arrow`, then call `fill_null`/`drop_null` on the Arrow array,...