torcharrow
High performance model preprocessing library on PyTorch
We should remove the `_offset` and `_length` in `BaseColumn`: https://github.com/facebookresearch/torcharrow/blob/d680bfdc0f6a6bb6c3a29c2a67d62006782d6558/csrc/velox/column.h#L223-L224 There are multiple places where we do not properly track them, such as in expression evaluation: https://github.com/facebookresearch/torcharrow/blob/d680bfdc0f6a6bb6c3a29c2a67d62006782d6558/csrc/velox/column.cpp#L236-L238 We should be...
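To make the hazard concrete, here is a minimal Python sketch (the class and method names are hypothetical, not TorchArrow's): when a column view carries an `(offset, length)` window, every operation must re-apply that window, and any code path that forgets to do so reads the underlying buffer from position 0.

```python
# Illustrative sketch of the offset/length tracking hazard.
# ColumnView and its methods are hypothetical stand-ins.

class ColumnView:
    def __init__(self, buffer, offset=0, length=None):
        self._buffer = buffer
        self._offset = offset
        self._length = len(buffer) if length is None else length

    def values(self):
        # correct: honors the (offset, length) window
        return self._buffer[self._offset:self._offset + self._length]

    def values_buggy(self):
        # bug: ignores offset/length, analogous to the linked code path
        return self._buffer

view = ColumnView([10, 20, 30, 40], offset=1, length=2)
assert view.values() == [20, 30]          # windowed read
assert view.values_buggy() != view.values()  # silent correctness bug
```

Removing `_offset`/`_length` (or centralizing where they are applied) eliminates this whole class of bug.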
# Current Status

In TorchArrow, the interface names are `ta.IDataFrame`/`ta.IColumn` while the factory methods are `ta.DataFrame`/`ta.Column`:

```python
import torcharrow as ta

a = ta.Column([1, 2, 3])
assert isinstance(a, ta.IColumn)
assert...
```
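The interface/factory split can be sketched in plain Python as follows; the names `IColumn`/`Column` mirror TorchArrow's, but the bodies here are illustrative only:

```python
# Hedged sketch of the interface/factory pattern described above.
from abc import ABC, abstractmethod

class IColumn(ABC):
    """Abstract interface: users type-check against this."""
    @abstractmethod
    def __len__(self):
        ...

class _NumericalColumnCpu(IColumn):
    """Concrete backend class; users never name it directly."""
    def __init__(self, data):
        self._data = list(data)
    def __len__(self):
        return len(self._data)

def Column(data, device="cpu"):
    """Factory: picks a concrete implementation, returns it as an IColumn."""
    return _NumericalColumnCpu(data)

a = Column([1, 2, 3])
assert isinstance(a, IColumn)
assert len(a) == 3
```

The factory hides which backend class is instantiated, which is why `isinstance` checks go against the `I*` interface rather than a concrete class.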
If users create a column from a Python list, we actually dispatch that directly to C++. For example,

```python
vals = [1, 2, 3, 4, 5]
col = ta.Column(vals, device="cpu")...
```
Column construction from a list is optimized with native C++ code (for scalar types), e.g.

```python
import torcharrow as ta

a = ta.Column([1, 2, 3])
```

This optimization is not done...
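The routing logic can be sketched as follows: a factory that sends lists of scalar types to a (hypothetical) native batch constructor and everything else to a generic per-element path. All names here are illustrative, not TorchArrow's internals.

```python
# Sketch of a scalar-type fast path for list construction.
SCALAR_TYPES = (bool, int, float)

def _from_pylist_native(vals):
    # stand-in for a C++ batch constructor that copies the whole list at once
    return ("native", list(vals))

def _from_pylist_generic(vals):
    # generic fallback: append one element at a time from Python
    out = []
    for v in vals:
        out.append(v)
    return ("generic", out)

def Column(vals):
    if vals and all(isinstance(v, SCALAR_TYPES) for v in vals):
        return _from_pylist_native(vals)
    return _from_pylist_generic(vals)

assert Column([1, 2, 3])[0] == "native"
assert Column([{"a": 1}])[0] == "generic"
```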
# Native kernel binding for cast in CPU backend

## Background: TorchArrow Native Kernel Dispatch

For efficiency, many TorchArrow operations (e.g. `INumerialColumn.abs()`) are dispatched to the Velox C++...
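The dispatch pattern can be sketched like this: a column method first looks for a bound native kernel and falls back to a per-element Python loop only when none exists. The kernel table and class below are illustrative stand-ins, not TorchArrow's actual binding layer.

```python
# Hypothetical sketch of native-kernel dispatch with a Python fallback.

_native_kernels = {
    # stand-in for kernels bound from Velox C++
    "abs": lambda data: [abs(x) for x in data],
}

class NumericalColumn:
    def __init__(self, data):
        self._data = data

    def _dispatch(self, op, py_fallback):
        kernel = _native_kernels.get(op)
        if kernel is not None:
            return NumericalColumn(kernel(self._data))  # fast native path
        # slow path: per-element Python loop
        return NumericalColumn([py_fallback(x) for x in self._data])

    def abs(self):
        return self._dispatch("abs", abs)

    def neg(self):
        # no native kernel bound for "neg": exercises the Python fallback
        return self._dispatch("neg", lambda x: -x)

assert NumericalColumn([-1, 2]).abs()._data == [1, 2]
assert NumericalColumn([-1, 2]).neg()._data == [1, -2]
```

Binding a native kernel for `cast` amounts to adding one more entry on the fast path so that dtype conversion no longer falls through to the Python loop.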
Motivating example (the actual dataset has two struct columns, with 13 and 26 fields respectively):

```python
dtype = dt.Struct(
    [
        dt.Field("labels", dt.int8),
        dt.Field("dense_features", dt.Struct([dt.Field("int_1", dt.int32), dt.Field("int_2", dt.int32)])),
    ]
)...
```
See https://github.com/facebookresearch/torcharrow/pull/100 for details. Another wild idea is to implement `to_pylist` at the C++ `BaseColumn` level, so that the Python object is constructed recursively in C++ code.
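A minimal Python sketch of the recursive `to_pylist` idea: walk the column tree once and build nested Python objects, instead of crossing the Python/C++ boundary per element. The column classes here are illustrative stand-ins, not TorchArrow's.

```python
# Sketch of recursive to_pylist over a nested column structure.

class SimpleColumn:
    def __init__(self, values):
        self._values = values
    def to_pylist(self):
        return list(self._values)

class StructColumn:
    def __init__(self, fields):
        # fields: {field name: child column}
        self._fields = fields
    def to_pylist(self):
        # recurse into children once, then zip rows into dicts
        children = {name: col.to_pylist() for name, col in self._fields.items()}
        n = len(next(iter(children.values())))
        return [{name: vals[i] for name, vals in children.items()} for i in range(n)]

col = StructColumn({"a": SimpleColumn([1, 2]), "b": SimpleColumn(["x", "y"])})
assert col.to_pylist() == [{"a": 1, "b": "x"}, {"a": 2, "b": "y"}]
```

Doing the same recursion in C++ (with the Python C API or pybind11) would amortize the boundary-crossing cost to one call per column tree rather than one per element.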
`IColumn` `sum`/`mean`/`std`/`median`/`quantile`/`mode`/`all`/`any`: https://github.com/facebookresearch/torcharrow/blob/380e1cbaf334b49d52242596c79627d456ef3b0d/torcharrow/icolumn.py#L1206-L1292 Also remove the Python implementation in the cpu backend (if there is one), since once zero-copy interop with Arrow is implemented, it will be more efficient to use Arrow Compute. Eventually we...
For a single column, delegating to the Arrow Array seems to be a good initial approach. Arrow arrays support `fill_null`/`drop_null`, so we can first call `to_arrow`, then call `fill_null`/`drop_null` on the Arrow array,...