torcharrow
High performance model preprocessing library on PyTorch
By convention, submodules live in a root-level `third_party` directory (e.g. consider `PyTorch` or `TorchAudio`). For historical reasons, TorchArrow puts them at `csrc/velox`: https://github.com/facebookresearch/torcharrow/tree/main/csrc/velox
``` [670/694] Linking CXX shared module csrc/velox/_torcharrow.cpython-38-darwin.so ld: warning: direct access in function 'facebook::torcharrow::declareArrayType(pybind11::module_&)' from file 'csrc/velox/CMakeFiles/_torcharrow.dir/lib.cpp.o' to global weak symbol 'facebook::velox::ArrayType::elementType() const' from file 'csrc/velox/velox/velox/functions/prestosql/CMakeFiles/velox_functions_prestosql.dir/ArrayContains.cpp.o' means the weak symbol...
This is an automated pull request to update the first-party submodule for [facebookincubator/velox](https://github.com/facebookincubator/velox). New submodule commit: https://github.com/facebookincubator/velox/commit/abe7604bcd66f4fa96c5d21a643d0261efe07a8c Test Plan: Ensure that CI jobs succeed on GitHub before landing.
Hi, I noticed that some data preprocessing operations used in recommendation systems, such as `bucketize`, `sigridHash`, and `firstX`, are implemented in [torcharrow/tree/main/csrc/velox/functions/rec](https://github.com/pytorch/torcharrow/tree/main/csrc/velox/functions/rec). I would like to ask if other preprocessing operations...
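For readers unfamiliar with these operations, here is a minimal pure-Python sketch of what bucketizing and "first X" truncation usually mean in recommendation preprocessing. It does not use TorchArrow or Velox; the function names `bucketize_sketch` and `first_x_sketch` are hypothetical, and the boundary convention (a value equal to a border falls into the lower bucket, as in `torch.bucketize` with `right=False`) is an assumption rather than a statement about the implementation in `csrc/velox/functions/rec`.

```python
from bisect import bisect_left
from typing import List, Sequence


def bucketize_sketch(values: Sequence[float], borders: Sequence[float]) -> List[int]:
    """Map each value to the index of the bucket it falls into.

    Assumption: borders are sorted ascending, and bucket i holds values v
    with borders[i-1] < v <= borders[i].
    """
    return [bisect_left(borders, v) for v in values]


def first_x_sketch(items: Sequence[int], x: int) -> List[int]:
    """Keep at most the first x elements of a list-valued feature (assumed meaning)."""
    return list(items[:x])


if __name__ == "__main__":
    print(bucketize_sketch([1, 5, 10], [2, 5, 8]))
    print(first_x_sketch([1, 2, 3, 4], 2))
```

`sigridHash` is left out on purpose, since its exact hashing scheme is not described here.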
Hi, this looks like a really interesting project! I see that torcharrow currently runs on CPU with the Velox backend. Are there any plans to offload some of the ops to...
Summary: Tests and benchmarks targets are now de-coupled. That means they can be built independently. Shared functionality is moved to a common utility library. Resolves https://github.com/facebookincubator/velox/issues/1704 X-link: https://github.com/facebookincubator/velox/pull/2439 Reviewed By:...
I'm asking for myself and for my algorithm team members at my company. We currently have petabyte-scale data, split into Parquet files across different remote HDFS paths (per...
`save-state` and `set-output` commands used in GitHub Actions are deprecated and [GitHub recommends using environment files](https://github.blog/changelog/2023-07-24-github-actions-update-on-save-state-and-set-output-commands/). This PR updates the usage of `set-output` to `$GITHUB_OUTPUT` Instructions for envvar usage from...
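As an illustration of the environment-file approach, the sketch below sets a step output from a Python script by appending a `name=value` line to the file whose path GitHub Actions provides in the `GITHUB_OUTPUT` environment variable. The helper name `set_output` is hypothetical, and the sketch only handles single-line values.

```python
import os


def set_output(name: str, value: str) -> None:
    """Append a step output to the file named by $GITHUB_OUTPUT.

    Replaces the deprecated `::set-output name=...::value` workflow command.
    Assumption: the value is a single line (multiline values need the
    delimiter syntax, which this sketch does not cover).
    """
    output_path = os.environ["GITHUB_OUTPUT"]
    with open(output_path, "a", encoding="utf-8") as fh:
        fh.write(f"{name}={value}\n")
```

A later step can then read the value through `steps.<step_id>.outputs.<name>`, as it could with the old command.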
### 🚀 The feature, motivation and pitch We're working on supporting bf16 in [lance format](https://github.com/lancedb/lance), which will be presented as a bf16 extension type in Arrow (see PR for details:...
I want to use a datapipe to read Parquet files in which images are stored as binary, but I got an error: ``` NotImplementedError: Unsupported Arrow type: binary ``` So I wonder...