feat: add multi-axis support for xt::roll (numpy.roll parity)
Checklist
- [x] The title and commit message(s) are descriptive.
- [x] Small commits made to fix your PR have been squashed to avoid history pollution.
- [x] Tests have been added for new features or bug fixes.
- [x] API of new functions and classes are documented.
Description
This PR extends the xt::roll function to support multiple axes simultaneously, completing the NumPy-compatible API as originally requested.
Related: #1766, #1823
The original implementation in #1823 only supported single-axis roll operations. According to NumPy's roll signature, both shift and axis parameters can be tuples of ints:
numpy.roll(a, shift, axis=None)
# shift : int or tuple of ints
# axis : int or tuple of ints, optional
This PR adds the missing multi-axis support:
xt::xarray<int> a = {{1, 2, 3}, {4, 5, 6}, {7, 8, 9}};
auto result = xt::roll(a, {1, 2}, {0, 1}); // roll 1 on axis 0, roll 2 on axis 1
Commits
1. feat: add multi-axis support for xt::roll with optimized pointer arithmetic
- Implemented
detail::roll_multi()using recursive pointer arithmetic algorithm extended from the existing single-axis implementation - Added two new
xt::roll()overloads: one for containers (std::vector,std::array) and one for C-style arrays - Supports accumulated shifts when the same axis appears multiple times (NumPy-compatible behavior)
- Supports negative axis indices
- Added comprehensive unit tests covering various scenarios (2D/3D, negative shifts, different container types, column-major layout, etc.)
2. docs: add multi-axis roll examples to documentation
- Updated
docs/source/numpy.rstwith NumPy-to-xtensor comparison for multi-axis roll - Updated
docs/source/quickref/manipulation.rstwith usage examples
3. bench: add benchmark for multi-axis roll
- Added
benchmark/benchmark_roll.cppwith comprehensive benchmarks comparing sequential single-axis rolls vs. the new multi-axis implementation - Benchmarks cover 2D to 5D tensors, various sizes, and realistic image processing scenarios (RGB 1080p to 8K)
Build & Test
# Run unit tests
mkdir build && cd build
cmake .. -DBUILD_TESTS=ON -DDOWNLOAD_GTEST=ON
make -j8 test_xmanipulation
./test/test_xmanipulation --test-case="*roll*" -s
# Run benchmarks
cd benchmark
mkdir build && cd build
cmake .. -DDOWNLOAD_GBENCHMARK=ON -Dxtensor_DIR=$(realpath ../../build)
make -j8 benchmark_xtensor
./benchmark_xtensor --benchmark_filter=roll
📊 Benchmark Results (click to expand)
Test Environment: Intel Core i7-12700H, 16GB RAM, Ubuntu 24.04.3 LTS
Run on (20 X 4600 MHz CPU s)
CPU Caches:
L1 Data 48 KiB (x10)
L1 Instruction 32 KiB (x10)
L2 Unified 1280 KiB (x10)
L3 Unified 24576 KiB (x1)
---------------------------------------------------------------------------------------------------------
Benchmark Time CPU Iterations UserCounters...
---------------------------------------------------------------------------------------------------------
detail::roll_3d_3axes_sequential/128/r0.01 11735715 ns 11731715 ns 51 bytes_per_second=1.33Gi/s
detail::roll_3d_3axes_multi/128/r0.01 1184996 ns 1184922 ns 574 bytes_per_second=13.19Gi/s
detail::roll_3d_3axes_sequential/128/r0.3 12288778 ns 12288030 ns 56 bytes_per_second=1.27Gi/s
detail::roll_3d_3axes_multi/128/r0.3 1267342 ns 1267286 ns 548 bytes_per_second=12.33Gi/s
detail::roll_2d_sequential/64x64 1841 ns 1841 ns 382473 bytes_per_second=16.57Gi/s
detail::roll_2d_multi/64x64 854 ns 854 ns 821232 bytes_per_second=35.74Gi/s
detail::roll_2d_sequential/256x256 30787 ns 30784 ns 22481 bytes_per_second=15.86Gi/s
detail::roll_2d_multi/256x256 12725 ns 12724 ns 53445 bytes_per_second=38.37Gi/s
detail::roll_2d_sequential/1024x1024 963154 ns 963069 ns 736 bytes_per_second=8.11Gi/s
detail::roll_2d_multi/1024x1024 419272 ns 419185 ns 1857 bytes_per_second=18.64Gi/s
detail::roll_3d_2axes_sequential/128x128x128 10321186 ns 10321178 ns 66 bytes_per_second=1.51Gi/s
detail::roll_3d_2axes_multi/128x128x128 1200420 ns 1200325 ns 561 bytes_per_second=13.02Gi/s
detail::roll_4d_4axes_sequential/16 95978 ns 95957 ns 7380 bytes_per_second=5.09Gi/s
detail::roll_4d_4axes_multi/16 26161 ns 26162 ns 26964 bytes_per_second=18.66Gi/s
detail::roll_4d_4axes_sequential/32 10419965 ns 10419994 ns 67 bytes_per_second=767.76Mi/s
detail::roll_4d_4axes_multi/32 422750 ns 422731 ns 1665 bytes_per_second=18.48Gi/s
detail::roll_5d_5axes_sequential/16 11791085 ns 11790205 ns 61 bytes_per_second=678.53Mi/s
detail::roll_5d_5axes_multi/16 508597 ns 508588 ns 1388 bytes_per_second=15.36Gi/s
detail::roll_5d_5axes_sequential/32 482957914 ns 482920008 ns 2 bytes_per_second=530.11Mi/s
detail::roll_5d_5axes_multi/32 100002192 ns 99992129 ns 7 bytes_per_second=2.50Gi/s
detail::roll_3d_2axes_sequential/rgb_1080p 36946640 ns 36943762 ns 19 bytes_per_second=1.25Gi/s
detail::roll_3d_2axes_multi/rgb_1080p 17183395 ns 17182811 ns 39 bytes_per_second=2.70Gi/s
detail::roll_3d_2axes_sequential/rgb_4K 158555547 ns 158541395 ns 4 bytes_per_second=1.17Gi/s
detail::roll_3d_2axes_multi/rgb_4K 75991080 ns 75982624 ns 9 bytes_per_second=2.44Gi/s
detail::roll_3d_2axes_sequential/rgb_8K 606234407 ns 606221933 ns 1 bytes_per_second=1.22Gi/s
detail::roll_3d_2axes_multi/rgb_8K 294193380 ns 294178950 ns 2 bytes_per_second=2.52Gi/s
Performance Summary
The multi-axis implementation achieves significant speedups over sequential single-axis roll calls:
| Scenario | Speedup |
|---|---|
| 2D tensors | ~2-3x |
| 3D tensors (2-3 axes) | ~2-10x |
| 4D tensors (4 axes) | ~4-25x |
| 5D tensors (5 axes) | ~5-20x |
| RGB images (1080p-8K) | ~2x |
The performance gain increases with the number of axes being rolled, as the multi-axis version avoids creating intermediate temporary arrays and performs the roll operation in a single pass through the data.