feat: add multi-axis support for xt::roll (numpy.roll parity)

Open f14XuanLv opened this issue 1 month ago • 0 comments

Checklist

[x] The title and commit message(s) are descriptive.
[x] Small commits made to fix your PR have been squashed to avoid history pollution.
[x] Tests have been added for new features or bug fixes.
[x] API of new functions and classes are documented.

Description

This PR extends the xt::roll function to support multiple axes simultaneously, completing the NumPy-compatible API as originally requested.

Related: #1766, #1823

The original implementation in #1823 only supported single-axis roll operations. According to NumPy's roll signature, both shift and axis parameters can be tuples of ints:

numpy.roll(a, shift, axis=None)
# shift : int or tuple of ints
# axis : int or tuple of ints, optional

This PR adds the missing multi-axis support:

xt::xarray<int> a = {{1, 2, 3}, {4, 5, 6}, {7, 8, 9}};
auto result = xt::roll(a, {1, 2}, {0, 1});  // roll 1 on axis 0, roll 2 on axis 1

Commits

1. `feat: add multi-axis support for xt::roll with optimized pointer arithmetic`

Implemented detail::roll_multi() using recursive pointer arithmetic algorithm extended from the existing single-axis implementation
Added two new xt::roll() overloads: one for containers (std::vector, std::array) and one for C-style arrays
Supports accumulated shifts when the same axis appears multiple times (NumPy-compatible behavior)
Supports negative axis indices
Added comprehensive unit tests covering various scenarios (2D/3D, negative shifts, different container types, column-major layout, etc.)

2. `docs: add multi-axis roll examples to documentation`

Updated docs/source/numpy.rst with NumPy-to-xtensor comparison for multi-axis roll
Updated docs/source/quickref/manipulation.rst with usage examples

3. `bench: add benchmark for multi-axis roll`

Added benchmark/benchmark_roll.cpp with comprehensive benchmarks comparing sequential single-axis rolls vs. the new multi-axis implementation
Benchmarks cover 2D to 5D tensors, various sizes, and realistic image processing scenarios (RGB 1080p to 8K)

Build & Test

# Run unit tests
mkdir build && cd build
cmake .. -DBUILD_TESTS=ON -DDOWNLOAD_GTEST=ON
make -j8 test_xmanipulation
./test/test_xmanipulation --test-case="*roll*" -s

# Run benchmarks
cd benchmark
mkdir build && cd build
cmake .. -DDOWNLOAD_GBENCHMARK=ON -Dxtensor_DIR=$(realpath ../../build)
make -j8 benchmark_xtensor
./benchmark_xtensor --benchmark_filter=roll

📊 Benchmark Results (click to expand)

Test Environment: Intel Core i7-12700H, 16GB RAM, Ubuntu 24.04.3 LTS

Run on (20 X 4600 MHz CPU s)
CPU Caches:
  L1 Data 48 KiB (x10)
  L1 Instruction 32 KiB (x10)
  L2 Unified 1280 KiB (x10)
  L3 Unified 24576 KiB (x1)

---------------------------------------------------------------------------------------------------------
Benchmark                                               Time             CPU   Iterations UserCounters...
---------------------------------------------------------------------------------------------------------
detail::roll_3d_3axes_sequential/128/r0.01       11735715 ns     11731715 ns           51 bytes_per_second=1.33Gi/s
detail::roll_3d_3axes_multi/128/r0.01             1184996 ns      1184922 ns          574 bytes_per_second=13.19Gi/s
detail::roll_3d_3axes_sequential/128/r0.3        12288778 ns     12288030 ns           56 bytes_per_second=1.27Gi/s
detail::roll_3d_3axes_multi/128/r0.3              1267342 ns      1267286 ns          548 bytes_per_second=12.33Gi/s
detail::roll_2d_sequential/64x64                     1841 ns         1841 ns       382473 bytes_per_second=16.57Gi/s
detail::roll_2d_multi/64x64                           854 ns          854 ns       821232 bytes_per_second=35.74Gi/s
detail::roll_2d_sequential/256x256                  30787 ns        30784 ns        22481 bytes_per_second=15.86Gi/s
detail::roll_2d_multi/256x256                       12725 ns        12724 ns        53445 bytes_per_second=38.37Gi/s
detail::roll_2d_sequential/1024x1024               963154 ns       963069 ns          736 bytes_per_second=8.11Gi/s
detail::roll_2d_multi/1024x1024                    419272 ns       419185 ns         1857 bytes_per_second=18.64Gi/s
detail::roll_3d_2axes_sequential/128x128x128     10321186 ns     10321178 ns           66 bytes_per_second=1.51Gi/s
detail::roll_3d_2axes_multi/128x128x128           1200420 ns      1200325 ns          561 bytes_per_second=13.02Gi/s
detail::roll_4d_4axes_sequential/16                 95978 ns        95957 ns         7380 bytes_per_second=5.09Gi/s
detail::roll_4d_4axes_multi/16                      26161 ns        26162 ns        26964 bytes_per_second=18.66Gi/s
detail::roll_4d_4axes_sequential/32              10419965 ns     10419994 ns           67 bytes_per_second=767.76Mi/s
detail::roll_4d_4axes_multi/32                     422750 ns       422731 ns         1665 bytes_per_second=18.48Gi/s
detail::roll_5d_5axes_sequential/16              11791085 ns     11790205 ns           61 bytes_per_second=678.53Mi/s
detail::roll_5d_5axes_multi/16                     508597 ns       508588 ns         1388 bytes_per_second=15.36Gi/s
detail::roll_5d_5axes_sequential/32             482957914 ns    482920008 ns            2 bytes_per_second=530.11Mi/s
detail::roll_5d_5axes_multi/32                  100002192 ns     99992129 ns            7 bytes_per_second=2.50Gi/s
detail::roll_3d_2axes_sequential/rgb_1080p       36946640 ns     36943762 ns           19 bytes_per_second=1.25Gi/s
detail::roll_3d_2axes_multi/rgb_1080p            17183395 ns     17182811 ns           39 bytes_per_second=2.70Gi/s
detail::roll_3d_2axes_sequential/rgb_4K         158555547 ns    158541395 ns            4 bytes_per_second=1.17Gi/s
detail::roll_3d_2axes_multi/rgb_4K               75991080 ns     75982624 ns            9 bytes_per_second=2.44Gi/s
detail::roll_3d_2axes_sequential/rgb_8K         606234407 ns    606221933 ns            1 bytes_per_second=1.22Gi/s
detail::roll_3d_2axes_multi/rgb_8K              294193380 ns    294178950 ns            2 bytes_per_second=2.52Gi/s

Performance Summary

The multi-axis implementation achieves significant speedups over sequential single-axis roll calls:

Scenario	Speedup
2D tensors	~2-3x
3D tensors (2-3 axes)	~2-10x
4D tensors (4 axes)	~4-25x
5D tensors (5 axes)	~5-20x
RGB images (1080p-8K)	~2x

The performance gain increases with the number of axes being rolled, as the multi-axis version avoids creating intermediate temporary arrays and performs the roll operation in a single pass through the data.

Dec 03 '25 11:12 f14XuanLv

feat: add multi-axis support for xt::roll (numpy.roll parity)

Checklist

Description

Commits

1. feat: add multi-axis support for xt::roll with optimized pointer arithmetic

2. docs: add multi-axis roll examples to documentation

3. bench: add benchmark for multi-axis roll

Build & Test

Performance Summary

1. `feat: add multi-axis support for xt::roll with optimized pointer arithmetic`

2. `docs: add multi-axis roll examples to documentation`

3. `bench: add benchmark for multi-axis roll`