xtensor icon indicating copy to clipboard operation
xtensor copied to clipboard

[WIP] Add implementation to forward linear iterators in strided views

Open spectre-ns opened this issue 1 year ago • 6 comments

Checklist

  • [ ] The title and commit message(s) are descriptive.
  • [ ] Small commits made to fix your PR have been squashed to avoid history pollution.
  • [ ] Tests have been added for new features or bug fixes.
  • [ ] API of new functions and classes are documented.

Description

spectre-ns avatar Jul 11 '24 00:07 spectre-ns

@JohanMabille I'm trying to forward linear iterators where possible but I think the implementations expect steppers in certain places and there are some subtle differences I haven't figured out yet. I thought it was weird that the linear methods on the strided view actually return xiterators of steppers rather than linear iterators as the interface would suggest.

Test bench from issue #2734

Without Changes: case 1: 10.6235ms case 2a: 48.4154ms case 2b: 8.93694ms case 2c: 48.7691ms case 3a: 47.3316ms case 3b: 9.05212ms case 3c: 47.3312ms

With changes results in: case 1: 10.0266ms case 2a: 9.23594ms case 2b: 8.97092ms case 2c: 9.09802ms case 3a: 8.86476ms case 3b: 9.55695ms case 3c: 8.67326ms

The implementation is super rough but I wanted to show you the basic idea. I ran into a pesky issue with reshaped views that resulted in my having to force a fallback to the old implementation. Might need some help there but the performance improvements speak for themselves.

spectre-ns avatar Jul 11 '24 23:07 spectre-ns

Can a linear assignment be used when the inner stride is not one? It seems like the linear iterators increment by one but when doing conversion from row major to column major the strides between adjacent elements are non-unity.

spectre-ns avatar Jul 12 '24 12:07 spectre-ns

Can a linear assignment be used when the inner stride is not one? It seems like the linear iterators increment by one but when doing conversion from row major to column major the strides between adjacent elements are non-unity.

This issue here was trying to iterate on the object like a column major when the internal storage is row major. That requires a stepper... In these cases the implementation falls back to a stepper even though the underlying expression might be linearly iterable.

spectre-ns avatar Jul 12 '24 23:07 spectre-ns

See benchmark code below:

#include <chrono>
#include <xtensor/xrandom.hpp>
#include <xtensor/xtensor.hpp>
#include <xtensor/xnoalias.hpp>

double mean_milliseconds_from_total(std::chrono::nanoseconds total,
    size_t num_repeats) {
    std::chrono::duration<double, std::milli> total_ms = total;
    return total_ms.count() / (double)num_repeats;
}

int main() {
    size_t num_repeats = 100;
    xt::xtensor<double, 1> a = xt::random::rand<double>({ 10000000 });
    xt::xtensor<double, 1> b = xt::random::rand<double>({ 10000000 });
    xt::xtensor<double, 1> c = xt::zeros<double>({ 10000000 });

    // case 1: full tensor
    auto started = std::chrono::high_resolution_clock::now();
    for (size_t i = 0; i < num_repeats; ++i)
        xt::noalias(c) = a + b;
    auto finished = std::chrono::high_resolution_clock::now();
    std::cout << "case 1:  "
        << mean_milliseconds_from_total(finished - started, num_repeats)
        << "ms" << std::endl;

    // case 2a: view of tensor and expression with xt::all()
    started = std::chrono::high_resolution_clock::now();
    for (size_t i = 0; i < num_repeats; ++i)
        xt::noalias(xt::strided_view(c, { xt::all() })) = xt::strided_view(a + b, { xt::all() });
    finished = std::chrono::high_resolution_clock::now();
    std::cout << "case 2a: "
        << mean_milliseconds_from_total(finished - started, num_repeats)
        << "ms" << std::endl;

    // case 2b: view of only tensor with xt::all()
    started = std::chrono::high_resolution_clock::now();
    for (size_t i = 0; i < num_repeats; ++i)
        xt::noalias(xt::strided_view(c, { xt::all() })) = a + b;
    finished = std::chrono::high_resolution_clock::now();
    std::cout << "case 2b: "
        << mean_milliseconds_from_total(finished - started, num_repeats)
        << "ms" << std::endl;

    // case 2c: view of only expression with xt::all()
    started = std::chrono::high_resolution_clock::now();
    for (size_t i = 0; i < num_repeats; ++i)
        xt::noalias(c) = xt::strided_view(a + b, { xt::all() });
    finished = std::chrono::high_resolution_clock::now();
    std::cout << "case 2c: "
        << mean_milliseconds_from_total(finished - started, num_repeats)
        << "ms" << std::endl;

    // case 3a: view of tensor and expression with xt::range()
    started = std::chrono::high_resolution_clock::now();
    for (size_t i = 0; i < num_repeats; ++i)
        xt::noalias(xt::strided_view(c, { xt::range(0, c.size()) })) =
        xt::strided_view(a + b, { xt::range(0, c.size()) });
    finished = std::chrono::high_resolution_clock::now();
    std::cout << "case 3a: "
        << mean_milliseconds_from_total(finished - started, num_repeats)
        << "ms" << std::endl;

    // case 3b: view of only tensor with xt::range()
    started = std::chrono::high_resolution_clock::now();
    for (size_t i = 0; i < num_repeats; ++i)
        xt::noalias(xt::strided_view(c, { xt::range(0, c.size()) })) = a + b;
    finished = std::chrono::high_resolution_clock::now();
    std::cout << "case 3b: "
        << mean_milliseconds_from_total(finished - started, num_repeats)
        << "ms" << std::endl;

    // case 3c: view of only expression with xt::range()
    started = std::chrono::high_resolution_clock::now();
    for (size_t i = 0; i < num_repeats; ++i)
        xt::noalias(c) = xt::strided_view(a + b, { xt::range(0, c.size()) });
    finished = std::chrono::high_resolution_clock::now();
    std::cout << "case 3c: "
        << mean_milliseconds_from_total(finished - started, num_repeats)
        << "ms" << std::endl;
    return 0;
}

spectre-ns avatar Jul 13 '24 00:07 spectre-ns

Thanks a lot for tackling this! I think the linear / not linear iteration was supposed to be handled in the flat_storage_adaptor but we probably missed some cases.

I wonder if it would make sense to merge the new linear_flat_expression_adaptor with the existing flat_expression_adaptor. This would put the reponsibility of choosing between the linear iterator and the regular one in the flat adaptor rather than in the strided_view. Also It would probably benefit other classes using the flat_expression adaptor.

@JohanMabille I debated that as well. Let me play around with the logic and see if I can make this work without modifying the strided_view class and instead us the flat_expression_adaptor implementation. That would likely be cleaner. I was hoping to demonstrate the type of performance improvements before coming up with the final solution.

spectre-ns avatar Jul 19 '24 10:07 spectre-ns

Thanks a lot for tackling this! I think the linear / not linear iteration was supposed to be handled in the flat_storage_adaptor but we probably missed some cases.

I wonder if it would make sense to merge the new linear_flat_expression_adaptor with the existing flat_expression_adaptor. This would put the reponsibility of choosing between the linear iterator and the regular one in the flat adaptor rather than in the strided_view. Also It would probably benefit other classes using the flat_expression adaptor.

@JohanMabille combining them became difficult because the uvector implementation doesn't have the linear_iterator interface. The begin and end methods are linear iterators but the interface is not consistent. I'm not sure if this is intentional or not.

spectre-ns avatar Jul 21 '24 01:07 spectre-ns