
Test case from multi-headed self-attention tutorial fails

Open · Lucy7298 opened this issue 2 years ago • 1 comment

~~I am trying to run the test_op pytest from the fused attention tutorial (https://triton-lang.org/master/getting-started/tutorials/06-fused-attention.html) on an A100 with CUDA 11.4. The error is:~~

~~std::vector::reference std::vector<unsigned int>::operator[](std::vector::size_type) [_Tp = unsigned int, _Alloc = std::allocator<unsigned int>]: Assertion '__n < this->size()' failed~~

~~I tried applying the changes from this issue, but it did not help. I can make the error go away by applying this change:~~

  layout.cc:160
  + for (unsigned o : order_) {
  +   if (o >= max_contiguous.size()) {
  +     return;
  +   }
  + }
  if(max_contiguous.size() > 0){
    std::sort(order_.begin(), order_.end(), [&](unsigned a, unsigned b) {
      return max_contiguous[a] > max_contiguous[b];
    });

~~This change allows the test case to proceed without raising an error. However, the outputs of the self-attention are incorrect after applying this change.~~

I'm no longer seeing problems with the vector access, even after removing the change. However, there still seem to be some differences between the outputs of the Triton kernel and the PyTorch implementation:

  File "....numpy/testing/_private/utils.py", line 840, in assert_array_compare
    raise AssertionError(msg)
AssertionError: 
Arrays are not almost equal to 2 decimals

Mismatched elements: 397466 / 786432 (50.5%)
Max absolute difference: 1.103
Max relative difference: inf
 x: array([[[[-3.95e-01,  2.08e+00,  4.46e-01, ..., -1.49e+00,  9.86e-02,
           2.61e-01],
         [-1.43e-01,  9.78e-01,  5.16e-01, ..., -9.55e-01,  6.51e-01,...
 y: array([[[[-2.79e-01,  2.30e+00,  8.63e-01, ..., -1.17e+00,  8.92e-02,
           3.55e-01],
         [-9.31e-04,  1.32e+00,  5.63e-01, ..., -1.09e+00,  7.93e-01,...

I examined the output, and the differences between the two outputs are pretty small. If you compare with `torch.isclose(ref_out, tri_out, rtol=0.01, atol=0.001).all()`, the outputs do match. However, the gradients don't seem to be close. Have you tried to train a neural network on the tutorial implementation? Can it reach accuracy similar to the PyTorch implementation?
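
For reference, this is roughly the check I'm doing. This is a hypothetical helper, not part of the tutorial; `ref_out` and `tri_out` are the reference and Triton outputs produced inside `test_op`:

  import torch

  # Hypothetical helper, not from the tutorial: ref_out / tri_out are the
  # reference (PyTorch) and Triton outputs produced inside test_op.
  def outputs_match(ref_out: torch.Tensor, tri_out: torch.Tensor) -> bool:
      # Looser than the test's 2-decimal comparison; with these tolerances
      # every element of the forward output matches for me, but the same
      # check on the gradient tensors does not pass.
      return bool(torch.isclose(ref_out, tri_out, rtol=0.01, atol=0.001).all())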

Lucy7298 · Nov 30 '22 15:11

hmm, curious!

murphymatt · Nov 30 '22 20:11

@Lucy7298 which PyTorch implementation are you using? Just a brute-force computation of the attention output? The Triton impl uses float16, so I would expect diffs if you are comparing against a float32 implementation.
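
For example, here is a minimal sketch of what I mean. This is a naive attention reference, not the tutorial's exact code; `q`, `k`, `v`, and `sm_scale` are assumed to be the float16 inputs from `test_op`:

  import torch

  def ref_attention(q, k, v, sm_scale, dtype=torch.float32):
      # Naive attention reference (no causal mask). Upcasting to float32 before
      # the softmax gives slightly different results than doing everything in
      # float16 the way the fused Triton kernel does.
      q, k, v = q.to(dtype), k.to(dtype), v.to(dtype)
      p = torch.softmax(torch.matmul(q, k.transpose(-2, -1)) * sm_scale, dim=-1)
      return torch.matmul(p, v).to(torch.float16)

Comparing `ref_attention(q, k, v, sm_scale, dtype=torch.float16)` against the Triton output should give noticeably smaller diffs than the float32 reference.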

CHDev93 · Dec 08 '22 16:12