Yunjie
Repro (although it may not be a minimal one): [conv_relu_conv_relu_float16.py](https://gist.github.com/pyjhzwh/d9666e36ec7bd7963a0252ddb9351fbc#file-conv_relu_conv_relu_float16-py), [conv_relu_conv_relu_float32.py](https://gist.github.com/pyjhzwh/d9666e36ec7bd7963a0252ddb9351fbc#file-conv_relu_conv_relu_float32-py). `call()` does the forward computation of:

```
torch.nn.Sequential(
    nn.Conv2d(3, 64, kernel_size=11, stride=4, padding=2),
    nn.ReLU(inplace=True),
    nn.Conv2d(64, 192, kernel_size=5, padding=2),
    nn.ReLU(inplace=True),
)
```

...
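For context, a minimal sketch of how a repro like this is typically driven. The exact harness in the gists is not shown here; the use of `torch.compile`, the input shape, and the tolerance check below are assumptions:

```
import torch
import torch.nn as nn

# Hypothetical harness; the AlexNet-style input shape and the torch.compile
# call are assumptions, not the gists' exact code.
model = nn.Sequential(
    nn.Conv2d(3, 64, kernel_size=11, stride=4, padding=2),
    nn.ReLU(inplace=True),
    nn.Conv2d(64, 192, kernel_size=5, padding=2),
    nn.ReLU(inplace=True),
).cuda().half()

x = torch.randn(1, 3, 224, 224, device="cuda", dtype=torch.float16)
eager_out = model(x)                      # float16 eager reference
compiled_out = torch.compile(model)(x)   # inductor generates call()
print(torch.allclose(eager_out, compiled_out, atol=1e-2, rtol=1e-2))
```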
repro: https://gist.github.com/pyjhzwh/a19de7882aff600ee4472398b3017758

kernel0 basically does a matmul, multiplies the result by 1.0, then stores it back to the output buffer. buf1 is the output buffer of kernel0; given the config...
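The multiply-by-1.0 epilogue looks roughly like the following. This is a hypothetical sketch of the pattern the report describes, not the actual generated kernel; the pointer names and block-size parameter are made up:

```
import triton
import triton.language as tl

@triton.jit
def epilogue_sketch(acc_ptr, out_ptr, xnumel, XBLOCK: tl.constexpr):
    # Hypothetical epilogue shape: load the matmul accumulator, multiply by
    # 1.0 (a no-op the compiler could drop), and store to the output buffer.
    xoffset = tl.program_id(0) * XBLOCK
    xindex = xoffset + tl.arange(0, XBLOCK)
    xmask = xindex < xnumel
    acc = tl.load(acc_ptr + xindex, mask=xmask, other=0.0)
    acc = acc * 1.0  # the redundant multiply described above
    tl.store(out_ptr + xindex, acc, mask=xmask)
```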
repro: https://gist.github.com/pyjhzwh/2ba871a53c2eac6575948467317bafa1

```
matrix_x00 = tl.load(x00_ptrs, mask=mask_x00, other=0.)
matrix_x01 = tl.load(x01_ptrs, mask=mask_x01, other=0.)
```

where x00_ptrs and x01_ptrs are the same, and mask_x00 and mask_x01 are the same. But it will...
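Assuming the (truncated) complaint is that two identical loads are emitted, common-subexpression elimination would reduce the pair to a single load, along these lines:

```
# Sketch of the deduplicated form, assuming the issue is the redundant load:
matrix_x00 = tl.load(x00_ptrs, mask=mask_x00, other=0.)
matrix_x01 = matrix_x00  # identical pointers and masks, so one load suffices
```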
`map::at error` Repro:

```
import torch
import triton
from torch import empty_strided, as_strided
import triton.language as tl

@triton.jit
def kernel0(in_ptr0, out_ptr0, ks0, xnumel, rnumel, XBLOCK : tl.constexpr, RBLOCK : tl.constexpr):
    ...
```