Impossible to use the tutorials
Hi!
I am currently trying to understand how to use Triton via the tutorials. Unfortunately, I am encountering two different issues:
- For 03-matrix-multiplication.py and 06-fused-attention.py, I get:
python: /project/lib/Analysis/Utility.cpp:136: bool mlir::supportMMA(mlir::Value, int): Assertion `(version == 1 || version == 2) && "Unexpected MMA layout version found"' failed.
Aborted
The error seems to occur at the line
tl.store(c_ptrs, c, mask=c_mask)
Since my computer has a GTX 1080, I am working with the Pascal architecture; MMA is only supported on Volta and newer architectures. Nevertheless, is it possible to optimize the matmul for my GTX 1080?
- For 05-layer-norm.py, the error is:
Argument rematerialization not implemented
UNREACHABLE executed at /project/lib/Dialect/TritonGPU/Transforms/TritonGPUConversion.cpp:45!
Aborted
For this one, I don't have any clue...
Does anyone have any thoughts on my issues?
Thanks in advance and regards, Lucas.
FP16 is not supported on pre-tensor-core GPUs. Can you try FP32?
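(A minimal sketch of that change, assuming the matmul tutorial's test tensors; the names and shapes below may differ slightly between tutorial versions:)
import torch

# Allocate the tutorial's test inputs in float32 instead of float16,
# since pre-tensor-core GPUs have no FP16 MMA path.
a = torch.randn((512, 512), device='cuda', dtype=torch.float32)
b = torch.randn((512, 512), device='cuda', dtype=torch.float32)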
When using tl.float32, nothing changes; I get the same error.
If it's a pre-Volta GPU, we don't generate the MMA layout by any means.
So perhaps we shouldn't use an assert in places like:
https://github.com/openai/triton/blob/65e5a3bc24c9649d7a5e96acfc11e65bd3899fd6/lib/Analysis/Utility.cpp#L138
Feel free to modify the code and contribute.
> If it's a pre-Volta GPU, we don't generate the MMA layout by any means.
Thanks! I will take a look and see if I can find a way to avoid this issue and make a PR.
Any idea about my second issue, the one on Argument rematerialization?
> Any idea about my second issue, the one on Argument rematerialization?
Not sure how this problem is triggered yet.
We don't have pre-Volta GPUs to test things out, but we can provide some guidance if you're interested in debugging the issue. I think the main thing for layer norm would be to figure out why the codegen is any different for your 1080 than for a Volta GPU. All GPUs with compute capability <= 70 should be treated the same 🤔
Right, I see the main idea! I will give it a look, but since I am a newbie at this kind of stuff, I'm not sure I can go too deep, unfortunately...
I can confirm I am also getting this issue on an RTX A6000.
I also encounter the "Argument rematerialization not implemented" issue when running 05-layer-norm.py on an A100 80GB.
I am randomly (not every time) getting
Argument rematerialization not implemented
UNREACHABLE executed at /project/lib/Dialect/TritonGPU/Transforms/TritonGPUConversion.cpp:45!
when running a custom fused linear layer (it has activation, dropout, and scaling).
Edit: this was actually caused by the layer norm.
Hey @Dj1312, were you able to find a fix for this issue?
Hey @ptillet, I'm trying to debug this issue on my Pascal card. I have outlined my particular case in this issue: https://github.com/qwopqwop200/GPTQ-for-LLaMa/issues/142.
I've swapped the following lines (note this is off the v2.0.0 tag):
https://github.com/openai/triton/blob/bd5c2117f62c73a9e922d5e93353a39ab3ac269b/lib/Analysis/Utility.cpp#L136-L137
with the following:
if (version != 1 || version != 2)
  return false;
This results in the following error:
error: cannot be converted to LLVM IR: missing `LLVMTranslationDialectInterface` registration for dialect for op: builtin.unrealized_conversion_cast
Failed to emit LLVM IR
Translate to LLVM IR failed
LLVM ERROR: Failed to translate TritonGPU to LLVM IR.
Do you have any suggestions?
> Hey @Dj1312, were you able to find a fix for this issue?
Unfortunately, no...
"Argument rematerialization not implemented" is probably a regression because the tutorials work for me on version 2.0.0.dev20221105 with CUDA 11.8.
Our docs build runs nightly without issues on an A100. It's possible there are some issues on older GPUs, unfortunately. I don't have a Pascal GPU I can use, so it's hard for me to repro.
Just to add: I think people are getting this error from the pip install, as that version crashes when doing
x = torch.randn(512).cuda()
ln = FusedLayerNorm(512).cuda()
y = ln(x)
l = y.sum()
l.backward()  # crash
on an A100 (CUDA 11.8, torch 2.0.0+cu118, triton 2.0.0). (FusedLayerNorm uses this and code from the tutorial.)
It's not clear to me how to get the nightly without compiling the code (which, if I'm understanding my compilation error correctly, requires a more recent version of C++).
Nightly will be back up soon. Thanks for your patience! In the meantime recompiling the code shouldn't be too difficult
pip install triton==2.0.0.dev20230217 works on V100
I tried the tutorials on my GTX 970, and didn't get very far. I'm testing on latest main (commit dd2d5f417).
03-matrix-multiplication.py, 06-fused-attention.py, and 08-experimental-block-pointer.py (duplicate lines omitted)
error: invalid element type in packLLEElements. Expected 'f32' but got 'f16'
error: 'llvm.intr.fmuladd' op requires the same type for all operands and results
Pass execution failed
LLVM ERROR: Failed to translate TritonGPU to LLVM IR.
05-layer-norm.py
Traceback (most recent call last):
  File "/home/cebtenzzre/src/clones/triton/python/tutorials/05-layer-norm.py", line 367, in <module>
    test_layer_norm(1151, 8192, torch.float16)
  File "/home/cebtenzzre/src/clones/triton/python/tutorials/05-layer-norm.py", line 310, in test_layer_norm
    y_tri.backward(dy, retain_graph=True)
  File "/usr/lib/python3.11/site-packages/torch/_tensor.py", line 487, in backward
    torch.autograd.backward(
  File "/usr/lib/python3.11/site-packages/torch/autograd/__init__.py", line 200, in backward
    Variable._execution_engine.run_backward( # Calls into the C++ engine to run the backward pass
  File "/usr/lib/python3.11/site-packages/torch/autograd/function.py", line 274, in apply
    return user_fn(self, *args)
           ^^^^^^^^^^^^^^^^^^^^
  File "/home/cebtenzzre/src/clones/triton/python/tutorials/05-layer-norm.py", line 281, in backward
    _layer_norm_bwd_dx_fused[(M,)](dx, dy, _dw, _db, x, w, b, m, v, locks,
  File "<string>", line 42, in _layer_norm_bwd_dx_fused
  File "/home/cebtenzzre/src/clones/triton/python/triton/compiler/compiler.py", line 465, in compile
    next_module = compile_kernel(module)
                  ^^^^^^^^^^^^^^^^^^^^^^
  File "/home/cebtenzzre/src/clones/triton/python/triton/compiler/compiler.py", line 361, in <lambda>
    lambda src: ptx_to_cubin(src, arch))
                ^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/cebtenzzre/src/clones/triton/python/triton/compiler/compiler.py", line 160, in ptx_to_cubin
    return _triton.compile_ptx_to_cubin(ptx, ptxas, arch)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
RuntimeError: Internal Triton PTX codegen error:
ptxas /tmp/compile-ptx-src-b7492e, line 1370; error : Feature 'scopes on atomic operations' requires .target sm_60 or higher
ptxas /tmp/compile-ptx-src-b7492e, line 1466; error : Feature 'scopes on atomic operations' requires .target sm_60 or higher
ptxas fatal : Ptx assembly aborted due to errors
Is there a nightly wheel available somewhere?
I modified the code as follows and it works.
# First store doesn't accumulate
if count == 0:
    tl.atomic_xchg(Count, 1)
else:
    # partial_dw += tl.load(DW, mask=mask)
    # partial_db += tl.load(DB, mask=mask)
# ignore the condition of count == 0
partial_dw += tl.load(DW, mask=mask)
partial_db += tl.load(DB, mask=mask)
tl.store(DW, partial_dw, mask=mask)
tl.store(DB, partial_db, mask=mask
Maybe this condition triggers something.
@mikegreen7892003 That will throw an IndentationError; you either need a 'pass' in the else block or you need to comment out the else clause entirely. Also, you're missing a closing parenthesis.
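For readers trying the same workaround, here is a version of that fragment which at least parses, following the second option above (dropping the else clause entirely). This is only a kernel fragment: count, partial_dw, partial_db, DW, DB, Count, and mask are assumed to come from the tutorial's _layer_norm_bwd_dx_fused kernel.
# Kernel fragment only; the surrounding lock-protected region of the
# tutorial's _layer_norm_bwd_dx_fused kernel is assumed.
# First store doesn't accumulate
if count == 0:
    tl.atomic_xchg(Count, 1)
# Ignore the count == 0 condition and always accumulate.
partial_dw += tl.load(DW, mask=mask)
partial_db += tl.load(DB, mask=mask)
tl.store(DW, partial_dw, mask=mask)
tl.store(DB, partial_db, mask=mask)
Note that always accumulating relies on DW and DB being zero-initialized before the first store, which the original count == 0 branch avoided assuming, so this sidesteps the crash but is not necessarily numerically safe.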
> I tried the tutorials on my GTX 970, and didn't get very far. I'm testing on latest main (commit https://github.com/openai/triton/commit/dd2d5f417f5e40a6b6e53af8666e68565cbb823c).
> error: invalid element type in packLLEElements. Expected 'f32' but got 'f16'
> error: 'llvm.intr.fmuladd' op requires the same type for all operands and results
> Pass execution failed
> LLVM ERROR: Failed to translate TritonGPU to LLVM IR.
@cebtenzzre I believe this is because your GPU does not support operating on float16 inputs.
Try to edit the tutorial code to use float32 instead. In the matmul tutorial you will also have to edit the autotuning configs to reduce the num_stages values and probably the group sizes to not go above the maximum shared memory limit of the hardware.
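As a rough illustration of that suggestion, the autotuner's search space could be trimmed along these lines; the block sizes, num_stages, and num_warps below are illustrative guesses rather than tuned values, and the meta-parameter names are the ones used in the matmul tutorial:
import triton

# Smaller tiles and fewer pipeline stages so the autotuner does not request
# more shared memory than older GPUs provide. Values are illustrative only.
configs = [
    triton.Config({'BLOCK_SIZE_M': 64, 'BLOCK_SIZE_N': 64,
                   'BLOCK_SIZE_K': 32, 'GROUP_SIZE_M': 8},
                  num_stages=2, num_warps=4),
    triton.Config({'BLOCK_SIZE_M': 32, 'BLOCK_SIZE_N': 32,
                   'BLOCK_SIZE_K': 32, 'GROUP_SIZE_M': 8},
                  num_stages=2, num_warps=2),
]
# This list would replace the configs passed to @triton.autotune in the tutorial.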
Note for Triton developers: instead of crashing with a low-level error message for unsupported dtypes, it would be more user-friendly to raise a Python-level exception earlier with a higher-level error message.
At the moment, I get the following on a GTX 1080 Ti:
loc(fused["/home/ogrisel/code/triton-sandbox/matmul.py":72:23, "/home/ogrisel/code/triton-sandbox/matmul.py":72:33]): error: invalid element type in packLLEElements. Expected 'f32' but got 'f16'
loc(fused["/home/ogrisel/code/triton-sandbox/matmul.py":72:23, "/home/ogrisel/code/triton-sandbox/matmul.py":72:33]): error: invalid element type in packLLEElements. Expected 'f32' but got 'f16'
[...] # repeated many times, then:
loc(fused["/home/ogrisel/code/triton-sandbox/matmul.py":72:23, "/home/ogrisel/code/triton-sandbox/matmul.py":72:33]): error: 'llvm.intr.fmuladd' op requires the same type for all operands and results
Pass execution failed
LLVM ERROR: Failed to translate TritonGPU to LLVM IR.
Aborted (core dumped)
I am not sure how to inspect which dtypes are supported by a given device, though. I had a look at https://pytorch.org/docs/stable/cuda.html, but the only thing I see would be to manually map the compute capability tuple to a list of supported dtypes.
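One rough sketch of that mapping (a coarse heuristic, not an official support matrix: FP16 storage exists on older GPUs, but fast FP16 matmul via tensor cores needs compute capability 7.0 or higher):
import torch

# Pick a dtype for the tutorial inputs based on the device's compute capability.
major, minor = torch.cuda.get_device_capability()
dtype = torch.float16 if (major, minor) >= (7, 0) else torch.float32
print(f"compute capability {major}.{minor} -> using {dtype}")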