
Impossible to use the tutorials

Open lucasgrjn opened this issue 2 years ago • 26 comments

Hi!

I am currently trying to understand how to use Triton via the tutorials. Unfortunately, I encounter two different issues:

  • for 03-matrix-multiplication.py and 06-fused-attention.py, I get:
python: /project/lib/Analysis/Utility.cpp:136: bool mlir::supportMMA(mlir::Value, int): Assertion `(version == 1 || version == 2) && "Unexpected MMA layout version found"' failed.
Aborted

The error seems to occur at the line

tl.store(c_ptrs, c, mask=c_mask)

Since I have a GTX 1080 in my computer, I am working with the Pascal architecture. MMA is only supported from Volta onward. Nevertheless, is it possible to optimize the matmul for my GTX 1080?

  • for 05-layer-norm.py, the error is
Argument rematerialization not implemented

UNREACHABLE executed at /project/lib/Dialect/TritonGPU/Transforms/TritonGPUConversion.cpp:45!
Aborted

For this one, I don't have any clue...

Does anyone have any thoughts on my issues?

Thanks in advance and regards, Lucas.

lucasgrjn avatar Mar 02 '23 22:03 lucasgrjn

FP16 is not supported on pre-tensor-core GPUs. Can you try FP32?
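
For example, something along these lines in the matmul tutorial (just a sketch; matmul is the wrapper defined in 03-matrix-multiplication.py, and the kernel's final accumulator-to-fp16 cast would probably need to become fp32 as well):

import torch

# Hypothetical change to 03-matrix-multiplication.py: allocate FP32 inputs so
# that no tensor-core (MMA) path is needed on pre-Volta hardware.
a = torch.randn((512, 512), device='cuda', dtype=torch.float32)
b = torch.randn((512, 512), device='cuda', dtype=torch.float32)
c_triton = matmul(a, b)  # matmul is the wrapper from the tutorial
c_torch = torch.matmul(a, b)
print(f"max abs diff: {(c_triton - c_torch).abs().max().item()}")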

ptillet avatar Mar 02 '23 22:03 ptillet

When using tl.float32, nothing changes, I get the same error

lucasgrjn avatar Mar 02 '23 22:03 lucasgrjn

If it's a pre-Volta GPU, we don't generate the MMA layout at all.

So perhaps we shouldn't use assert in places like:

https://github.com/openai/triton/blob/65e5a3bc24c9649d7a5e96acfc11e65bd3899fd6/lib/Analysis/Utility.cpp#L138

Feel free to modify the code and contribute.

Jokeren avatar Mar 02 '23 23:03 Jokeren

If it's a pre-Volta GPU, we don't generate the MMA layout at all.

Thanks! I will take a look and see if I can find a way to avoid this issue and make a PR.

Any idea for my second issue on Argument rematerialization?

lucasgrjn avatar Mar 03 '23 10:03 lucasgrjn

Any idea for my second issue on Argument rematerialization?

Not sure how this problem is triggered yet.

Jokeren avatar Mar 03 '23 16:03 Jokeren

We don't have pre-Volta GPUs to test things out, but we can provide some guidance if you're interested in debugging the issue. I think the main thing for layer norm would be to figure out why the codegen is any different for your 1080 than for a Volta GPU. All GPUs with compute capability <= 70 should be treated the same 🤔

ptillet avatar Mar 03 '23 16:03 ptillet

Right, I see the main idea! I will give it a look, but since I am a newbie at this kind of stuff, I'm not sure I can go too deep, unfortunately...

lucasgrjn avatar Mar 05 '23 15:03 lucasgrjn

I can confirm I am also getting this issue on an RTX A6000.

andreicozma1 avatar Mar 06 '23 21:03 andreicozma1

I also encounter the issue "Argument rematerialization not implemented" when running 05-layer-norm.py on an A100 80GB.

s-JoL avatar Mar 24 '23 11:03 s-JoL

Randomly (not every time) getting

Argument rematerialization not implemented
UNREACHABLE executed at /project/lib/Dialect/TritonGPU/Transforms/TritonGPUConversion.cpp:45!

when running a custom fused linear layer (it has activation, dropout, and scaling).

edit: this was actually because of the layernorm

RuABraun avatar Mar 28 '23 01:03 RuABraun

Hey @Dj1312 were you able to find a fix for this issue?

clxyder avatar Apr 08 '23 14:04 clxyder

Hey @ptillet, I'm trying to debug this issue on my Pascal card. I have outlined my particular case in this issue: https://github.com/qwopqwop200/GPTQ-for-LLaMa/issues/142.

I've swapped the following lines (note this is off of the v2.0.0 tag):

https://github.com/openai/triton/blob/bd5c2117f62c73a9e922d5e93353a39ab3ac269b/lib/Analysis/Utility.cpp#L136-L137

with the following:

// bail out instead of asserting when the MMA version is unsupported
if (version != 1 && version != 2)
    return false;

This results in the following error:

error: cannot be converted to LLVM IR: missing `LLVMTranslationDialectInterface` registration for dialect for op: builtin.unrealized_conversion_cast
Failed to emit LLVM IR
Translate to LLVM IR failedLLVM ERROR: Failed to translate TritonGPU to LLVM IR.

Do you have any suggestions?

clxyder avatar Apr 11 '23 04:04 clxyder

Hey @Dj1312 were you able to find a fix for this issue?

Unfortunately, no...

lucasgrjn avatar Apr 11 '23 12:04 lucasgrjn

"Argument rematerialization not implemented" is probably a regression because the tutorials work for me on version 2.0.0.dev20221105 with CUDA 11.8.

vmarkovtsev avatar May 10 '23 11:05 vmarkovtsev

Our docs build runs nightly without issues on an A100. It's possible there is some trouble on older GPUs, unfortunately. I don't have a Pascal GPU I can use, so it's hard for me to repro.

ptillet avatar May 11 '23 06:05 ptillet

Just to add: I think people are getting this error from the pip-installed version, as that version crashes when doing

import torch

x = torch.randn(512).cuda()
ln = FusedLayerNorm(512).cuda()
y = ln(x)
l = y.sum()
l.backward()  # crash

on an A100 (cuda 11.8, torch 2.0.0+cu118, triton 2.0.0) (FusedLayerNorm uses this and code from the tutorial)

It's not clear to me how to get the nightly without compiling the code (which, if I'm understanding my compilation error correctly, requires a newer C++ standard than my compiler supports).

RuABraun avatar May 12 '23 04:05 RuABraun

Nightly builds will be back up soon. Thanks for your patience! In the meantime, recompiling the code shouldn't be too difficult.

ptillet avatar May 13 '23 05:05 ptillet

pip install triton==2.0.0.dev20230217 works on V100

cszipper avatar May 16 '23 12:05 cszipper

I tried the tutorials on my GTX 970, and didn't get very far. I'm testing on latest main (commit dd2d5f417).

03-matrix-multiplication.py, 06-fused-attention.py, and 08-experimental-block-pointer.py (duplicate lines omitted)

error: invalid element type in packLLEElements. Expected 'f32' but got 'f16'
error: 'llvm.intr.fmuladd' op requires the same type for all operands and results
Pass execution failedLLVM ERROR: Failed to translate TritonGPU to LLVM IR.

05-layer-norm.py

Traceback (most recent call last):
  File "/home/cebtenzzre/src/clones/triton/python/tutorials/05-layer-norm.py", line 367, in <module>
    test_layer_norm(1151, 8192, torch.float16)
  File "/home/cebtenzzre/src/clones/triton/python/tutorials/05-layer-norm.py", line 310, in test_layer_norm
    y_tri.backward(dy, retain_graph=True)
  File "/usr/lib/python3.11/site-packages/torch/_tensor.py", line 487, in backward
    torch.autograd.backward(
  File "/usr/lib/python3.11/site-packages/torch/autograd/__init__.py", line 200, in backward
    Variable._execution_engine.run_backward(  # Calls into the C++ engine to run the backward pass
  File "/usr/lib/python3.11/site-packages/torch/autograd/function.py", line 274, in apply
    return user_fn(self, *args)
           ^^^^^^^^^^^^^^^^^^^^
  File "/home/cebtenzzre/src/clones/triton/python/tutorials/05-layer-norm.py", line 281, in backward
    _layer_norm_bwd_dx_fused[(M,)](dx, dy, _dw, _db, x, w, b, m, v, locks,
  File "<string>", line 42, in _layer_norm_bwd_dx_fused
  File "/home/cebtenzzre/src/clones/triton/python/triton/compiler/compiler.py", line 465, in compile
    next_module = compile_kernel(module)
                  ^^^^^^^^^^^^^^^^^^^^^^
  File "/home/cebtenzzre/src/clones/triton/python/triton/compiler/compiler.py", line 361, in <lambda>
    lambda src: ptx_to_cubin(src, arch))
                ^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/cebtenzzre/src/clones/triton/python/triton/compiler/compiler.py", line 160, in ptx_to_cubin
    return _triton.compile_ptx_to_cubin(ptx, ptxas, arch)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
RuntimeError: Internal Triton PTX codegen error: 
ptxas /tmp/compile-ptx-src-b7492e, line 1370; error   : Feature 'scopes on atomic operations' requires .target sm_60 or higher
ptxas /tmp/compile-ptx-src-b7492e, line 1466; error   : Feature 'scopes on atomic operations' requires .target sm_60 or higher
ptxas fatal   : Ptx assembly aborted due to errors

cebtenzzre avatar Jun 02 '23 19:06 cebtenzzre

Is there a nightly wheel available somewhere?

RuABraun avatar Jul 06 '23 14:07 RuABraun

I modified the code as follows and it works.

# First store doesn't accumulate
if count == 0:
    tl.atomic_xchg(Count, 1)
else:
    # partial_dw += tl.load(DW, mask=mask)
    # partial_db += tl.load(DB, mask=mask)

# ignore the condition of count == 0 
partial_dw += tl.load(DW, mask=mask)
partial_db += tl.load(DB, mask=mask)

tl.store(DW, partial_dw, mask=mask)
tl.store(DB, partial_db, mask=mask

Maybe this condition triggers something.

mikegreen7892003 avatar Aug 08 '23 04:08 mikegreen7892003

@mikegreen7892003 That will throw an IndentationError; you either need a 'pass' in the else block or you need to comment out the else clause entirely. Also, you're missing a closing parenthesis.
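
For reference, here is roughly what I think was intended, with valid syntax (a sketch only, using the variable names from the tutorial's _layer_norm_bwd_dx_fused kernel): keep the tl.atomic_xchg, drop the else branch, and always accumulate.

count = tl.load(Count)
if count == 0:
    # First writer still flips the flag, but no longer skips the accumulation.
    tl.atomic_xchg(Count, 1)
# Always accumulate, regardless of count.
partial_dw += tl.load(DW, mask=mask)
partial_db += tl.load(DB, mask=mask)
tl.store(DW, partial_dw, mask=mask)
tl.store(DB, partial_db, mask=mask)

Note that if the first store also accumulates, the _dw/_db buffers would have to be zero-initialized on the host side, since (if I remember the tutorial correctly) they are allocated with torch.empty.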

cebtenzzre avatar Aug 08 '23 14:08 cebtenzzre

I tried the tutorials on my GTX 970, and didn't get very far. I'm testing on latest main (commit https://github.com/openai/triton/commit/dd2d5f417f5e40a6b6e53af8666e68565cbb823c).

error: invalid element type in packLLEElements. Expected 'f32' but got 'f16'
error: 'llvm.intr.fmuladd' op requires the same type for all operands and results
Pass execution failedLLVM ERROR: Failed to translate TritonGPU to LLVM IR.

@cebtenzzre I believe this is because your GPU does not support operating on float16 inputs.

Try to edit the tutorial code to use float32 instead. In the matmul tutorial you will also have to edit the autotuning configs to reduce the num_stages values and probably the group sizes to not go above the maximum shared memory limit of the hardware.
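
For example, something like this as a replacement for the configs=[...] list passed to @triton.autotune in 03-matrix-multiplication.py (the values below are just an illustration, not tuned for any particular card):

import triton

# Illustrative (untested) configs for older GPUs: fewer pipeline stages and
# smaller blocks to stay under the hardware's shared-memory limit.
configs = [
    triton.Config({'BLOCK_SIZE_M': 64, 'BLOCK_SIZE_N': 64, 'BLOCK_SIZE_K': 32,
                   'GROUP_SIZE_M': 8}, num_stages=2, num_warps=4),
    triton.Config({'BLOCK_SIZE_M': 32, 'BLOCK_SIZE_N': 32, 'BLOCK_SIZE_K': 32,
                   'GROUP_SIZE_M': 8}, num_stages=2, num_warps=2),
]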

Note for Triton developers: instead of crashing with a low-level error message for unsupported dtypes, it would be more user-friendly to raise a Python-level exception earlier with a higher-level error message.

At the moment I get on a GTX 1080 TI:

loc(fused["/home/ogrisel/code/triton-sandbox/matmul.py":72:23, "/home/ogrisel/code/triton-sandbox/matmul.py":72:33]): error: invalid element type in packLLEElements. Expected 'f32' but got 'f16'
loc(fused["/home/ogrisel/code/triton-sandbox/matmul.py":72:23, "/home/ogrisel/code/triton-sandbox/matmul.py":72:33]): error: invalid element type in packLLEElements. Expected 'f32' but got 'f16'
[...]  # repeated many times, then:
loc(fused["/home/ogrisel/code/triton-sandbox/matmul.py":72:23, "/home/ogrisel/code/triton-sandbox/matmul.py":72:33]): error: 'llvm.intr.fmuladd' op requires the same type for all operands and results
Pass execution failedLLVM ERROR: Failed to translate TritonGPU to LLVM IR.
Aborted (core dumped)

I am not sure how to inspect which dtypes are supported by a given device, though. I had a look at https://pytorch.org/docs/stable/cuda.html but the only thing I see would be to manually map the compute capability tuple to a list of supported dtypes.
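
Something like the snippet below is the best I can come up with; the capability thresholds are my own rough assumptions, not an official list:

import torch

# Rough, hand-written mapping from compute capability to dtypes one might
# reasonably target with Triton; the thresholds below are assumptions.
def assumed_supported_dtypes(device=0):
    major, minor = torch.cuda.get_device_capability(device)
    cc = major * 10 + minor
    dtypes = [torch.float32]
    if cc >= 60:   # Pascal and newer: usable fp16 arithmetic
        dtypes.append(torch.float16)
    if cc >= 80:   # Ampere and newer: bf16 support
        dtypes.append(torch.bfloat16)
    return dtypes

print(torch.cuda.get_device_capability(0))  # e.g. (6, 1) on a GTX 1080 Ti
print(assumed_supported_dtypes(0))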

ogrisel avatar Dec 12 '23 07:12 ogrisel