[Query] The refback-munge-calling-conventions pass used in the linalg-to-LLVM backend causes a runtime error

Open · Abhishek-TyRnT opened this issue 1 year ago · 6 comments

I was running the Conv1d_transpose op test case, lowering it from linalg-on-tensors to LLVM IR. I noticed that this test is marked as a crashing test case in the xfail sets: https://github.com/llvm/torch-mlir/blob/3180704b1470c047faca5fb64d285cda2a287818/projects/pt1/e2e_testing/xfail_sets.py#L50-L54

I tried to debug this issue, and while investigating I found the following runtime error from LLVM:

"ERROR: Runtime op verification failed\0A%9 = \22memref.cast\22(%arg0) : (memref<*xf32>) -> memref<?x?x?xf32>\0A^ rank mismatch\0ALocation: loc(unknown)"

As you can see, memref.cast is trying to cast an unranked memref to a ranked memref, which is not allowed by the generate-runtime-verification pass, called here:

https://github.com/llvm/torch-mlir/blob/3180704b1470c047faca5fb64d285cda2a287818/projects/pt1/python/torch_mlir_e2e_test/linalg_on_tensors_backends/refbackend.py#L184-L187

The cast above appears after refback-munge-calling-conventions runs, since that pass replaces the ranked memref arguments with unranked ones. My question is: why is this pass necessary? It will clearly fail runtime verification, since there is no way to know at runtime what input we are getting and what its rank is. Am I missing something here? Please let me know.

Abhishek-TyRnT, Sep 01 '24 11:09

Hi @Abhishek-TyRnT, the support for this was added here https://github.com/llvm/torch-mlir/pull/3615 by @mgehre-amd. Please refer to this PR and ask the author if you have any issues.

vivekkhandelwal1, Sep 10 '24 10:09

Is the question about refback-munge-calling-conventions or generate-runtime-verification?

mgehre-amd, Sep 10 '24 11:09

@mgehre-amd, the question was originally about refback-munge-calling-conventions, but since generate-runtime-verification was added later, I am curious why that pass was added, given that refback-munge-calling-conventions eliminates the rank information for the above operation, which the generate-runtime-verification pass doesn't allow.

Abhishek-TyRnT, Sep 10 '24 15:09

generate-runtime-verification does allow memref.cast from unranked to ranked memrefs, but it checks at runtime that the rank of the destination matches the dynamic rank of the input memref.
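A minimal Python model of that check may help with the original question of how the rank can be known at runtime. It assumes a simplified unranked memref that stores only its rank and sizes; `UnrankedMemRef` and `checked_cast` are hypothetical names, not torch-mlir or MLIR APIs, and the real check is emitted as IR by the pass rather than written in Python.

```python
from dataclasses import dataclass
from typing import Tuple


@dataclass(frozen=True)
class UnrankedMemRef:
    """Hypothetical stand-in for a memref<*xf32> argument.

    The static type has no rank, but the runtime descriptor still records it.
    """
    rank: int
    sizes: Tuple[int, ...]

    def __post_init__(self) -> None:
        # Keep the model consistent: one size per dimension.
        if len(self.sizes) != self.rank:
            raise ValueError("sizes must have exactly `rank` entries")


def checked_cast(arg: UnrankedMemRef, dest_rank: int) -> Tuple[int, ...]:
    """Model of the runtime guard before a memref<*xf32> -> ranked memref.cast."""
    if arg.rank != dest_rank:
        # Corresponds to the "rank mismatch" failure reported in the issue.
        raise RuntimeError(
            f"rank mismatch: got rank {arg.rank}, expected {dest_rank}"
        )
    return arg.sizes


# A rank-3 input passes a cast to memref<?x?x?xf32> (dest_rank=3).
print(checked_cast(UnrankedMemRef(3, (1, 4, 8)), dest_rank=3))  # (1, 4, 8)

# Any other rank fails the same check.
try:
    checked_cast(UnrankedMemRef(2, (4, 8)), dest_rank=3)
except RuntimeError as err:
    print(err)  # rank mismatch: got rank 2, expected 3
```

The model's point is that the unranked type erases the rank only statically: the descriptor passed at runtime still carries it, so a cast whose destination rank disagrees with the actual argument is caught instead of being silently accepted.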

Otherwise the memref.cast op is invalid (see https://mlir.llvm.org/docs/Dialects/MemRef/#memrefcast-memrefcastop), which seems to be happening for this test case.
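The static side of the same rule can be sketched as a simplified compatibility check. This is a hypothetical Python model loosely following the memref.cast documentation linked above, restricted to a ranked destination; it ignores element type, address space, and layout (assumed equal), and represents a dynamic size `?` and an unranked source both as `None`. `cast_compatible` is an illustrative name, not an MLIR function.

```python
from typing import Optional, Sequence

# Hypothetical shape model: a ranked memref is a sequence of dims where
# None means a dynamic size ('?'); an unranked memref (memref<*xf32>) is None.
Dims = Sequence[Optional[int]]


def cast_compatible(src: Optional[Dims], dst: Dims) -> bool:
    """Simplified static check for memref.cast to a ranked destination."""
    if src is None:
        # Unranked -> ranked passes static verification; the rank can
        # only be checked at runtime.
        return True
    if len(src) != len(dst):
        # Ranked -> ranked requires the same rank.
        return False
    # Where both sides are static, the sizes must agree.
    return all(a is None or b is None or a == b for a, b in zip(src, dst))


print(cast_compatible(None, [None, None, None]))  # True
print(cast_compatible([2, None], [None, 3]))      # True
print(cast_compatible([2, 3], [2, 3, 4]))         # False
print(cast_compatible([2, 3], [2, 4]))            # False
```

Because the unranked case is accepted statically, the only place a rank mismatch like the one in this test can surface is the runtime check.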

generate-runtime-verification doesn't complain about other test cases, and those also run refback-munge-calling-conventions.

I think the Conv_Transpose1d lowering needs to be fixed, but I have not looked into the details of it.

mgehre-amd, Sep 12 '24 07:09

If the issue is with the Conv_Transpose1d lowering, it would be better to file a separate issue for it. @Abhishek-TyRnT, if you need that support, please file one.

vivekkhandelwal1, Sep 13 '24 13:09

I was hoping to fix it myself and send it as a contribution.

Abhishek-TyRnT, Sep 13 '24 13:09