
Core dumped when compiling GPT2

Open justinchuby opened this issue 1 year ago • 17 comments

./onnx-mlir --EmitONNXBasic /home/justinchu/dev/onnx/gpt2-dataprop.onnx 
/usr/include/c++/11/bits/stl_vector.h:1045: std::vector<_Tp, _Alloc>::reference std::vector<_Tp, _Alloc>::operator[](std::vector<_Tp, _Alloc>::size_type) [with _Tp = mlir::Type; _Alloc = std::allocator<mlir::Type>; std::vector<_Tp, _Alloc>::reference = mlir::Type&; std::vector<_Tp, _Alloc>::size_type = long unsigned int]: Assertion '__n < this->size()' failed.

I tested with other --Emit options and got the same error.

ONNX Model gpt2-dataprop.zip

justinchuby avatar May 03 '23 23:05 justinchuby

Is the model downloaded from the ONNX model zoo?

chentong319 avatar May 04 '23 17:05 chentong319

It is a torch-exported model I experimented with.

justinchuby avatar May 04 '23 18:05 justinchuby

I also got a core dump, on a Constant op. The output type is not correct. I will take a look.

chentong319 avatar May 09 '23 19:05 chentong319

I found the source of the error: it's in the TypeInferenceOpInterface implementation for ConstantOp.

chentong319 avatar May 09 '23 20:05 chentong319

I fixed the import issue with PR #2232, but I ran into another error related to a custom op:

loc("_aten_native_layer_norm_onnx_54"): error: 'onnx.Custom' op result #1 must be tensor of any type values or memref of any type values, but got 'none'
    %84:3 = "onnx.Custom"(%83, %2, %3) {axes = [-1], domain_name = "onnxscript.atenlib", eps = 9.99999974E-6 : f32, function_name = "_aten_native_layer_norm_onnx", onnx_node_name = "_aten_native_layer_norm_onnx_54"} : (tensor<*xf32>, tensor<2xf32>, tensor<2xf32>) -> (tensor<*xf32>, none, none)

I will fix this bug.

chentong319 avatar May 10 '23 04:05 chentong319

I fixed the NoneType problem for CustomOp, but there is another error in shape inference, caused by:

%42 = onnx.Constant {onnx_node_name = "Constant_12", value_ints = [-1]} : tensor<1xi64>

I remember writing a normalization for Constant ops, but I cannot find the code anywhere.

chentong319 avatar May 10 '23 05:05 chentong319

ConstantOp may have attributes other than sparse_value or value, for instance value_int, value_ints, etc. I remember I wrote a bunch of transformation rules to normalize those attributes into the value attribute (a DenseElementsAttr). This ONNX model has a Constant with value_ints, and the shape inference pass still ran into this attribute and hit an assertion error. @sorenlassen Since you worked a lot on constants, you may know where the constant normalization code is.

chentong319 avatar May 10 '23 19:05 chentong319

In the lit test onnx_canonicalization.mlir we have a value_ints example: https://github.com/onnx/onnx-mlir/blob/main/test/mlir/onnx/onnx_canonicalization.mlir#LL627-L635. If I put that in a standalone test_constant_3.mlir file:

func.func @test_constant_3() -> tensor<3xi64> {
  %0 = onnx.Constant {value_ints = [1, 2, 3] } : tensor<3xi64>
  return %0 : tensor<3xi64>
}

then it succeeds if I run onnx-mlir-opt --canonicalize --shape-inference test_constant_3.mlir, but fails if I omit --canonicalize.
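
For reference, after --canonicalize the attribute should be normalized to a dense value attribute, roughly like this (a sketch of the expected output, assuming the current custom printer for onnx.Constant):

func.func @test_constant_3() -> tensor<3xi64> {
  // value_ints has been normalized into a DenseElementsAttr value
  %0 = onnx.Constant dense<[1, 2, 3]> : tensor<3xi64>
  return %0 : tensor<3xi64>
}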

sorenlassen avatar May 10 '23 19:05 sorenlassen

maybe we should canonicalize before the first run of shape inference

I checked that the following change doesn't break any lit tests:

diff --git a/src/Compiler/CompilerPasses.cpp b/src/Compiler/CompilerPasses.cpp
index 5bcf5c70..7e0cd950 100644
--- a/src/Compiler/CompilerPasses.cpp
+++ b/src/Compiler/CompilerPasses.cpp
@@ -57,6 +57,7 @@ void addONNXToMLIRPasses(mlir::PassManager &pm, bool targetCPU) {
       std::make_unique<DisposableGarbageCollector>(pm.getContext()));

   pm.addNestedPass<func::FuncOp>(onnx_mlir::createDecomposeONNXToONNXPass());
+  pm.addPass(mlir::createCanonicalizerPass());
   if (enableONNXHybridPass) {
     // For starters only illustrating the new hybrid pass by replacing 3 passes
     // here. The plan is to replace most of the passes in addONNXToMLIRPasses.

but I haven't tested it on the gpt2 model

sorenlassen avatar May 10 '23 19:05 sorenlassen

I tried adding canonicalization before shape inference with the patch in the previous message, on top of the fixes in PR #2232, and now onnx-mlir gpt2-dataprop.onnx fails with these messages:

loc("Slice_108"): error: Axes must be known at compile time
loc("Slice_108"): error: Failed to scan parameters successfully
loc("Slice_108"): error: shape inference failed

sorenlassen avatar May 10 '23 20:05 sorenlassen

Thanks. This is a pass ordering problem. Will we have the problem in the new hybrid transformation? I will try to add a canonicalization pass before shape inference.

chentong319 avatar May 10 '23 20:05 chentong319

Will we have the problem in the new hybrid transformation?

good question

onnx-mlir --onnx-hybrid-pass gpt2-dataprop.onnx doesn't crash, even without the extra canonicalizer pass, but prints

loc("Constant_12"): error: Require exactly one of the two attributes, either value or sparse_value
loc("Constant_12"): error: 'onnx.Constant' op shape inference failed

the hybrid pass infers shapes before canonicalization: https://github.com/onnx/onnx-mlir/blob/main/src/Transform/ONNX/ShapeInference.cpp#L74-L79, which might be the wrong thing to do for this example

sorenlassen avatar May 10 '23 21:05 sorenlassen

I tried adding canonicalization before shape inference with the patch in the previous message, on top of the fixes in PR #2232, and now onnx-mlir gpt2-dataprop.onnx fails with these messages:

loc("Slice_108"): error: Axes must be known at compile time
loc("Slice_108"): error: Failed to scan parameters successfully
loc("Slice_108"): error: shape inference failed

I got this error when I also added the canonicalization pass (in PR #2232) as you did. We can add support for dynamic axes in Slice; the output shape will be tensor<?x?x...x?xT>.
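
As a sketch of what that support could produce (hypothetical IR with a non-constant axes operand; every dimension degrades to dynamic):

func.func @slice_dynamic_axes(%data: tensor<2x4xf32>, %starts: tensor<1xi64>, %ends: tensor<1xi64>, %axes: tensor<1xi64>, %steps: tensor<1xi64>) -> tensor<?x?xf32> {
  // axes is a runtime value, so no output dimension can be inferred statically
  %0 = "onnx.Slice"(%data, %starts, %ends, %axes, %steps) : (tensor<2x4xf32>, tensor<1xi64>, tensor<1xi64>, tensor<1xi64>, tensor<1xi64>) -> tensor<?x?xf32>
  return %0 : tensor<?x?xf32>
}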

chentong319 avatar May 11 '23 02:05 chentong319

@sorenlassen By the way, this model has lots of custom Ops. We can use it as a test case.

chentong319 avatar May 11 '23 02:05 chentong319

Adding a canonicalization pass caused the numerical test to fail. The error message is:

error: type of return operand 0 ('tensor<?x?x1x5xf32>') doesn't match function result type ('tensor<*xf32>') in function @main_graph
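
Schematically, the failure looks like this (a hypothetical reduction, not the actual model IR): shape inference refined the return operand's type, but the function signature still carries the old unranked type, so the verifier complains:

func.func @main_graph(%arg0: tensor<?x?x1x5xf32>) -> tensor<*xf32> {
  // the return operand's type was refined by shape inference, but the
  // function's declared result type was not updated to match
  return %arg0 : tensor<?x?x1x5xf32>
}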

A function type issue again? Since further investigation is needed, I have rolled back my change in that PR for now.

chentong319 avatar May 11 '23 15:05 chentong319

test_fx_to_onnx_with_onnxruntime.TestFxToOnnxWithOnnxRuntime_op_level_debug_True_dynamic_shapes_True.test_gpt2_tiny_from_config.zip

(base) ➜  bin git:(main) ✗ ./onnx-mlir --EmitONNXIR  /home/justinchu/dev/onnx-mlir/test_fx_to_onnx_with_onnxruntime.TestFxToOnnxWithOnnxRuntime_op_level_debug_True_dynamic_shapes_True.test_gpt2_tiny_from_config.onnx
[1]    312747 segmentation fault (core dumped)  ./onnx-mlir --EmitONNXIR 
(base) ➜  bin git:(main) ✗ ./onnx-mlir /home/justinchu/dev/onnx-mlir/test_fx_to_onnx_with_onnxruntime.TestFxToOnnxWithOnnxRuntime_op_level_debug_True_dynamic_shapes_True.test_gpt2_tiny_from_config.onnx 
[1]    312898 segmentation fault (core dumped)  ./onnx-mlir 

justinchuby avatar Sep 07 '23 04:09 justinchuby

This is going to be the type of model created by the new PyTorch 2.1, so just a heads up.

justinchuby avatar Sep 07 '23 04:09 justinchuby