marian-dev

ONNX exporter in marian-conv is broken

Open GokulNC opened this issue 3 years ago • 11 comments

When I build marian-dev with:

cmake .. -DUSE_ONNX=ON -DUSE_SENTENCEPIECE=on -DCOMPILE_CPU=on -DCOMPILE_SERVER=on -DUSE_FBGEMM=on -DUSE_MPI=ON
make -j8

I get the following error:

In file included from /home/gokul/marian-dev/src/3rd_party/onnx/protobuf/onnx-ml.pb.cc:4,                                                                     
                 from /home/gokul/marian-dev/src/3rd_party/onnx/protobuf/onnx-ml.pb-wrapper.cpp:29:                                                           
/home/gokul/marian-dev/src/3rd_party/onnx/protobuf/onnx-ml.pb.h:10:10: fatal error: google/protobuf/port_def.inc: No such file or directory                   
 #include <google/protobuf/port_def.inc>
          ^~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
compilation terminated.
make[2]: *** [src/CMakeFiles/marian.dir/build.make:479: src/CMakeFiles/marian.dir/3rd_party/onnx/protobuf/onnx-ml.pb-wrapper.cpp.o] Error 1  

Do you require any other details? Please let me know. Thanks!

GokulNC avatar Jan 18 '22 04:01 GokulNC

The issue is with the protobuf version mentioned in the Marian documentation: https://marian-nmt.github.io/docs/#ubuntu-packages

It says to install libprotobuf17 explicitly, which corresponds to version 3.6. But the file reported missing above is only available from v3.7 onwards, so I had to uninstall it and install a later version.

In my case, I installed what was available in my apt cache (from the buster-backports repo), namely:

sudo apt install libprotobuf-dev=3.12.3-2~bpo10+1 protobuf-compiler=3.12.3-2~bpo10+1

which implicitly installed the dependency libprotobuf23.
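
The version constraint described above (port_def.inc is reported to exist only from protobuf v3.7) can be sketched as a small check. This is only an illustration: `parseVersion` and `providesPortDef` are hypothetical helper names, not part of Marian or protobuf, and the 3.7 threshold is taken from this thread rather than verified independently.

```cpp
#include <sstream>
#include <string>
#include <tuple>

// Hypothetical helper: parse "major.minor.patch" from the start of a version
// string. Missing parts default to 0; trailing packaging suffixes such as
// "-2~bpo10+1" are ignored.
std::tuple<int, int, int> parseVersion(const std::string& version) {
  int major = 0, minor = 0, patch = 0;
  char dot = 0;
  std::istringstream in(version);
  in >> major >> dot >> minor >> dot >> patch;
  return std::make_tuple(major, minor, patch);
}

// Assumption from the thread: google/protobuf/port_def.inc ships from v3.7 on.
bool providesPortDef(const std::string& version) {
  return parseVersion(version) >= std::make_tuple(3, 7, 0);
}
```

With these helpers, `providesPortDef("3.6")` is false, while the backports package version `3.12.3-2~bpo10+1` passes the check.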


After doing the above, I was hoping it would work, but it didn't. Building marian-dev again threw a large number of errors, mostly arising from marian-dev/src/3rd_party/onnx/

I have attached the error logs. Can you please check if that helps to debug? marian_onnx_errors.log


I also regenerated the protobuf-compiled files as described here (in case the output varies with the protobuf version): https://github.com/marian-nmt/marian-dev/blob/c84599d08ad69059279abd5a7417a8053db8b631/src/3rd_party/onnx/protobuf/onnx-ml.pb-wrapper.cpp#L13-L14

But that didn't help.


Also, where do these PRs come from? https://github.com/marian-nmt/marian-dev/commits/master/src/3rd_party/onnx/protobuf

GokulNC avatar Jan 18 '22 07:01 GokulNC

TL;DR -- What is the exact protobuf version I am supposed to use with Marian in order for ONNX export to work? @frankseide

GokulNC avatar Jan 18 '22 07:01 GokulNC

Hmm, so the issue was with the version of protobuf-lite in the SentencePiece fork, which was v3.6 and conflicted with the v3.12 I was trying to build Marian with (for ONNX exports to work). Running cmake with -DUSE_SENTENCEPIECE=off solved the build issue.

One more bug is in this line: https://github.com/marian-nmt/marian-dev/blob/b64e258bda3d9134a39f41229776b630bc187094/src/onnx/expression_graph_onnx_exporter.cpp#L8

It should be changed to this:

#include "tensors/cpu/expression_graph_packable.h"

One temporary thing you could do in the CMake file is to ensure that USE_ONNX and USE_SENTENCEPIECE are not switched on simultaneously (until the above conflict is fixed).

GokulNC avatar Jan 21 '22 04:01 GokulNC

When I was trying to run marian_to_onnx_example.py, it threw the following error:

Error
marian-dev/scripts/onnx$ python3 marian_to_onnx_example.py                                                                     
[2022-01-21 04:30:06] Outputting /tmp/model.npz.best-bleu-detok.npz, precision: float32                                                                       
[2022-01-21 04:30:06] Loading model from /home/gokul/experiments/v5.4--indic_roman_to_eng/model/model.npz.best-bleu-detok.npz
[2022-01-21 04:30:11] [memory] Reserving 860 MB, device cpu0
[2022-01-21 04:30:11] [data] Loading vocabulary from text file /home/gokul/Conversion/onnx/rom.wl
[2022-01-21 04:30:11] [data] Loading vocabulary from text file /home/gokul/Conversion/onnx/en.wl
[2022-01-21 04:30:11] Error: Required option 'factors-combine' has not been set
[2022-01-21 04:30:11] Error: Aborted from T marian::Options::get(const char*) const [with T = std::__cxx11::basic_string<char>] in /home/gokul/marian-dev/src/common/options.h:134

[CALL STACK]
[0x5614fb7f9cb6]    std::__cxx11::basic_string<char,std::char_traits<char>,std::allocator<char>> marian::Options::  get  <std::__cxx11::basic_string<char,std::char_traits<char>,std::allocator<char>>>(char const*) const + 0x236
[0x5614fbe947e7]    marian::EncoderDecoderLayerBase::  createEmbeddingLayer  () const + 0x97                                                                  
[0x5614fbe95365]    marian::EncoderDecoderLayerBase::  getEmbeddingLayer  (bool) const + 0x125                                                                
[0x5614fbbd55c5]    marian::EncoderTransformer::  apply  (std::shared_ptr<marian::data::CorpusBatch>) + 0xb5                                                  
[0x5614fbbd6865]    marian::EncoderTransformer::  build  (std::shared_ptr<marian::ExpressionGraph>,  std::shared_ptr<marian::data::CorpusBatch>) + 0x85       
[0x5614fbc0669d]    marian::EncoderDecoder::  startState  (std::shared_ptr<marian::ExpressionGraph>,  std::shared_ptr<marian::data::CorpusBatch>) + 0xad      
[0x5614fbb504d8]    marian::models::Stepwise::  startState  (std::shared_ptr<marian::ExpressionGraph>,  std::shared_ptr<marian::data::CorpusBatch>) + 0x68    
[0x5614fbae10a6]    marian::ExpressionGraphONNXExporter::  exportToONNX  (std::__cxx11::basic_string<char,std::char_traits<char>,std::allocator<char>> const&,  std::shared_ptr<marian::Options>,  std::vector<std::__cxx11::basic_string<char,std::char_traits<char>,std::allocator<char>>,std::allocator<std::__cxx11::basic_string<char,std::char_traits<char>,std::allocator<char>>>> const&) + 0xa36
[0x5614fb752c2a]    mainConv  (int,  char**)                           + 0x132a                                                                               
[0x5614fb70a5ce]    main                                               + 0x11e
[0x7f510b7b209b]    __libc_start_main                                  + 0xeb
[0x5614fb74cfea]    _start                                             + 0x2a

Traceback (most recent call last):
  File "marian_to_onnx_example.py", line 24, in <module>
    partial_models = mo.export_marian_model_components(marian_npz, marian_vocs)                                                                               
  File "/home/gokul/marian-dev/scripts/onnx/marian_to_onnx.py", line 102, in export_marian_model_components                                                   
    subprocess.run([command] + args, check=True)
  File "/opt/conda/lib/python3.7/subprocess.py", line 512, in run
    output=stdout, stderr=stderr)
subprocess.CalledProcessError: Command '['/home/gokul/marian-dev/scripts/onnx/../../build/marian', 'convert', '--from', '/home/gokul/experiments/v5.4--indic_roman_to_eng/model/model.npz.best-bleu-detok.npz', '--vocabs', '/home/gokul/Conversion/onnx/rom.wl', '/home/gokul/Conversion/onnx/en.wl', '--to', '/tmp/model.npz.best-bleu-detok.npz', '--export-as', 'onnx-encode']' died with <Signals.SIGABRT: 6>. 

For some reason, it requires the options "factors-combine" and "factors-dim-emb". As a temporary hack, I hardcoded those options here: https://github.com/marian-nmt/marian-dev/blob/b64e258bda3d9134a39f41229776b630bc187094/src/command/marian_conv.cpp#L114

But again, during conversion, there was a segfault:

Error
[2022-01-21 09:13:51] [graph] After creating expanded nodes, we now have 1768 nodes
[2022-01-21 09:13:51] [onnx] Exporting graph decode_first
[2022-01-21 09:13:51] Error: Segmentation fault
[2022-01-21 09:13:51] Error: Aborted from setErrorHandlers()::<lambda(int, siginfo_t*, void*)> in /home/gokul/marian-dev/src/common/logging.cpp:130

[CALL STACK]
[0x5589b0dfb67e]                                                       + 0x39667e
[0x5589b0dfb8f9]                                                       + 0x3968f9
[0x7f8561108730]                                                       + 0x12730
[0x5589b107f420]                                                       + 0x61a420
[0x5589b107f981]    marian::ExpressionGraphONNXExporter::  rebuildNodesForward  (marian::InputsMap const&,  std::vector<std::pair<std::__cxx11::basic_string<char,std::char_traits<char>,std::allocator<char>>,IntrusivePtr<marian::Chainable<IntrusivePtr<marian::TensorBase>>>>,std::allocator<std::pair<std::__cxx11::basic_string<char,std::char_traits<char>,std::allocator<char>>,IntrusivePtr<marian::Chainable<IntrusivePtr<marian::TensorBase>>>>>> const&) + 0x201
[0x5589b108e04a]    marian::ExpressionGraphONNXExporter::  serializeToONNX  (std::__cxx11::basic_string<char,std::char_traits<char>,std::allocator<char>> const&,  std::map<std::__cxx11::basic_string<char,std::char_traits<char>,std::allocator<char>>,std::pair<std::vector<std::pair<std::__cxx11::basic_string<char,std::char_traits<char>,std::allocator<char>>,IntrusivePtr<marian::Chainable<IntrusivePtr<marian::TensorBase>>>>,std::allocator<std::pair<std::__cxx11::basic_string<char,std::char_traits<char>,std::allocator<char>>,IntrusivePtr<marian::Chainable<IntrusivePtr<marian::TensorBase>>>>>>,std::vector<std::pair<std::__cxx11::basic_string<char,std::char_traits<char>,std::allocator<char>>,IntrusivePtr<marian::Chainable<IntrusivePtr<marian::TensorBase>>>>,std::allocator<std::pair<std::__cxx11::basic_string<char,std::char_traits<char>,std::allocator<char>>,IntrusivePtr<marian::Chainable<IntrusivePtr<marian::TensorBase>>>>>>>,std::less<std::__cxx11::basic_string<char,std::char_traits<char>,std::allocator<char>>>,s
[0x5589b1069cee]    marian::ExpressionGraphONNXExporter::  exportToONNX  (std::__cxx11::basic_string<char,std::char_traits<char>,std::allocator<char>> const&,  std::shared_ptr<marian::Options>,  std::vector<std::__cxx11::basic_string<char,std::char_traits<char>,std::allocator<char>>,std::allocator<std::__cxx11::basic_string<char,std::char_traits<char>,std::allocator<char>>>> const&) + 0x20de
[0x5589b0d49805]    main                                               + 0x12b5
[0x7f8534a9709b]    __libc_start_main                                  + 0xeb
[0x5589b0d8031a]    _start                                             + 0x2a

GokulNC avatar Jan 21 '22 09:01 GokulNC

The options "factors-combine" and "factors-dim-emb" were added recently, and it seems the check for whether they are provided is missing. The issue wasn't caught because most marian executables define these options with default values. Thanks for reporting this.
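
The difference between a strict lookup (which aborts when an option is absent, as in the log above) and a lookup with a fallback can be sketched as follows. `SimpleOptions` is a hypothetical stand-in written for this illustration, not Marian's `Options` class, and the fallback value used in the usage note is a placeholder rather than Marian's confirmed default.

```cpp
#include <map>
#include <stdexcept>
#include <string>

// Hypothetical, minimal option store used only to illustrate the two lookups.
class SimpleOptions {
public:
  void set(const std::string& key, const std::string& value) { values_[key] = value; }

  bool has(const std::string& key) const { return values_.count(key) > 0; }

  // Strict lookup: fails loudly if the option was never set.
  const std::string& get(const std::string& key) const {
    auto it = values_.find(key);
    if (it == values_.end())
      throw std::runtime_error("Required option '" + key + "' has not been set");
    return it->second;
  }

  // Lenient lookup: returns the supplied fallback if the option was never set.
  std::string get(const std::string& key, const std::string& fallback) const {
    auto it = values_.find(key);
    return it == values_.end() ? fallback : it->second;
  }

private:
  std::map<std::string, std::string> values_;
};
```

For an empty `SimpleOptions opts`, `opts.get("factors-combine")` throws, whereas `opts.get("factors-combine", "sum")` returns the placeholder `"sum"`. An executable that registers no default for an option would need the second form, or an explicit `has` check, to avoid the abort.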

Regarding ONNX, it hasn't been continuously tested recently, so it may require some updates. Please feel free to open a pull request if you find that something doesn't work properly and you manage to fix it.

snukky avatar Jan 21 '22 13:01 snukky

Hey, so I found how to get somewhat further than this @GokulNC, but I wouldn't go down this path unless you're ready to spend some time. I found your issue basically by chance; I've been following the same process as you.

Using gdb I found that your last error came from the node being a null pointer, so just adding this around here does the trick:

if (!node)
    return;
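
To show the effect of such a guard in isolation, here is a minimal sketch of a recursive graph traversal that tolerates null entries. `Node` and `visit` are hypothetical names for this illustration; this is not the exporter's actual traversal code.

```cpp
#include <memory>
#include <string>
#include <vector>

// Hypothetical graph node; children may contain null pointers.
struct Node {
  std::string name;
  std::vector<std::shared_ptr<Node>> children;
};

// Collects node names in post-order (children before their parent).
// The early return skips null nodes instead of dereferencing them.
void visit(const std::shared_ptr<Node>& node, std::vector<std::string>& out) {
  if (!node)
    return;
  for (const auto& child : node->children)
    visit(child, out);
  out.push_back(node->name);
}
```

Note that a guard like this only hides the crash. If a node is null because an earlier step failed to produce it, the exported graph may still be incomplete.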

After that I got other errors, because some of the nodes in the Marian model I was using do not exist in ONNX (namely, the swish / SiLU activation function). I replaced this with a dummy ReLU node in the mapping here and was able to advance until the export of the decode_next graph :champagne: But then another error was thrown basically at launch, due to another null pointer.

Anyway, I then went on to check the graph of the decoder's first step, and found that half of the decoder hidden state outputs were missing, I think the ones for cross-attention. This was maybe due to the "fix" I added, but anyway the code seems to need some pretty serious looking at, and I'm not fluent enough in C++ to go further ^^"
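
For reference, swish / SiLU is defined as x * sigmoid(x), so it can be written in terms of a sigmoid and an element-wise multiplication, both of which exist as ONNX operators (Sigmoid and Mul). The sketch below only compares the scalar functions; it is not the commenter's change, and whether Marian's exporter can emit this decomposition is not confirmed in this thread.

```cpp
#include <cmath>

// Logistic sigmoid: 1 / (1 + e^-x).
double sigmoid(double x) { return 1.0 / (1.0 + std::exp(-x)); }

// Swish / SiLU expressed as a product of the input and its sigmoid.
double swish(double x) { return x * sigmoid(x); }

// ReLU, the stand-in used as a dummy replacement in the comment above.
double relu(double x) { return x > 0.0 ? x : 0.0; }
```

The two functions disagree for most inputs; for negative inputs ReLU returns exactly zero while swish returns a small negative value. A model exported with the dummy ReLU would therefore not reproduce the original model's outputs.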

romain-keramitas-prl avatar Feb 01 '22 15:02 romain-keramitas-prl

Thanks to all of you, it works! I got:

[2022-08-26 07:08:09] [onnx] ONNX graph 'decode_first' written to model.onnx.decode_first.onnx
[2022-08-26 07:08:09] [onnx] Exporting graph decode_next
[2022-08-26 07:08:09] Error: Segmentation fault
[2022-08-26 07:08:09] Error: Aborted from setErrorHandlers()::<lambda(int, siginfo_t*, void*)> in /home/youxixie/008-Marian-Onnx/marian/src/common/logging.cpp:130

Another question: how can I use the ONNX model (model.onnx.decode_first.onnx) in a Marian MT project (C++)?

xyx361100238 avatar Aug 26 '22 07:08 xyx361100238

@snukky, is there any plan to fix the error in "Exporting graph decode_next"? I located the problem, but I don't know how to fix it:

for (const auto& dss : extractStates(decodeFirstState)) {
    std::cout << "value_type():" << dss->value_type() << std::endl;
    inputs.emplace_back(std::make_pair("decoder_state_" + std::to_string(inputs.size() - (numEncoders*2 + 2)), dss));
}

The second dss is null, which causes the segmentation fault.

xyx361100238 avatar Aug 29 '22 10:08 xyx361100238

I'm not sure whether this problem has been fixed, but I now get three ONNX models (encode_source, decode_first, decode_next). Using the Netron tool, I found that the decode_first outputs have no first_decoder_state_1&3&... (image). So modifying the code here does the trick:

for (const auto& dss : extractStates(decodeFirstState)) {
    if (iIdx % 2 == 0) {
        inputs.emplace_back(std::make_pair("decoder_state_" + std::to_string(iIdx), dss));
    }
    iIdx++;
}
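
A variant of the workaround above is to skip entries that are actually null instead of hardcoding the even indices. This is a self-contained sketch under the assumption that the missing states are exactly the null ones, which the thread suggests but does not confirm. `State`, `StatePtr` and `nameNonNullStates` are hypothetical names, not Marian types or functions.

```cpp
#include <cstddef>
#include <memory>
#include <string>
#include <utility>
#include <vector>

// Hypothetical stand-in for a decoder state node.
struct State {
  int id;
};
using StatePtr = std::shared_ptr<State>;

// Builds (name, state) pairs, naming each state by its original position
// and skipping null entries so they are never dereferenced later.
std::vector<std::pair<std::string, StatePtr>>
nameNonNullStates(const std::vector<StatePtr>& states) {
  std::vector<std::pair<std::string, StatePtr>> inputs;
  for (std::size_t i = 0; i < states.size(); ++i) {
    if (!states[i])
      continue;
    inputs.emplace_back("decoder_state_" + std::to_string(i), states[i]);
  }
  return inputs;
}
```

For a list whose odd-indexed entries are null, this yields the same names as the even-index filter (decoder_state_0, decoder_state_2, and so on), but it does not depend on the nulls following that pattern.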

xyx361100238 avatar Aug 31 '22 07:08 xyx361100238

It doesn't work with the script marian_to_onnx_example.py: image

xyx361100238 avatar Sep 01 '22 04:09 xyx361100238

The null pointer still needs to be fixed: image

xyx361100238 avatar Sep 01 '22 07:09 xyx361100238