marian-dev
ONNX exporter in marian-conv is broken
When I build marian-dev with:
cmake .. -DUSE_ONNX=ON -DUSE_SENTENCEPIECE=on -DCOMPILE_CPU=on -DCOMPILE_SERVER=on -DUSE_FBGEMM=on -DUSE_MPI=ON
make -j8
I get the following error:
In file included from /home/gokul/marian-dev/src/3rd_party/onnx/protobuf/onnx-ml.pb.cc:4,
from /home/gokul/marian-dev/src/3rd_party/onnx/protobuf/onnx-ml.pb-wrapper.cpp:29:
/home/gokul/marian-dev/src/3rd_party/onnx/protobuf/onnx-ml.pb.h:10:10: fatal error: google/protobuf/port_def.inc: No such file or directory
#include <google/protobuf/port_def.inc>
^~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
compilation terminated.
make[2]: *** [src/CMakeFiles/marian.dir/build.make:479: src/CMakeFiles/marian.dir/3rd_party/onnx/protobuf/onnx-ml.pb-wrapper.cpp.o] Error 1
Do you require any other details? Please let me know. Thanks!
The issue is with the protobuf version mentioned in the Marian documentation: https://marian-nmt.github.io/docs/#ubuntu-packages
It says to install libprotobuf17 explicitly, which corresponds to version 3.6.
But the file reported missing above is available only from v3.7, so I had to uninstall that package and install a later version.
In my case, I installed what was available in my apt-cache (from the buster-backports repo):
sudo apt install libprotobuf-dev=3.12.3-2~bpo10+1 protobuf-compiler=3.12.3-2~bpo10+1
which implicitly installed the dependency libprotobuf23.
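A quick way to check whether an installed protobuf is new enough before rebuilding might look like the sketch below. It assumes the v3.7 threshold reported in this thread and parses version text such as the output of `protoc --version` or an apt version string; the function names are hypothetical and not part of Marian.

```python
import re

# First protobuf release reported in this thread to ship google/protobuf/port_def.inc.
MIN_PROTOBUF = (3, 7)


def parse_protobuf_version(text):
    """Extract (major, minor, patch) from text such as 'libprotoc 3.12.3'."""
    match = re.search(r"(\d+)\.(\d+)(?:\.(\d+))?", text)
    if match is None:
        raise ValueError("no version number found in: " + repr(text))
    major, minor, patch = match.groups()
    return (int(major), int(minor), int(patch or 0))


def has_port_def_inc(text):
    """Return True if the version in `text` is at least MIN_PROTOBUF."""
    return parse_protobuf_version(text)[:2] >= MIN_PROTOBUF
```

For example, `has_port_def_inc("libprotoc 3.6.1")` is false, while the backports version string `3.12.3-2~bpo10+1` passes the check.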
After doing the above, I hoped the build would work, but it didn't. Building marian-dev again threw a large number of errors, mostly arising from marian-dev/src/3rd_party/onnx/
I have attached the error logs. Can you please check if that helps to debug? marian_onnx_errors.log
I also regenerated the protobuf-compiled files as described here (in case the output varies with the protobuf version): https://github.com/marian-nmt/marian-dev/blob/c84599d08ad69059279abd5a7417a8053db8b631/src/3rd_party/onnx/protobuf/onnx-ml.pb-wrapper.cpp#L13-L14
But that didn't help.
Also, where do these PRs come from? https://github.com/marian-nmt/marian-dev/commits/master/src/3rd_party/onnx/protobuf
TL;DR -- What is the exact protobuf version I am supposed to use with Marian in order for ONNX export to work? @frankseide
Hmm, so the issue was with the version of protobuf-lite in the SentencePiece fork, which was v3.6 and conflicted with the v3.12 I was trying to build Marian with (for ONNX exports to work). Running CMake with -DUSE_SENTENCEPIECE=off solved the build issue.
One more bug: this line: https://github.com/marian-nmt/marian-dev/blob/b64e258bda3d9134a39f41229776b630bc187094/src/onnx/expression_graph_onnx_exporter.cpp#L8
should be changed to:
#include "tensors/cpu/expression_graph_packable.h"
A temporary workaround in the CMake file would be to ensure that USE_ONNX and USE_SENTENCEPIECE are not switched on simultaneously (until the above conflict is fixed).
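Until such a guard exists in the build files, the same check could be done in a small pre-build script. The sketch below is only an illustration: the helper names are hypothetical, it looks only at -D flags passed explicitly on the command line, and it does not consult the defaults set in Marian's CMakeLists.txt.

```python
# Values CMake treats as true for boolean options (compared case-insensitively).
TRUTHY = {"1", "ON", "YES", "TRUE", "Y"}


def cmake_flags(args):
    """Map each explicitly passed -DNAME=VALUE argument to a boolean."""
    flags = {}
    for arg in args:
        if arg.startswith("-D") and "=" in arg:
            name, value = arg[2:].split("=", 1)
            name = name.split(":", 1)[0]  # drop an optional :TYPE suffix
            flags[name] = value.upper() in TRUTHY
    return flags


def onnx_sentencepiece_conflict(args):
    """True if both USE_ONNX and USE_SENTENCEPIECE are explicitly enabled."""
    flags = cmake_flags(args)
    return bool(flags.get("USE_ONNX", False) and flags.get("USE_SENTENCEPIECE", False))
```

Applied to the cmake arguments from the original report, this flags a conflict; with -DUSE_SENTENCEPIECE=off it does not.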
When I was trying to run marian_to_onnx_example.py, it threw the following error:
Error
marian-dev/scripts/onnx$ python3 marian_to_onnx_example.py
[2022-01-21 04:30:06] Outputting /tmp/model.npz.best-bleu-detok.npz, precision: float32
[2022-01-21 04:30:06] Loading model from /home/gokul/experiments/v5.4--indic_roman_to_eng/model/model.npz.best-bleu-detok.npz
[2022-01-21 04:30:11] [memory] Reserving 860 MB, device cpu0
[2022-01-21 04:30:11] [data] Loading vocabulary from text file /home/gokul/Conversion/onnx/rom.wl
[2022-01-21 04:30:11] [data] Loading vocabulary from text file /home/gokul/Conversion/onnx/en.wl
[2022-01-21 04:30:11] Error: Required option 'factors-combine' has not been set
[2022-01-21 04:30:11] Error: Aborted from T marian::Options::get(const char*) const [with T = std::__cxx11::basic_string<char>] in /home/gokul/marian-dev/src/common/options.h:134
[CALL STACK]
[0x5614fb7f9cb6] std::__cxx11::basic_string<char,std::char_traits<char>,std::allocator<char>> marian::Options:: get <std::__cxx11::basic_string<char,std::char_traits<char>,std::allocator<char>>>(char const*) const + 0x236
[0x5614fbe947e7] marian::EncoderDecoderLayerBase:: createEmbeddingLayer () const + 0x97
[0x5614fbe95365] marian::EncoderDecoderLayerBase:: getEmbeddingLayer (bool) const + 0x125
[0x5614fbbd55c5] marian::EncoderTransformer:: apply (std::shared_ptr<marian::data::CorpusBatch>) + 0xb5
[0x5614fbbd6865] marian::EncoderTransformer:: build (std::shared_ptr<marian::ExpressionGraph>, std::shared_ptr<marian::data::CorpusBatch>) + 0x85
[0x5614fbc0669d] marian::EncoderDecoder:: startState (std::shared_ptr<marian::ExpressionGraph>, std::shared_ptr<marian::data::CorpusBatch>) + 0xad
[0x5614fbb504d8] marian::models::Stepwise:: startState (std::shared_ptr<marian::ExpressionGraph>, std::shared_ptr<marian::data::CorpusBatch>) + 0x68
[0x5614fbae10a6] marian::ExpressionGraphONNXExporter:: exportToONNX (std::__cxx11::basic_string<char,std::char_traits<char>,std::allocator<char>> const&, std::shared_ptr<marian::Options>, std::vector<std::__cxx11::basic_string<char,std::char_traits<char>,std::allocator<char>>,std::allocator<std::__cxx11::basic_string<char,std::char_traits<char>,std::allocator<char>>>> const&) + 0xa36
[0x5614fb752c2a] mainConv (int, char**) + 0x132a
[0x5614fb70a5ce] main + 0x11e
[0x7f510b7b209b] __libc_start_main + 0xeb
[0x5614fb74cfea] _start + 0x2a
Traceback (most recent call last):
File "marian_to_onnx_example.py", line 24, in <module>
partial_models = mo.export_marian_model_components(marian_npz, marian_vocs)
File "/home/gokul/marian-dev/scripts/onnx/marian_to_onnx.py", line 102, in export_marian_model_components
subprocess.run([command] + args, check=True)
File "/opt/conda/lib/python3.7/subprocess.py", line 512, in run
output=stdout, stderr=stderr)
subprocess.CalledProcessError: Command '['/home/gokul/marian-dev/scripts/onnx/../../build/marian', 'convert', '--from', '/home/gokul/experiments/v5.4--indic_roman_to_eng/model/model.npz.best-bleu-detok.npz', '--vocabs', '/home/gokul/Conversion/onnx/rom.wl', '/home/gokul/Conversion/onnx/en.wl', '--to', '/tmp/model.npz.best-bleu-detok.npz', '--export-as', 'onnx-encode']' died with <Signals.SIGABRT: 6>.
For some reason, it requires the options "factors-combine" and "factors-dim-emb". As a temporary hack, I hardcoded those options here: https://github.com/marian-nmt/marian-dev/blob/b64e258bda3d9134a39f41229776b630bc187094/src/command/marian_conv.cpp#L114
But again, during conversion, there was a segfault:
Error
[2022-01-21 09:13:51] [graph] After creating expanded nodes, we now have 1768 nodes
[2022-01-21 09:13:51] [onnx] Exporting graph decode_first
[2022-01-21 09:13:51] Error: Segmentation fault
[2022-01-21 09:13:51] Error: Aborted from setErrorHandlers()::<lambda(int, siginfo_t*, void*)> in /home/gokul/marian-dev/src/common/logging.cpp:130
[CALL STACK]
[0x5589b0dfb67e] + 0x39667e
[0x5589b0dfb8f9] + 0x3968f9
[0x7f8561108730] + 0x12730
[0x5589b107f420] + 0x61a420
[0x5589b107f981] marian::ExpressionGraphONNXExporter:: rebuildNodesForward (marian::InputsMap const&, std::vector<std::pair<std::__cxx11::basic_string<char,std::char_traits<char>,std::allocator<char>>,IntrusivePtr<marian::Chainable<IntrusivePtr<marian::TensorBase>>>>,std::allocator<std::pair<std::__cxx11::basic_string<char,std::char_traits<char>,std::allocator<char>>,IntrusivePtr<marian::Chainable<IntrusivePtr<marian::TensorBase>>>>>> const&) + 0x201
[0x5589b108e04a] marian::ExpressionGraphONNXExporter:: serializeToONNX (std::__cxx11::basic_string<char,std::char_traits<char>,std::allocator<char>> const&, std::map<std::__cxx11::basic_string<char,std::char_traits<char>,std::allocator<char>>,std::pair<std::vector<std::pair<std::__cxx11::basic_string<char,std::char_traits<char>,std::allocator<char>>,IntrusivePtr<marian::Chainable<IntrusivePtr<marian::TensorBase>>>>,std::allocator<std::pair<std::__cxx11::basic_string<char,std::char_traits<char>,std::allocator<char>>,IntrusivePtr<marian::Chainable<IntrusivePtr<marian::TensorBase>>>>>>,std::vector<std::pair<std::__cxx11::basic_string<char,std::char_traits<char>,std::allocator<char>>,IntrusivePtr<marian::Chainable<IntrusivePtr<marian::TensorBase>>>>,std::allocator<std::pair<std::__cxx11::basic_string<char,std::char_traits<char>,std::allocator<char>>,IntrusivePtr<marian::Chainable<IntrusivePtr<marian::TensorBase>>>>>>>,std::less<std::__cxx11::basic_string<char,std::char_traits<char>,std::allocator<char>>>,s
[0x5589b1069cee] marian::ExpressionGraphONNXExporter:: exportToONNX (std::__cxx11::basic_string<char,std::char_traits<char>,std::allocator<char>> const&, std::shared_ptr<marian::Options>, std::vector<std::__cxx11::basic_string<char,std::char_traits<char>,std::allocator<char>>,std::allocator<std::__cxx11::basic_string<char,std::char_traits<char>,std::allocator<char>>>> const&) + 0x20de
[0x5589b0d49805] main + 0x12b5
[0x7f8534a9709b] __libc_start_main + 0xeb
[0x5589b0d8031a] _start + 0x2a
The options "factors-combine" and "factors-dim-emb" were added recently, and it seems the check for whether they are provided is missing. The issue wasn't caught because most Marian executables define these options with default values. Thanks for reporting this.
Regarding ONNX, it hasn't been continuously tested recently, so it may require some updates. Please feel free to open a pull request if you find that something doesn't work properly and you manage to fix it.
Hey, so I found how to get somewhat further than this @GokulNC, but I wouldn't go down this path unless you're ready to spend some time. I found your issue basically by chance; I've been following the same process as you.
Using gdb I found that your last error came from the node being a null pointer, so just adding this around here does the trick:
if (!node)
return;
After that I got other errors, because some of the nodes in the Marian model I was using do not exist in ONNX (namely, the swish / silu activation function). I replaced this with a dummy ReLU node in the mapping here and was able to advance until the export of the decode_next graph :champagne: But then another error was thrown basically at launch, due to another null pointer.
Anyway, I then went on to check the graph of the decoder's first step, and found that half of the decoder hidden state outputs were missing, I think the ones for cross-attention. This was maybe due to the "fix" I added, but anyway the code seems to need some pretty serious looking at, and I'm not fluent enough in C++ to go further ^^"
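To see exactly which decoder state outputs are missing from an exported graph, a small helper like the following might be useful. It is a sketch under assumptions: the `first_decoder_state_<i>` naming is taken from a later comment in this thread, the expected number of states has to be supplied by the caller, and the function name is hypothetical.

```python
import re


def missing_state_indices(output_names, expected_count, prefix="first_decoder_state_"):
    """Return the indices i in [0, expected_count) with no output named prefix + str(i)."""
    pattern = re.compile(re.escape(prefix) + r"(\d+)$")
    present = set()
    for name in output_names:
        match = pattern.match(name)
        if match:
            present.add(int(match.group(1)))
    return [i for i in range(expected_count) if i not in present]
```

The output names themselves could be collected with the `onnx` Python package, for example `[o.name for o in onnx.load(path).graph.output]`, and passed to this function.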
Thanks to you all, it works! I got:
[2022-08-26 07:08:09] [onnx] ONNX graph 'decode_first' written to model.onnx.decode_first.onnx
[2022-08-26 07:08:09] [onnx] Exporting graph decode_next
[2022-08-26 07:08:09] Error: Segmentation fault
[2022-08-26 07:08:09] Error: Aborted from setErrorHandlers()::<lambda(int, siginfo_t*, void*)> in /home/youxixie/008-Marian-Onnx/marian/src/common/logging.cpp:130
Another question: how can I use the ONNX model (model.onnx.decode_first.onnx) in a Marian MT project (C++)?
@snukky is there any plan to fix the error at "Exporting graph decode_next"? I located the problem, but I don't know how to fix it:
for (const auto& dss : extractStates(decodeFirstState)) {
    std::cout << "value_type():" << dss->value_type() << std::endl;
    inputs.emplace_back(std::make_pair("decoder_state_" + std::to_string(inputs.size() - (numEncoders*2 + 2)), dss));
}
The second dss is null, which triggers the segmentation fault.
I'm not sure whether this problem has been fixed, but I managed to get three ONNX models (encode_source, decode_first, decode_next).
Using the Netron tool, I found that the decode_first outputs do not include first_decoder_state_1, 3, ...
So modifying the loop here does the trick:
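size_t iIdx = 0;  // assumed: declared before the loop (the declaration is not shown in the original comment)
for (const auto& dss : extractStates(decodeFirstState)) {
    if (iIdx % 2 == 0) {  // keep only even-indexed states; the odd-indexed ones are null
        inputs.emplace_back(std::make_pair("decoder_state_" + std::to_string(iIdx), dss));
    }
    iIdx++;
}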
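The loop above relies on the null states sitting at the odd positions. A variant that tests for null directly, rather than relying on parity, can be modeled in Python as below. This is only a sketch of the idea: the function name is hypothetical, a plain list with `None` entries stands in for the result of extractStates, and whether keeping the original index in the name matches what decode_next expects is an assumption.

```python
def name_decoder_states(states, prefix="decoder_state_"):
    """Pair each non-null state with a name built from its original position."""
    named = []
    for index, state in enumerate(states):
        if state is None:  # stands in for the null dss that caused the segfault
            continue
        named.append((prefix + str(index), state))
    return named
```

With null entries at the odd positions this produces the same names as the parity check, but it would also keep working if the null states turned up elsewhere.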
It does not work with the script marian_to_onnx_example.py:
the null pointer still needs to be fixed: