
FP16 support

Open kpu opened this issue 7 years ago • 16 comments

Should add pervasive FP16 support, not just calling tensor cores.

We've known about this for a while, but Facebook produced a number: 2.9x from FP16 https://arxiv.org/pdf/1806.00187.pdf

kpu avatar Jun 07 '18 19:06 kpu

Should be doable with the new type interface. Can you recommend someone?

emjotde avatar Jun 07 '18 19:06 emjotde

Sooo, what do we do about this?

emjotde avatar Sep 13 '18 20:09 emjotde

Hi; still no progress on this? :-)

obo avatar Feb 08 '19 11:02 obo

Lots of progress actually.

Have something that's close to working. Testing and patching things up.

emjotde avatar Feb 08 '19 14:02 emjotde

@emjotde Is your work on a separate branch? I'm interested to hear how things are going.

erip avatar Feb 22 '19 18:02 erip

Just pushed to the types branch. Everything transformer-based should work with --fp16 or the equivalent --precision float16 float32 --cost-scaling 7 2000 2 0.05. I think RNNs like GRU or LSTM will blow up at the moment; I just realized I forgot to type-proof those kernels.
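
(For reference, the two equivalent invocations above would look roughly like this; a sketch in which the model, vocabulary, and training-data paths are placeholders:)

./marian -m model.npz -v vocab.src vocab.trg -t corpus.src corpus.trg --fp16
./marian -m model.npz -v vocab.src vocab.trg -t corpus.src corpus.trg --precision float16 float32 --cost-scaling 7 2000 2 0.05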

I will attempt to merge this into master maybe next week. I am still having some problems with convergence for various models, so this is highly experimental and I will not accept complaints :)

emjotde avatar Feb 23 '19 03:02 emjotde

OK, GRUs and LSTMs with fp16 should work too, though they are not properly tested yet.

emjotde avatar Feb 23 '19 05:02 emjotde

@emjotde I'm trying out the types branch. --fp16 doesn't work when --guided-alignment is set. It seems to work fine without it, though.

rihardsk avatar Apr 10 '19 11:04 rihardsk

Thanks a lot for the implementation of FP16, the speedup is awesome! Is there an easy way to use a model trained with --fp16 to translate on a CPU? When I currently try to translate with marian-decoder on a CPU, I get: Error: CopyCastFrom from type float16 not implemented
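
(A workaround that might be worth trying, assuming marian-conv can cast float16 weights back to float32; an untested sketch with placeholder paths, not a confirmed fix:)

./marian-conv -f model-fp16.npz -t model-fp32.npz -g float32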

graftim avatar Jul 18 '19 06:07 graftim

@emjotde Are there still unsolved problems that keep the types branch from being merged into master?

Can I use --fp16 for both training and inference on that branch?

I use the master branch (v1.7.12) and can compile it successfully with NCCL.

But when I compile the v1.7.8 types branch in the same environment (local env is CUDA 9.0), I get the following errors:

nvcc fatal : Unsupported gpu architecture 'compute_75'
CMake Error at marian_cuda_generated_helpers.cu.o.cmake:203 (message): Error generating marian-dev-types/build/src/CMakeFiles/marian_cuda.dir/translator/./marian_cuda_generated_helpers.cu.o

src/CMakeFiles/marian_cuda.dir/build.make:112: recipe for target 'src/CMakeFiles/marian_cuda.dir/translator/marian_cuda_generated_helpers.cu.o' failed
make[2]: *** [src/CMakeFiles/marian_cuda.dir/translator/marian_cuda_generated_helpers.cu.o] Error 1
CMakeFiles/Makefile2:134: recipe for target 'src/CMakeFiles/marian_cuda.dir/all' failed
make[1]: *** [src/CMakeFiles/marian_cuda.dir/all] Error 2
Makefile:83: recipe for target 'all' failed
make: *** [all] Error 2

I then compared the CMakeLists.txt files of the master and types branches and replaced the relevant parts, but I still get some errors, even though my cuda_fp16.h file contains:

CUDA_FP16_DECL __half hsin(const __half a);
CUDA_FP16_DECL __half hcos(const __half a);
CUDA_FP16_DECL __half hlog(const __half a);
CUDA_FP16_DECL __half hexp(const __half a);
CUDA_FP16_DECL __half hsqrt(const __half a);

[ 5%] Built target nccl_install
[ 5%] Built target 3rd_party_installs
[ 5%] Building NVCC (Device) object src/CMakeFiles/marian_cuda.dir/translator/marian_cuda_generated_helpers.cu.o
/common-data/MT/tools/marian-dev-types/src/functional/operators.h(435): error: identifier "hsin" is undefined
marian-dev-types/src/functional/operators.h(436): error: identifier "hcos" is undefined
marian-dev-types/src/functional/operators.h(437): error: identifier "hsin" is undefined
marian-dev-types/src/functional/operators.h(437): error: identifier "hcos" is undefined
marian-dev-types/src/functional/operators.h(438): error: identifier "hlog" is undefined
marian-dev-types/src/functional/operators.h(439): error: identifier "hexp" is undefined
marian-dev-types/src/functional/operators.h(441): error: identifier "hsqrt" is undefined
/common-data/MT/tools/marian-dev-types/src/functional/operators.h(442): error: more than one conversion function from "const half" to a built-in type applies:
  function "__half::operator float() const"
  function "__half::operator short() const"
  function "__half::operator unsigned short() const"
  function "__half::operator int() const"
  function "__half::operator unsigned int() const"
  function "__half::operator long long() const"
  function "__half::operator unsigned long long() const"
  function "__half::operator __nv_bool() const"
/common-data/MT/tools/marian-dev-types/src/functional/operators.h(443): error: more than one conversion function from "half" to a built-in type applies:
  function "__half::operator float() const"
  function "__half::operator short() const"
  function "__half::operator unsigned short() const"
  function "__half::operator int() const"
  function "__half::operator unsigned int() const"
  function "__half::operator long long() const"
  function "__half::operator unsigned long long() const"
  function "__half::operator __nv_bool() const"
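
(For context: compute_75 targets Turing GPUs and is only supported by nvcc from CUDA 10.0 onward, and the half-precision intrinsics hsin/hcos/hlog/hexp/hsqrt are only usable when compiling for compute capability 5.3 or higher, so a CUDA 9.0 toolchain cannot build this configuration as-is. A possible workaround, assuming the -gencode list is hard-coded in the top-level CMakeLists.txt; an untested sketch:)

grep -n "compute_75" CMakeLists.txt src/CMakeLists.txt   # locate the hard-coded arch list
# drop the compute_75/sm_75 entries (or upgrade to CUDA 10+), then reconfigure and rebuild:
cd build && cmake .. -DCOMPILE_CUDA=on && make -j4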

520jefferson avatar Sep 06 '19 01:09 520jefferson

The --fp16 flag should now work in master for inference with marian-decoder and marian-scorer. Still working on it for training; should come soon-ish.
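
(Sketch of the corresponding invocation; the model and vocabulary paths are placeholders:)

./marian-decoder -m model.npz -v vocab.src.yml vocab.trg.yml --fp16 < input.txt > output.txt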

emjotde avatar Jan 06 '20 22:01 emjotde

I'm trying --fp16 on transformers trained with Marian 1.7; attempting translation with --fp16 results in this:

(env) miguel@curie maestro (git)[update-test-model-en-pt] % cat explosion.log 
 /home/miguel/un/marian/build-1.10/marian-server -c marian-decbin.config --fp16 --quiet-translation
[2021-02-24 19:05:29] [marian] Marian v1.10.0 6f6d4846 2021-02-06 15:35:16 -0800
[2021-02-24 19:05:29] [marian] Running on curie as process 1247165 with command line:
[2021-02-24 19:05:29] [marian] /home/miguel/un/marian/build-1.10/marian-server -c marian-decbin.config --fp16 --quiet-translation
[2021-02-24 19:05:29] [config] alignment: soft
[2021-02-24 19:05:29] [config] allow-unk: false
[2021-02-24 19:05:29] [config] authors: false
[2021-02-24 19:05:29] [config] beam-size: 12
[2021-02-24 19:05:29] [config] bert-class-symbol: "[CLS]"
[2021-02-24 19:05:29] [config] bert-mask-symbol: "[MASK]"
[2021-02-24 19:05:29] [config] bert-masking-fraction: 0.15
[2021-02-24 19:05:29] [config] bert-sep-symbol: "[SEP]"
[2021-02-24 19:05:29] [config] bert-train-type-embeddings: true
[2021-02-24 19:05:29] [config] bert-type-vocab-size: 2
[2021-02-24 19:05:29] [config] best-deep: false
[2021-02-24 19:05:29] [config] build-info: ""
[2021-02-24 19:05:29] [config] cite: false
[2021-02-24 19:05:29] [config] cpu-threads: 1
[2021-02-24 19:05:29] [config] dec-cell: gru
[2021-02-24 19:05:29] [config] dec-cell-base-depth: 2
[2021-02-24 19:05:29] [config] dec-cell-high-depth: 1
[2021-02-24 19:05:29] [config] dec-depth: 6
[2021-02-24 19:05:29] [config] devices:
[2021-02-24 19:05:29] [config]   - 0
[2021-02-24 19:05:29] [config] dim-emb: 512
[2021-02-24 19:05:29] [config] dim-rnn: 1024
[2021-02-24 19:05:29] [config] dim-vocabs:
[2021-02-24 19:05:29] [config]   - 40044
[2021-02-24 19:05:29] [config]   - 40044
[2021-02-24 19:05:29] [config] dump-config: ""
[2021-02-24 19:05:29] [config] enc-cell: gru
[2021-02-24 19:05:29] [config] enc-cell-depth: 1
[2021-02-24 19:05:29] [config] enc-depth: 6
[2021-02-24 19:05:29] [config] enc-type: bidirectional
[2021-02-24 19:05:29] [config] ignore-model-config: false
[2021-02-24 19:05:29] [config] input:
[2021-02-24 19:05:29] [config]   - stdin
[2021-02-24 19:05:29] [config] input-types:
[2021-02-24 19:05:29] [config]   []
[2021-02-24 19:05:29] [config] interpolate-env-vars: false
[2021-02-24 19:05:29] [config] layer-normalization: false
[2021-02-24 19:05:29] [config] lemma-dim-emb: 0
[2021-02-24 19:05:29] [config] log: ""
[2021-02-24 19:05:29] [config] log-level: info
[2021-02-24 19:05:29] [config] log-time-zone: ""
[2021-02-24 19:05:29] [config] max-length: 10000
[2021-02-24 19:05:29] [config] max-length-crop: false
[2021-02-24 19:05:29] [config] max-length-factor: 3
[2021-02-24 19:05:29] [config] maxi-batch: 16
[2021-02-24 19:05:29] [config] maxi-batch-sort: none
[2021-02-24 19:05:29] [config] mini-batch: 1
[2021-02-24 19:05:29] [config] mini-batch-words: 0
[2021-02-24 19:05:29] [config] models:
[2021-02-24 19:05:29] [config]   - /home/miguel/data/un/test_data/models/en-pt/model.bin
[2021-02-24 19:05:29] [config] n-best: false
[2021-02-24 19:05:29] [config] no-spm-decode: false
[2021-02-24 19:05:29] [config] normalize: 1
[2021-02-24 19:05:29] [config] num-devices: 0
[2021-02-24 19:05:29] [config] output: stdout
[2021-02-24 19:05:29] [config] output-approx-knn:
[2021-02-24 19:05:29] [config]   []
[2021-02-24 19:05:29] [config] output-omit-bias: false
[2021-02-24 19:05:29] [config] output-sampling: false
[2021-02-24 19:05:29] [config] port: 8080
[2021-02-24 19:05:29] [config] precision:
[2021-02-24 19:05:29] [config]   - float16
[2021-02-24 19:05:29] [config] quiet: false
[2021-02-24 19:05:29] [config] quiet-translation: true
[2021-02-24 19:05:29] [config] relative-paths: false
[2021-02-24 19:05:29] [config] right-left: false
[2021-02-24 19:05:29] [config] seed: 0
[2021-02-24 19:05:29] [config] shortlist:
[2021-02-24 19:05:29] [config]   []
[2021-02-24 19:05:29] [config] skip: false
[2021-02-24 19:05:29] [config] skip-cost: false
[2021-02-24 19:05:29] [config] tied-embeddings: false
[2021-02-24 19:05:29] [config] tied-embeddings-all: true
[2021-02-24 19:05:29] [config] tied-embeddings-src: false
[2021-02-24 19:05:29] [config] transformer-aan-activation: swish
[2021-02-24 19:05:29] [config] transformer-aan-depth: 2
[2021-02-24 19:05:29] [config] transformer-aan-nogate: false
[2021-02-24 19:05:29] [config] transformer-decoder-autoreg: self-attention
[2021-02-24 19:05:29] [config] transformer-depth-scaling: false
[2021-02-24 19:05:29] [config] transformer-dim-aan: 2048
[2021-02-24 19:05:29] [config] transformer-dim-ffn: 2048
[2021-02-24 19:05:29] [config] transformer-ffn-activation: swish
[2021-02-24 19:05:29] [config] transformer-ffn-depth: 2
[2021-02-24 19:05:29] [config] transformer-guided-alignment-layer: last
[2021-02-24 19:05:29] [config] transformer-heads: 8
[2021-02-24 19:05:29] [config] transformer-no-projection: false
[2021-02-24 19:05:29] [config] transformer-pool: false
[2021-02-24 19:05:29] [config] transformer-postprocess: da
[2021-02-24 19:05:29] [config] transformer-postprocess-emb: d
[2021-02-24 19:05:29] [config] transformer-postprocess-top: ""
[2021-02-24 19:05:29] [config] transformer-preprocess: n
[2021-02-24 19:05:29] [config] transformer-tied-layers:
[2021-02-24 19:05:29] [config]   []
[2021-02-24 19:05:29] [config] transformer-train-position-embeddings: false
[2021-02-24 19:05:29] [config] tsv: false
[2021-02-24 19:05:29] [config] tsv-fields: 0
[2021-02-24 19:05:29] [config] type: transformer
[2021-02-24 19:05:29] [config] version: v1.7.6 9fd5ba9 2019-11-27 19:28:16 -0800
[2021-02-24 19:05:29] [config] vocabs:
[2021-02-24 19:05:29] [config]   - /home/miguel/data/un/test_data/models/en-pt/vocab.en.json
[2021-02-24 19:05:29] [config]   - /home/miguel/data/un/test_data/models/en-pt/vocab.pt.json
[2021-02-24 19:05:29] [config] weights:
[2021-02-24 19:05:29] [config]   - 1.0
[2021-02-24 19:05:29] [config] word-penalty: 0
[2021-02-24 19:05:29] [config] word-scores: false
[2021-02-24 19:05:29] [config] workspace: 512
[2021-02-24 19:05:29] [config] Loaded model has been created with Marian v1.7.6 9fd5ba9 2019-11-27 19:28:16 -0800
[2021-02-24 19:05:29] [data] Loading vocabulary from JSON/Yaml file /home/miguel/data/un/test_data/models/en-pt/vocab.en.json
[2021-02-24 19:05:30] [data] Loading vocabulary from JSON/Yaml file /home/miguel/data/un/test_data/models/en-pt/vocab.pt.json
[2021-02-24 19:05:30] [memory] Extending reserved space to 512 MB (device cpu0)
[2021-02-24 19:05:30] Loading scorer of type transformer as feature F0
[2021-02-24 19:05:30] Loading model from /home/miguel/data/un/test_data/models/en-pt/model.bin
[2021-02-24 19:05:30] Server is listening on port 8080

[2021-02-24 19:06:16] Error: Child 1 has different type (first: float32 != child: float16)
[2021-02-24 19:06:16] Error: Aborted from static marian::Type marian::NaryNodeOp::commonType(const std::vector<IntrusivePtr<marian::Chainable<IntrusivePtr<marian::TensorBase> > > >&) in /home/miguel/un/marian/src/graph/node.h:197

[CALL STACK]
[0x563c0c280ef0]    marian::NaryNodeOp::  commonType  (std::vector<IntrusivePtr<marian::Chainable<IntrusivePtr<marian::TensorBase>>>,std::allocator<IntrusivePtr<marian::Chainable<IntrusivePtr<marian::TensorBase>>>>> const&) + 0x2b0
[0x563c0c2b00ff]    marian::ElementBinaryNodeOp::  ElementBinaryNodeOp  (IntrusivePtr<marian::Chainable<IntrusivePtr<marian::TensorBase>>>,  IntrusivePtr<marian::Chainable<IntrusivePtr<marian::TensorBase>>>) + 0x36f
[0x563c0c2b0760]    IntrusivePtr<marian::Chainable<IntrusivePtr<marian::TensorBase>>> marian::  Expression  <marian::PlusNodeOp,IntrusivePtr<marian::Chainable<IntrusivePtr<marian::TensorBase>>>&,IntrusivePtr<marian::Chainable<IntrusivePtr<marian::TensorBase>>>&>(IntrusivePtr<marian::Chainable<IntrusivePtr<marian::TensorBase>>>&,  IntrusivePtr<marian::Chainable<IntrusivePtr<marian::TensorBase>>>&) + 0x80
[0x563c0c1f0e02]    marian::  operator+  (IntrusivePtr<marian::Chainable<IntrusivePtr<marian::TensorBase>>>,  IntrusivePtr<marian::Chainable<IntrusivePtr<marian::TensorBase>>>) + 0x22
[0x563c0c3fc0b5]    marian::Transformer<marian::EncoderBase>::  Attention  (std::__cxx11::basic_string<char,std::char_traits<char>,std::allocator<char>>,  IntrusivePtr<marian::Chainable<IntrusivePtr<marian::TensorBase>>>,  IntrusivePtr<marian::Chainable<IntrusivePtr<marian::TensorBase>>>,  IntrusivePtr<marian::Chainable<IntrusivePtr<marian::TensorBase>>>,  IntrusivePtr<marian::Chainable<IntrusivePtr<marian::TensorBase>>>,  bool,  int) + 0x135
[0x563c0c423253]    marian::Transformer<marian::EncoderBase>::  MultiHead  (std::__cxx11::basic_string<char,std::char_traits<char>,std::allocator<char>>,  int,  int,  IntrusivePtr<marian::Chainable<IntrusivePtr<marian::TensorBase>>>,  IntrusivePtr<marian::Chainable<IntrusivePtr<marian::TensorBase>>> const&,  IntrusivePtr<marian::Chainable<IntrusivePtr<marian::TensorBase>>> const&,  IntrusivePtr<marian::Chainable<IntrusivePtr<marian::TensorBase>>> const&,  bool,  bool) + 0xb63
[0x563c0c424dd3]    marian::Transformer<marian::EncoderBase>::  LayerAttention  (std::__cxx11::basic_string<char,std::char_traits<char>,std::allocator<char>>,  IntrusivePtr<marian::Chainable<IntrusivePtr<marian::TensorBase>>>,  IntrusivePtr<marian::Chainable<IntrusivePtr<marian::TensorBase>>> const&,  IntrusivePtr<marian::Chainable<IntrusivePtr<marian::TensorBase>>> const&,  IntrusivePtr<marian::Chainable<IntrusivePtr<marian::TensorBase>>> const&,  int,  bool,  bool) + 0x243
[0x563c0c425c6e]    marian::EncoderTransformer::  apply  (std::shared_ptr<marian::data::CorpusBatch>) + 0x95e
[0x563c0c426bfa]    marian::EncoderTransformer::  build  (std::shared_ptr<marian::ExpressionGraph>,  std::shared_ptr<marian::data::CorpusBatch>) + 0x9a
[0x563c0c4472a7]    marian::EncoderDecoder::  startState  (std::shared_ptr<marian::ExpressionGraph>,  std::shared_ptr<marian::data::CorpusBatch>) + 0xd7
[0x563c0c375e57]    marian::models::Stepwise::  startState  (std::shared_ptr<marian::ExpressionGraph>,  std::shared_ptr<marian::data::CorpusBatch>) + 0x87
[0x563c0c0f129b]    marian::ScorerWrapper::  startState  (std::shared_ptr<marian::ExpressionGraph>,  std::shared_ptr<marian::data::CorpusBatch>) + 0xeb
[0x563c0c0d3979]    marian::BeamSearch::  search  (std::shared_ptr<marian::ExpressionGraph>,  std::shared_ptr<marian::data::CorpusBatch>) + 0x729
[0x563c0bf2db44]    marian::TranslateService<marian::BeamSearch>::run(std::__cxx11::basic_string<char,std::char_traits<char>,std::allocator<char>> const&)::{lambda(unsigned long)#1}::  operator()  (unsigned long) const + 0x164
[0x563c0bf2f111]    marian::ThreadPool::enqueue<marian::TranslateService<marian::BeamSearch>::run(std::__cxx11::basic_string<char,std::char_traits<char>,std::allocator<char>> const&)::{lambda(unsigned long)#1}&,unsigned long&>(std::result_of&&,(marian::TranslateService<marian::BeamSearch>::run(std::__cxx11::basic_string<char,std::char_traits<char>,std::allocator<char>> const&)::{lambda(unsigned long)#1}&)...)::{lambda()#1}::  operator()  () const + 0x31
[0x563c0bf2fcf4]    std::_Function_handler<std::unique_ptr<std::__future_base::_Result_base,std::__future_base::_Result_base::_Deleter> (),std::__future_base::_Task_setter<std::unique_ptr<std::__future_base::_Result<void>,std::__future_base::_Result_base::_Deleter>,std::__future_base::_Task_state<marian::ThreadPool::enqueue<marian::TranslateService<marian::BeamSearch>::run(std::__cxx11::basic_string<char,std::char_traits<char>,std::allocator<char>> const&)::{lambda(unsigned long)#1}&,unsigned long&>(std::result_of&&,(marian::TranslateService<marian::BeamSearch>::run(std::__cxx11::basic_string<char,std::char_traits<char>,std::allocator<char>> const&)::{lambda(unsigned long)#1}&)...)::{lambda()#1},std::allocator<int>,void ()>::_M_run()::{lambda()#1},void>>::  _M_invoke  (std::_Any_data const&) + 0x34
[0x563c0bebba1d]    std::__future_base::_State_baseV2::  _M_do_set  (std::function<std::unique_ptr<std::__future_base::_Result_base,std::__future_base::_Result_base::_Deleter> ()>*,  bool*) + 0x2d
[0x7f6ad944547f]                                                       + 0x1247f
[0x563c0bec06f3]    std::_Function_handler<void (),marian::ThreadPool::enqueue<marian::TranslateService<marian::BeamSearch>::run(std::__cxx11::basic_string<char,std::char_traits<char>,std::allocator<char>> const&)::{lambda(unsigned long)#1}&,unsigned long&>(std::result_of&&,(marian::TranslateService<marian::BeamSearch>::run(std::__cxx11::basic_string<char,std::char_traits<char>,std::allocator<char>> const&)::{lambda(unsigned long)#1}&)...)::{lambda()#3}>::  _M_invoke  (std::_Any_data const&) + 0x123
[0x563c0bebe368]    std::thread::_State_impl<std::thread::_Invoker<std::tuple<marian::ThreadPool::reserve(unsigned long)::{lambda()#1}>>>::  _M_run  () + 0x168
[0x7f6ad9317d84]                                                       + 0xd6d84
[0x7f6ad943c609]                                                       + 0x9609
[0x7f6ad9005293]    clone                                              + 0x43

./run_marian.sh: line 5: 1247165 Aborted                 (core dumped) ~/un/marian/build-1.10/marian-server -c marian-decbin.config --fp16 --quiet-translation

The binary model was created like so:

(env) miguel@curie en-pt (git)[update-test-model-en-pt] % ~/un/marian/build-1.10/marian-conv -g packed16 
[2021-02-24 19:05:23] Outputting model.bin, precision: packed16
[2021-02-24 19:05:23] Loading model from model.npz
[2021-02-24 19:05:23] [memory] Reserving 246 MB, device cpu0
[2021-02-24 19:05:24] Finished

I've also tried intgemm16 and had the same problem.

msf avatar Feb 24 '21 19:02 msf

I guess the gradients are exploding when training with --fp16; the cost-scaling factor has been halved all the way down to its minimum (7.62939e-06, i.e. 2^-17):

[2021-03-09 13:30:04] NaN/Inf percentage 1.00 in 10 gradient updates, but cost-scaling factor 7.62939e-06 is already at minimum

Hence the loss is always NaN:

[2021-03-09 13:44:27] Ep. 1 : Up. 1000 : Sen. 265,584 : Cost nan : Time 1064.13s : 12817.79 words/s : gNorm 0.0000

@emjotde Is it possible to use a model that was not trained with fp16 with marian-decoder's --fp16 flag? I tried it, but the results seem bad (on GPU).

GokulNC avatar Mar 10 '21 05:03 GokulNC

@msf Are you trying to decode on the GPU? If so, then just use the normal model. You have been converting things to special CPU models.

@GokulNC Can you open a separate issue? Let's not turn this old issue into a catch-all for fp16.

emjotde avatar Mar 10 '21 07:03 emjotde

@emjotde, this is on a CPU, an Intel i7 (8th gen).

msf avatar Mar 12 '21 21:03 msf

@msf Oh, OK. Don't use the --fp16 flag in that case; that's GPU fp16 only.
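
(In other words, a packed16 or intgemm16 model would be decoded on the CPU without --fp16, roughly like this; a sketch with placeholder paths and thread count:)

./marian-decoder -m model.bin -v vocab.en.json vocab.pt.json --cpu-threads 8 < input.txt > output.txt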

emjotde avatar Mar 12 '21 21:03 emjotde