Guillaume Klein
# What does this PR do? The lang tokens were missing from `M2M100Tokenizer.get_vocab`. The `get_vocab` method is updated to match other multilingual tokenizers such as `NllbTokenizer` and `MBart50Tokenizer`. ## Before...
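As a minimal sketch of the pattern shared by `NllbTokenizer` and `MBart50Tokenizer`, `get_vocab` can be built from the base ids and then merged with the added-token encoder so the language codes are included. The `ToyTokenizer` class below is illustrative, not the actual Hugging Face implementation:

```python
# Sketch of the multilingual get_vocab pattern: build the vocab from the
# base ids, then merge the added tokens (which include the lang codes).
# ToyTokenizer is a hypothetical stand-in for the real tokenizer classes.
class ToyTokenizer:
    def __init__(self, base_tokens, lang_codes):
        self._base = {tok: i for i, tok in enumerate(base_tokens)}
        # Language codes get ids past the base vocabulary, mirroring how
        # added/special tokens are appended after vocab_size.
        self.added_tokens_encoder = {
            code: len(base_tokens) + i for i, code in enumerate(lang_codes)
        }

    @property
    def vocab_size(self):
        return len(self._base)

    def convert_ids_to_tokens(self, i):
        return {v: k for k, v in self._base.items()}[i]

    def get_vocab(self):
        vocab = {self.convert_ids_to_tokens(i): i for i in range(self.vocab_size)}
        vocab.update(self.added_tokens_encoder)  # include the lang tokens
        return vocab
```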
# Summary https://github.com/intel/mkl-dnn/commit/274be8228a0dba6391c2769c37cd68a3bb730fbf added AVX2 optimizations for igemm kernels (as discussed in https://github.com/intel/mkl-dnn/issues/532). However, the execution appears to be 1.4x slower than using version v0.21 compiled with Intel MKL. In...
Fairseq recently released a new version 0.12.1 to PyPI. This version is breaking the conversion of M2M-100 which fails with the following error: ```text Traceback (most recent call last): File...
The following binary operators currently do not support broadcasting:

* `ops::Add`
* `ops::Mul`
* `ops::Sub`

One should instead call lower-level primitives such as `add_depth_broadcast`, which require device and type...
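A NumPy sketch of the depth-broadcast case that `add_depth_broadcast` covers, and that a general broadcasting `ops::Add` would need to handle: a `[depth]` vector is added to every row of a `[batch, depth]` input. The loop makes the broadcast explicit:

```python
import numpy as np

# Reference semantics for depth broadcasting: the bias vector is reused
# for every row of the batched input instead of requiring equal shapes.
def add_depth_broadcast(x, bias):
    batch, depth = x.shape
    assert bias.shape == (depth,)
    out = np.empty_like(x)
    for b in range(batch):  # the broadcast: one bias row per input row
        out[b] = x[b] + bias
    return out
```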
The MatMul API from [cublasLt](https://docs.nvidia.com/cuda/cublas/index.html#using-the-cublasLt-api) can be configured to also add the bias and apply ReLU. We should look into this.
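For reference, the fused epilogue computes `relu(A @ B + bias)` in a single kernel instead of three separate passes over the output. A NumPy sketch of the intended semantics (not the cublasLt API itself):

```python
import numpy as np

# Reference semantics of the fused GEMM epilogue: bias addition and ReLU
# applied to the matmul output, here as three NumPy steps for clarity.
def matmul_bias_relu(a, b, bias):
    c = a @ b                  # GEMM
    c = c + bias               # bias added per output column
    return np.maximum(c, 0.0)  # ReLU epilogue
```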
The GEMM backend is selected at runtime depending on the requested compute type and CPU information. The dispatch to the correct implementation is done with a switch statement: https://github.com/OpenNMT/CTranslate2/blob/3f6ac9cb22528c4b17b65783811f795ac6a85538/src/cpu/primitives.cc#L533-L612 This...
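The dispatch idea can be sketched as a function mapping the requested compute type and detected CPU features to an implementation. The backend names and feature flags below are illustrative, not CTranslate2's actual identifiers:

```python
# Sketch of runtime GEMM backend selection: prefer the widest supported
# vector extension for the requested compute type, else fall back.
# All names here are hypothetical placeholders.
def select_gemm_backend(compute_type, cpu_features):
    if compute_type == "int8":
        if "avx512" in cpu_features:
            return "int8_avx512"
        if "avx2" in cpu_features:
            return "int8_avx2"
        return "int8_fallback"
    if compute_type == "float32":
        return "float32_blas"
    raise ValueError(f"unsupported compute type: {compute_type}")
```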
The dequantization of GEMM output on CPU is currently not vectorized: https://github.com/OpenNMT/CTranslate2/blob/v1.17.0/src/ops/dequantize_cpu.cc The performance could be slightly improved by vectorizing this operation and fusing bias addition and ReLU. This is...
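As a reference for the fusion, a NumPy sketch of dequantizing the int32 GEMM output, adding the bias, and applying ReLU in one pass. The single scalar scales per operand are a simplification of the real per-row/per-column scale layout:

```python
import numpy as np

# Reference semantics for fused dequantize + bias + ReLU on int32 GEMM
# output: divide by the product of the input scales, add bias, clamp.
def dequantize_bias_relu(c_int32, a_scale, b_scale, bias):
    c = c_int32.astype(np.float32) / (a_scale * b_scale)
    c = c + bias
    return np.maximum(c, 0.0)
```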
Similar to the recent CTranslate2 work (https://github.com/OpenNMT/CTranslate2/pull/769), we should publish ARM64 wheels for macOS. I had a first look but did not immediately find the correct configuration to cross-compile ICU...