Guillaume Klein

26 issues by Guillaume Klein

# What does this PR do? The lang tokens were missing from `M2M100Tokenizer.get_vocab`. The `get_vocab` method is updated to match other multilingual tokenizers such as `NllbTokenizer` and `MBart50Tokenizer`. ## Before...
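
A minimal sketch of the behavior this PR restores. It is written in C++ for consistency with the other examples on this page, not in the Python of `transformers`. It assumes a hypothetical `MultilingualVocab` whose base tokens have ids 0 to N-1 and whose language tokens, of the form `__en__`, are numbered from N onward. `get_vocab` has to return both sets.

```cpp
#include <cassert>
#include <cstddef>
#include <map>
#include <string>
#include <utility>
#include <vector>

// Hypothetical multilingual vocabulary: base tokens with ids 0..N-1, followed
// by language tokens such as "__en__" with ids N, N+1, ...
struct MultilingualVocab {
  std::map<std::string, int> encoder;           // base token -> id
  std::map<std::string, int> lang_token_to_id;  // language token -> id

  MultilingualVocab(std::map<std::string, int> base,
                    const std::vector<std::string>& lang_codes)
      : encoder(std::move(base)) {
    const int offset = static_cast<int>(encoder.size());
    for (std::size_t i = 0; i < lang_codes.size(); ++i)
      lang_token_to_id["__" + lang_codes[i] + "__"] = offset + static_cast<int>(i);
  }

  // Returns every token the tokenizer can emit, language tokens included.
  std::map<std::string, int> get_vocab() const {
    std::map<std::string, int> vocab = encoder;
    for (const auto& entry : lang_token_to_id)
      vocab[entry.first] = entry.second;
    // A language token must not shadow a base token.
    assert(vocab.size() == encoder.size() + lang_token_to_id.size());
    return vocab;
  }
};
```

Without the loop over `lang_token_to_id`, the returned map would contain only the base tokens. That is the symptom the PR describes.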

# Summary https://github.com/intel/mkl-dnn/commit/274be8228a0dba6391c2769c37cd68a3bb730fbf added AVX2 optimizations for igemm kernels (as discussed in https://github.com/intel/mkl-dnn/issues/532). However, the execution appears to be 1.4x slower than using version v0.21 compiled with Intel MKL. In...

performance
platform:x64

Fairseq recently released version 0.12.1 to PyPI. This version breaks the conversion of M2M-100, which fails with the following error: ```text Traceback (most recent call last): File...

bug

The binary operators `ops::Add`, `ops::Mul`, and `ops::Sub` currently do not support broadcasting. Callers must instead use lower-level primitives such as `add_depth_broadcast`, which require device and type...

enhancement
help wanted
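
One way to picture the missing feature is a binary op that validates shapes and broadcasts size-1 dimensions itself, so callers do not have to pick a specialized primitive. This sketch uses NumPy-style rules restricted to 2D row-major tensors. `Tensor2D` and `broadcast_binary` are hypothetical names, not the CTranslate2 `ops::Add` API, and a real implementation would also dispatch on device and data type.

```cpp
#include <cassert>
#include <cstddef>
#include <functional>
#include <stdexcept>
#include <vector>

// Hypothetical 2D row-major tensor.
struct Tensor2D {
  std::size_t rows, cols;
  std::vector<float> data;
};

// Broadcasts one dimension: sizes must match or one of them must be 1.
static std::size_t broadcast_dim(std::size_t a, std::size_t b) {
  if (a == b || b == 1) return a;
  if (a == 1) return b;
  throw std::invalid_argument("shapes are not broadcastable");
}

// Applies op element-wise, repeating size-1 dimensions of either input.
Tensor2D broadcast_binary(const Tensor2D& a, const Tensor2D& b,
                          const std::function<float(float, float)>& op) {
  assert(a.data.size() == a.rows * a.cols);
  assert(b.data.size() == b.rows * b.cols);
  Tensor2D c{broadcast_dim(a.rows, b.rows), broadcast_dim(a.cols, b.cols), {}};
  c.data.resize(c.rows * c.cols);
  for (std::size_t i = 0; i < c.rows; ++i) {
    for (std::size_t j = 0; j < c.cols; ++j) {
      // A size-1 dimension always reads index 0.
      const float x = a.data[(a.rows == 1 ? 0 : i) * a.cols + (a.cols == 1 ? 0 : j)];
      const float y = b.data[(b.rows == 1 ? 0 : i) * b.cols + (b.cols == 1 ? 0 : j)];
      c.data[i * c.cols + j] = op(x, y);
    }
  }
  return c;
}
```

With this design, adding a `1 x N` bias to an `M x N` matrix and scaling each row by an `M x 1` column go through the same entry point.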

The MatMul API from [cuBLASLt](https://docs.nvidia.com/cuda/cublas/index.html#using-the-cublasLt-api) can be configured to also add the bias and apply ReLU in the same call. We should look into this.

enhancement
help wanted
gpu
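
As a reference for what the fused call computes, here is a plain CPU sketch of a GEMM whose epilogue adds a per-column bias and applies ReLU before each output value is stored. The output is written once instead of being re-read by separate bias and activation kernels. The sketch assumes row-major layouts and does not call cuBLASLt. In cuBLASLt itself, the fusion is requested through the matmul descriptor's epilogue attribute (`CUBLASLT_MATMUL_DESC_EPILOGUE`, for example set to `CUBLASLT_EPILOGUE_RELU_BIAS`).

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <vector>

// Reference semantics of a GEMM with a fused "bias + ReLU" epilogue:
//   C[i][j] = max(0, sum_k A[i][k] * B[k][j] + bias[j])
// A is M x K, B is K x N, C is M x N, all row-major.
std::vector<float> gemm_bias_relu(const std::vector<float>& a,
                                  const std::vector<float>& b,
                                  const std::vector<float>& bias,
                                  std::size_t m, std::size_t n, std::size_t k) {
  assert(a.size() == m * k && b.size() == k * n && bias.size() == n);
  std::vector<float> c(m * n);
  for (std::size_t i = 0; i < m; ++i) {
    for (std::size_t j = 0; j < n; ++j) {
      float acc = 0.f;
      for (std::size_t p = 0; p < k; ++p)
        acc += a[i * k + p] * b[p * n + j];
      // Epilogue: bias and activation are applied before the single store.
      c[i * n + j] = std::max(0.f, acc + bias[j]);
    }
  }
  return c;
}
```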

The GEMM backend is selected at runtime depending on the requested compute type and CPU information. The dispatch to the correct implementation is done with a switch statement: https://github.com/OpenNMT/CTranslate2/blob/3f6ac9cb22528c4b17b65783811f795ac6a85538/src/cpu/primitives.cc#L533-L612 This...

enhancement
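
The switch-based dispatch described above can be sketched as follows. The `ComputeType` and `GemmBackend` enumerators, the `CpuInfo` fields, and the selection rules are all hypothetical stand-ins chosen for illustration. They do not reproduce the logic in `primitives.cc`.

```cpp
#include <cassert>
#include <stdexcept>

// Hypothetical subsets of compute types and GEMM backends.
enum class ComputeType { FLOAT32, INT16, INT8 };
enum class GemmBackend { MKL, ONEDNN, REFERENCE };

// Hypothetical CPU information detected once at startup.
struct CpuInfo {
  bool is_intel;
  bool has_avx2;
};

// Selects a GEMM implementation from the requested compute type and the CPU.
// The rules are illustrative only.
GemmBackend select_gemm_backend(ComputeType type, const CpuInfo& cpu) {
  switch (type) {
    case ComputeType::FLOAT32:
      return cpu.is_intel ? GemmBackend::MKL : GemmBackend::ONEDNN;
    case ComputeType::INT16:
      if (!cpu.is_intel)
        throw std::invalid_argument("INT16 GEMM is not supported on this CPU");
      return GemmBackend::MKL;
    case ComputeType::INT8:
      return cpu.has_avx2 ? GemmBackend::ONEDNN : GemmBackend::REFERENCE;
  }
  throw std::invalid_argument("unknown compute type");
}
```

A drawback of this pattern is that adding a backend or a compute type means editing the central switch.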

The dequantization of GEMM output on CPU is currently not vectorized: https://github.com/OpenNMT/CTranslate2/blob/v1.17.0/src/ops/dequantize_cpu.cc The performance could be slightly improved by vectorizing this operation and fusing it with the bias addition and ReLU. This is...

enhancement
cpu
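
Here is a sketch of the proposed fused pass, under stated assumptions. The int32 GEMM output is dequantized by dividing by the product of a per-row input scale and a per-column weight scale. A per-column bias is then added and ReLU is applied, all in one pass over memory. The name `dequantize_bias_relu` and the scale layout are hypothetical, not taken from `dequantize_cpu.cc`. The inner loop is a simple pass over contiguous arrays that compilers can typically auto-vectorize. Hand-written SIMD intrinsics would be a further step.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Dequantizes an M x N int32 GEMM output and fuses bias addition and ReLU:
//   y[i][j] = max(0, x[i][j] / (a_scale[i] * b_scale[j]) + bias[j])
// Assumed layout: a_scale has M entries, b_scale and bias have N entries.
void dequantize_bias_relu(const std::int32_t* x, const float* a_scale,
                          const float* b_scale, const float* bias, float* y,
                          std::size_t m, std::size_t n) {
  // Precompute reciprocals so the hot loop multiplies instead of dividing
  // (results may differ from exact division in the last bit).
  std::vector<float> inv_b(n);
  for (std::size_t j = 0; j < n; ++j)
    inv_b[j] = 1.f / b_scale[j];

  for (std::size_t i = 0; i < m; ++i) {
    const float inv_a = 1.f / a_scale[i];
    const std::int32_t* xi = x + i * n;
    float* yi = y + i * n;
    // Contiguous loop with no early exits: a typical auto-vectorization target.
    for (std::size_t j = 0; j < n; ++j) {
      const float v = static_cast<float>(xi[j]) * (inv_a * inv_b[j]) + bias[j];
      yi[j] = std::max(v, 0.f);
    }
  }
}
```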

Similarly to the recent CTranslate2 work (https://github.com/OpenNMT/CTranslate2/pull/769), we should publish ARM64 wheels for macOS. I had a first look but did not immediately find the correct configuration to cross-compile ICU...

enhancement
help wanted