Jaume Zaragoza

124 comments of Jaume Zaragoza

There are also corpora (maybe it was UN) that do not use the ideographic full stop character and have the ASCII full stop instead. That should be fixed.

If the model has to be robust, it is probably a good thing to do? But in the case where Chinese is the target language, everything should be normalized to the...
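As a toy sketch of what that normalization could look like (the mapping and function name are illustrative, not taken from any of these tools), ASCII punctuation on a Chinese target side can be mapped to its fullwidth/ideographic counterpart:

```python
# Toy sketch: map ASCII sentence punctuation to the fullwidth/ideographic
# forms expected in Chinese text. Mapping and names are illustrative only.
ASCII_TO_CJK = {
    ".": "\u3002",  # ideographic full stop 。
    ",": "\uff0c",  # fullwidth comma ，
    "?": "\uff1f",  # fullwidth question mark ？
    "!": "\uff01",  # fullwidth exclamation mark ！
}

def normalize_zh_punct(text: str) -> str:
    """Replace ASCII punctuation with its CJK counterpart, char by char."""
    return "".join(ASCII_TO_CJK.get(ch, ch) for ch in text)

print(normalize_zh_punct("你好."))  # 你好。
```

A real fix would also need to decide what to do with ASCII punctuation inside embedded Latin-script fragments, which this character-by-character sketch ignores.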

I noticed this too, and it has always been like this. I think the bottleneck is decoding. Doing n-best with beam 8, it seems to make much less use of the GPU than...

Another alternative would be comparing with [ctranslate2](https://opennmt.net/CTranslate2/decoding.html), which has faster inference than Marian.

I have tried several times cleaning monolingual data with monocleaner and then using it for other tasks (training word embeddings, doing backtranslation) and never found improvements. It seems that the fluency...

I think `transformer-dim-ffn` applies to both. But since the decoder is `ssru`, which is a recurrent network, the parameters applied to the decoder are the `s2s` ones....
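For reference, a hedged sketch of how these options can sit together in a Bergamot-style student config (the option names are Marian's; the values and the exact combination here are illustrative, not a complete or verified config):

```yaml
# Illustrative fragment, not a complete Marian config.
type: transformer
enc-depth: 6
dec-depth: 2
transformer-dim-ffn: 1536        # transformer-* option
transformer-decoder-autoreg: rnn # decoder autoregression via a recurrent cell
dec-cell: ssru                   # s2s-style option selecting the SSRU cell
```

Under this reading, the `transformer-*` options shape the encoder (and shared dimensions), while the recurrent decoder picks up the `s2s`/`dec-*` options such as `dec-cell`.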

Localization data from software like this, I think, can really help with the translation of short sentences, especially when #888 is fixed :sweat_smile: EDIT: although some language pairs may need a...

I was hoping there was a way to get rid of tokenization, but after reading all this, I don't see other ways of doing what OpusTrainer is doing without the...

Chinese should be trained as two different models, one for Simplified and one for Traditional, as the two scripts together might have too large a character inventory to fit in 32k pieces.
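A rough way to sanity-check the script-size claim (a toy sketch; the function is illustrative and only looks at the basic CJK block) is to count distinct Han characters in a corpus sample before deciding on a shared vs. per-script vocabulary:

```python
def distinct_han_chars(text: str) -> set:
    """Collect distinct characters in the CJK Unified Ideographs block.

    Toy sketch: only checks the basic block (U+4E00..U+9FFF), ignoring the
    extension blocks, which is enough for a quick corpus estimate.
    """
    return {ch for ch in text if "\u4e00" <= ch <= "\u9fff"}

# On a real corpus, len(distinct_han_chars(corpus)) easily reaches several
# thousand per script, so a 32k subword vocab covering both scripts leaves
# little room for multi-character pieces.
sample = "简体字 and 繁體字"
print(len(distinct_han_chars(sample)))  # 5
```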

If this helps, you can run the `fastspell-download` command during installation, and that will download the model to the Python path.