Jaume Zaragoza

124 comments of Jaume Zaragoza

There are also corpora (maybe it was UN) that do not use the ideographic full stop character and have the ASCII full stop instead. That should be fixed.

If the model has to be robust, it is probably a good thing to do? But in the case where Chinese is the target language, everything should be normalized to the...
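As a toy sketch of what that normalization could look like (the mapping and function name are illustrative, not taken from any of these tools), ASCII punctuation on a Chinese target side can be mapped to its fullwidth/ideographic counterpart:

```python
# Toy sketch: map ASCII sentence punctuation to the fullwidth/ideographic
# forms expected in Chinese text. Mapping and names are illustrative only.
ASCII_TO_CJK = {
    ".": "\u3002",  # ideographic full stop 。
    ",": "\uff0c",  # fullwidth comma ，
    "?": "\uff1f",  # fullwidth question mark ？
    "!": "\uff01",  # fullwidth exclamation mark ！
}

def normalize_zh_punct(text: str) -> str:
    """Replace ASCII punctuation with its CJK counterpart, char by char."""
    return "".join(ASCII_TO_CJK.get(ch, ch) for ch in text)

print(normalize_zh_punct("你好."))  # 你好。
```

A real fix would also need to decide what to do with ASCII punctuation inside embedded Latin-script fragments, which this character-by-character sketch ignores.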

I noticed this too, and it has always been like this. I think the bottleneck is decoding. Doing n-best with beam 8, it seems to make much less use of the GPU than...

Another alternative would be comparing with [ctranslate2](https://opennmt.net/CTranslate2/decoding.html), which has faster inference than Marian.

I have tried several times cleaning monolingual data with monocleaner and then using it for other tasks (training word embeddings, doing backtranslation) and never found improvements. It seems that the fluency...

I think `transformer-dim-ffn` applies to both. But since the decoder is `ssru`, which is a recurrent network, the parameters applied to the decoder are the `s2s` ones....
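For reference, a hedged sketch of how these options can sit together in a Bergamot-style student config (the option names are Marian's; the values and the exact combination here are illustrative, not a complete or verified config):

```yaml
# Illustrative fragment, not a complete Marian config.
type: transformer
enc-depth: 6
dec-depth: 2
transformer-dim-ffn: 1536        # transformer-* option
transformer-decoder-autoreg: rnn # decoder autoregression via a recurrent cell
dec-cell: ssru                   # s2s-style option selecting the SSRU cell
```

Under this reading, the `transformer-*` options shape the encoder (and shared dimensions), while the recurrent decoder picks up the `s2s`/`dec-*` options such as `dec-cell`.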

Localization data from software like this, I think, can really help with the translation of short sentences, especially when #888 is fixed :sweat_smile: EDIT: although some language pairs may need a...

I was hoping there was a way to get rid of tokenization, but after reading all this, I don't see other ways of doing what OpusTrainer is doing without the...

Chinese should be trained as two different models, one for Simplified and one for Traditional, as the two scripts together might have too large a character inventory to fit in 32k pieces.
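A rough way to sanity-check the script-size claim (a toy sketch; the function is illustrative and only looks at the basic CJK block) is to count distinct Han characters in a corpus sample before deciding on a shared vs. per-script vocabulary:

```python
def distinct_han_chars(text: str) -> set:
    """Collect distinct characters in the CJK Unified Ideographs block.

    Toy sketch: only checks the basic block (U+4E00..U+9FFF), ignoring the
    extension blocks, which is enough for a quick corpus estimate.
    """
    return {ch for ch in text if "\u4e00" <= ch <= "\u9fff"}

# On a real corpus, len(distinct_han_chars(corpus)) easily reaches several
# thousand per script, so a 32k subword vocab covering both scripts leaves
# little room for multi-character pieces.
sample = "简体字 and 繁體字"
print(len(distinct_han_chars(sample)))  # 5
```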

If this helps, you can run the `fastspell-download` command during installation, and that will download the model to the Python path.