Jaume Zaragoza comments

Results: 124 comments of Jaume Zaragoza

Postprocess script does not remove starting space

I don't know if it does anything else, but I personally prefer to use it. It seems that it also removes trailing and duplicate spaces; see https://github.com/google/sentencepiece/issues/650.
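As a rough illustration of the whitespace cleanup discussed here (leading, trailing and duplicate spaces), the following is a minimal sketch in plain Python. It is not the postprocess script from the issue and not SentencePiece's own behavior; the function name `normalize_spaces` is hypothetical.

```python
import re


def normalize_spaces(line: str) -> str:
    """Collapse runs of spaces and drop leading/trailing spaces.

    Hypothetical helper for illustration only; it does not reproduce
    any particular tool's normalization rules.
    """
    # Replace two or more consecutive spaces with a single space.
    collapsed = re.sub(r" {2,}", " ", line)
    # Remove spaces at the start and end of the line.
    return collapsed.strip(" ")


if __name__ == "__main__":
    print(repr(normalize_spaces("  Hello   world ")))
```

Only the ASCII space is handled here; tabs and other Unicode whitespace are deliberately left alone, which may or may not match what a given pipeline needs.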

Vocabulary construction

Most LLM vocabularies use BPE, and I remember that back when SentencePiece was becoming established, some papers argued that SP was worse than BPE for NMT. So...

Vocabulary construction

Updated the issue with the BPE and case-aware options.

Resources for Kurdish

More: https://github.com/sinaahmadi/awesome-kurdish (the KTC corpus seems useful, and maybe others too)

"Invalid entry type: 32" in hq log when force removing a submission queue

Sure! But I've made a few changes, and now it seems that the error message shown by `hq log` is different every time (it can sometimes even finish completing the...

"Invalid entry type: 32" in hq log when force removing a submission queue

There you go: https://pastebin.com/x1hjApfH If the corruption might happen due to multiple processes (`hq server`) writing to the same log file, this is probably not the case. I'm not sharing...

"Invalid entry type: 32" in hq log when force removing a submission queue

Thanks! For now, I more or less have a workaround: sleeping for `--idle-timeout` + 15s after `hq submit --progress` finishes, then removing the queue without `--force`.
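The workaround above could be scripted roughly as follows. This is a minimal sketch under stated assumptions: `remove_cmd` is a placeholder for whatever command removes the queue without `--force` in your setup (it is not spelled out here), `idle_timeout_s` must match the `--idle-timeout` value you configured, and the function name is hypothetical. It is meant to be called after `hq submit --progress` has returned.

```python
import subprocess
import time


def remove_queue_after_idle(remove_cmd, idle_timeout_s, margin_s=15.0,
                            sleep=time.sleep, run=subprocess.run):
    """Wait for idle timeout plus a margin, then run the removal command.

    remove_cmd: argument list for the non-forced queue removal
                (placeholder; supplied by the caller).
    idle_timeout_s: the configured idle timeout, in seconds.
    margin_s: extra safety margin (15 s in the comment above).
    sleep, run: injectable for testing; default to the real functions.
    """
    wait_s = idle_timeout_s + margin_s
    # Give idle workers time to shut down on their own.
    sleep(wait_s)
    # Remove the queue gracefully; raise if the command fails.
    run(remove_cmd, check=True)
    return wait_s
```

Passing `sleep` and `run` as parameters keeps the waiting logic testable without actually sleeping or touching a running server.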

How to develop a C++ tokenizer for MarianMT in C++

I don't know if you solved this, but I ran into a similar issue that may explain yours. Some (if not all) OpusMT models do not use SentencePiece integrated into...

Currency translation for English to German is incorrect

I think `split-digits`, as mentioned in #887, can help with this. Copy behavior for numbers should be better, as all the digits will be a bit more equally trained.
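To make the idea concrete, here is a small sketch of what splitting digits does to a number before subword segmentation. It only illustrates the effect (every digit becomes its own piece) and is not the tokenizer's implementation; the function name `split_digit_pieces` is hypothetical.

```python
import re


def split_digit_pieces(text: str) -> list[str]:
    """Split text so that every digit is a separate piece.

    Non-digit runs are kept together. Illustration only; a real
    subword tokenizer would further segment the non-digit runs.
    """
    # Match either a single digit or a maximal run of non-digits.
    return re.findall(r"\d|\D+", text)


if __name__ == "__main__":
    print(split_digit_pieces("1250 EUR"))
```

With this kind of splitting, a price such as `1250` is seen as four single-digit pieces rather than one rare multi-digit token, so each digit appears far more often in training and copying numbers becomes an easier task for the model.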

English to Serbian has low quality of the teacher models

Serbian, Bosnian, Croatian and Montenegrin, all into English, can be handled by a single Latin-script model for all Serbo-Croatian, adding [cyrtranslit](https://github.com/opendatakosovo/cyrillic-transliteration) to transliterate Cyrillic into Latin. But if transliteration can't...
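For readers unfamiliar with the transliteration step, the sketch below shows the Serbian Cyrillic to Latin mapping in plain Python. It is a self-contained stand-in written for illustration, not the cyrtranslit library's code or API; the names `SR_CYR_TO_LAT` and `to_latin` are hypothetical, and capitalization of digraphs (for example `Љ` to `Lj`) is a simplifying assumption.

```python
# Serbian Cyrillic -> Latin, lowercase letters only (30 letters).
SR_CYR_TO_LAT = {
    "а": "a", "б": "b", "в": "v", "г": "g", "д": "d", "ђ": "đ",
    "е": "e", "ж": "ž", "з": "z", "и": "i", "ј": "j", "к": "k",
    "л": "l", "љ": "lj", "м": "m", "н": "n", "њ": "nj", "о": "o",
    "п": "p", "р": "r", "с": "s", "т": "t", "ћ": "ć", "у": "u",
    "ф": "f", "х": "h", "ц": "c", "ч": "č", "џ": "dž", "ш": "š",
}


def to_latin(text: str) -> str:
    """Transliterate Serbian Cyrillic characters to Latin script.

    Characters outside the table (Latin text, digits, punctuation)
    pass through unchanged.
    """
    out = []
    for ch in text:
        lower = ch.lower()
        latin = SR_CYR_TO_LAT.get(lower)
        if latin is None:
            out.append(ch)
        elif ch != lower:
            # Uppercase input: capitalize only the first Latin letter.
            out.append(latin.capitalize())
        else:
            out.append(latin)
    return "".join(out)


if __name__ == "__main__":
    print(to_latin("Београд"))
```

Because Cyrillic to Latin is a deterministic mapping for Serbian, applying it to the source side lets Cyrillic and Latin inputs share one vocabulary and one model.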