Jaume Zaragoza

Results 124 comments of Jaume Zaragoza

I don't know if it does anything else, but I personally prefer to use it. It also seems to remove trailing and duplicate spaces; see https://github.com/google/sentencepiece/issues/650.

Most LLM vocabularies use BPE, and I remember that back when SentencePiece was becoming established, some papers argued that SP was worse than BPE for NMT. So...

Updated the issue with the BPE and case-aware options.

More: https://github.com/sinaahmadi/awesome-kurdish The KTC corpus seems useful, and maybe others too.

Sure! But I've made a few changes, and now the error message shown by `hq log` seems to be different every time (sometimes it can even finish completing the...

There you go: https://pastebin.com/x1hjApfH If the corruption is expected to come from multiple processes (`hq server`) writing to the same log file, that is probably not the case here. I'm not sharing...

Thanks! For now, I more or less have a workaround: sleep for `--idle-timeout` + 15s after `hq submit --progress` finishes, then remove the queue without `--force`.

I don't know if you solved this, but I ran into a similar issue that can explain yours. Some (if not all) OpusMT models do not use SentencePiece integrated into...

I think `split-digits`, as mentioned in #887, can help with this. Copy behavior for numbers should improve, as all the digits will be trained a bit more equally.
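To illustrate the idea behind that option, here is a minimal sketch of what splitting digits means at the pre-tokenization level. It is not SentencePiece's implementation; the function name `split_digits` and the whitespace-based word splitting are assumptions made only for this example.

```python
import re


def split_digits(text: str) -> list[str]:
    """Split text on whitespace, then break every digit out as its own piece.

    Hypothetical helper for illustration only: it mimics the effect of a
    digit-splitting option, not the actual SentencePiece code path.
    """
    pieces: list[str] = []
    for word in text.split():
        # Each ASCII digit becomes a separate piece; runs of non-digits stay whole.
        pieces.extend(re.findall(r"[0-9]|[^0-9]+", word))
    return pieces


if __name__ == "__main__":
    print(split_digits("born in 1987"))
```

With every number decomposed into the same ten symbols, each digit is seen far more often during training than any individual multi-digit token would be, which is the reasoning given above for better copy behavior.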

Serbian, Bosnian, Croatian and Montenegrin, all into English, can be solved with a single Latin-script model for all of Serbo-Croatian, adding [cyrtranslit](https://github.com/opendatakosovo/cyrillic-transliteration) to transliterate Cyrillic into Latin. But if transliteration can't...
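As a rough sketch of the transliteration step, the snippet below maps Serbian Cyrillic to Latin with a hand-written table and `str.translate`. It does not use cyrtranslit; the names `CYR_TO_LAT` and `to_latin` are hypothetical, and title-casing the digraphs (Љ to Lj) is a simplification assumed for this example.

```python
# Lowercase Serbian Cyrillic letters and their Latin equivalents.
CYR_TO_LAT = {
    "а": "a", "б": "b", "в": "v", "г": "g", "д": "d", "ђ": "đ",
    "е": "e", "ж": "ž", "з": "z", "и": "i", "ј": "j", "к": "k",
    "л": "l", "љ": "lj", "м": "m", "н": "n", "њ": "nj", "о": "o",
    "п": "p", "р": "r", "с": "s", "т": "t", "ћ": "ć", "у": "u",
    "ф": "f", "х": "h", "ц": "c", "ч": "č", "џ": "dž", "ш": "š",
}

# Add uppercase letters; digraphs are title-cased (Љ -> Lj), a simplification
# that is not ideal for all-caps text.
_FULL_MAP = dict(CYR_TO_LAT)
_FULL_MAP.update({cyr.upper(): lat.capitalize() for cyr, lat in CYR_TO_LAT.items()})

# str.maketrans accepts a dict from single characters to replacement strings.
_TABLE = str.maketrans(_FULL_MAP)


def to_latin(text: str) -> str:
    """Transliterate Serbian Cyrillic to Latin; other characters pass through."""
    return text.translate(_TABLE)


if __name__ == "__main__":
    print(to_latin("Београд"))
```

Running all Cyrillic input through such a step before tokenization means the model only ever sees Latin script, which is what makes a single model for all four varieties plausible.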