Raphael Vienne
This issue concerns fuzzy deduplication of text pairs. Find the tradeoff between memory, speed, and accuracy when varying r and b. We need to find a way to use way...
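As a starting point for the analysis, here is a minimal sketch of how r and b shape the accuracy side of the tradeoff. It assumes r and b are the usual MinHash LSH parameters (r rows per band, b bands), which the issue text does not state explicitly. Under that assumption, two texts with Jaccard similarity s become a candidate pair with probability 1 - (1 - s^r)^b, and the signature needs r * b hash values per text, which drives memory and hashing time. The function names are hypothetical.

```python
def candidate_probability(s: float, r: int, b: int) -> float:
    """Probability that two items with Jaccard similarity s collide in
    at least one of b bands of r rows each (standard LSH banding)."""
    return 1.0 - (1.0 - s ** r) ** b


def approx_threshold(r: int, b: int) -> float:
    """Rough similarity at which the S-curve rises most steeply."""
    return (1.0 / b) ** (1.0 / r)


def signature_length(r: int, b: int) -> int:
    """Number of MinHash values stored per text (memory/speed proxy)."""
    return r * b


if __name__ == "__main__":
    # Hypothetical (r, b) settings, all with 128 hash values per text.
    for r, b in [(4, 32), (8, 16), (16, 8)]:
        probs = [round(candidate_probability(s, r, b), 3) for s in (0.5, 0.7, 0.9)]
        print(r, b, signature_length(r, b), round(approx_threshold(r, b), 3), probs)
```

With r * b held fixed, a larger r moves the threshold toward higher similarity (fewer false positives, more missed near-duplicates), while a larger b does the opposite.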
Partially closes https://github.com/gordicaleksa/Open-NLLB/issues/11 (analysis needs to be added).

Allows downloading lang pairs from the allenai nllb dataset (Hugging Face dataset): https://huggingface.co/datasets/allenai/nllb/tree/main

I've stored the NLLB_PAIRS (pairs released with NLLB paper) and CCMATRIX_PAIRS...
Go through the files for your native language and see whether there are any issues. Check out the getting started document [here](https://github.com/gordicaleksa/Open-NLLB/blob/nllb_replication/GETTING_STARTED.md) for how to download public bi-text for your...
Figure out the pickle issue mentioned here: https://github.com/facebookresearch/fairseq/issues/5315

# Conf file
[conf.zip](https://github.com/gordicaleksa/Open-NLLB/files/12575097/conf.zip)

# Current workaround:
```
import asyncio

import hydra
from omegaconf import DictConfig

# GenerateMultiModule comes from the project code; its import is not shown in the original snippet.


@hydra.main(config_path="conf", config_name="generate_multi_full")
def main(config: DictConfig) -> None:
    launcher = hydra.utils.instantiate(config.launcher)
    module = GenerateMultiModule(config)
    asyncio.run(module.run())
...
```
# Available models
- avoid lid.218 (original NLLB LID) -> it's CC-BY-NC
- https://github.com/laurieburchell/open-lid-dataset -> GPL license, 201 languages
- https://fasttext.cc/docs/en/language-identification.html -> lid.176, CC-BY-SA license, 176 languages

# Additional information...
Find the peak probability of the LID output for right-skewed (left mode) languages.

# Requirements
- LID model
- data

# Reference
- NLLB paper (https://arxiv.org/abs/2207.04672), pages 34-35
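One simple way to approach the peak-finding task is a histogram over the LID confidence scores of one language, taking the center of the fullest bin as the peak. The sketch below assumes the scores are already collected as floats in [0, 1] (for example the top-1 probability returned by the LID model per sentence); the function name and the bin count are hypothetical choices, and this is not necessarily the procedure used in the NLLB paper.

```python
from typing import Sequence, Tuple


def peak_probability(scores: Sequence[float], bins: int = 20) -> Tuple[float, int]:
    """Return (center of the most populated bin, its count) for scores in [0, 1]."""
    if not scores:
        raise ValueError("scores must not be empty")
    counts = [0] * bins
    for p in scores:
        if not 0.0 <= p <= 1.0:
            raise ValueError(f"score out of range: {p}")
        # A score of exactly 1.0 falls into the last bin.
        idx = min(int(p * bins), bins - 1)
        counts[idx] += 1
    # On ties, max() keeps the first (lowest-probability) bin.
    best = max(range(bins), key=counts.__getitem__)
    return (best + 0.5) / bins, counts[best]


if __name__ == "__main__":
    # Hypothetical scores with most of the mass at low confidence.
    demo = [0.12, 0.13, 0.14, 0.5, 0.9]
    print(peak_probability(demo, bins=10))
```

The bin width limits how precisely the peak is located; a kernel density estimate would give a smoother answer at the cost of an extra dependency.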