Raphael Vienne issues

Results: 6 issues of Raphael Vienne

MinHash: benchmark memory, speed and accuracy with varying r and b

This issue concerns fuzzy deduplication of text pairs. Find the tradeoff between memory, speed, and accuracy when varying r and b. We need to find a way to use way...
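As a starting point for such a benchmark, here is a minimal sketch of how r and b shape accuracy. It assumes r and b are the rows per band and the number of bands in the standard MinHash LSH banding scheme; the issue text itself does not define them. Under that assumption, two documents with Jaccard similarity s become a candidate pair with probability 1 - (1 - s^r)^b, and each document stores r * b hash values. The function names and the 128-hash budget are illustrative only.

```python
def candidate_probability(s: float, r: int, b: int) -> float:
    """Probability that two documents with Jaccard similarity s
    agree on all r rows of at least one of the b bands."""
    return 1.0 - (1.0 - s ** r) ** b


def approx_threshold(r: int, b: int) -> float:
    """Common approximation of the similarity at which the S-curve
    rises most steeply: (1 / b) ** (1 / r)."""
    return (1.0 / b) ** (1.0 / r)


if __name__ == "__main__":
    num_hashes = 128  # assumed signature length, i.e. r * b per document
    for r in (2, 4, 8, 16):
        b = num_hashes // r
        row = ", ".join(
            f"s={s:.1f}: {candidate_probability(s, r, b):.3f}"
            for s in (0.3, 0.5, 0.7, 0.9)
        )
        print(f"r={r:2d} b={b:2d} threshold~{approx_threshold(r, b):.2f} | {row}")
```

With a fixed signature length, raising r moves the threshold toward higher similarity (fewer false positives, more false negatives), while memory per document stays at r * b hash values. Measured speed and memory would still need to be benchmarked on real data.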

enhancement

good first issue

download allenai nllb mined bitext

Partially closes https://github.com/gordicaleksa/Open-NLLB/issues/11 (analysis still needs to be added). Allows downloading language pairs from the allenai nllb dataset (Hugging Face dataset): https://huggingface.co/datasets/allenai/nllb/tree/main. I've stored the NLLB_PAIRS (pairs released with the NLLB paper) and CCMATRIX_PAIRS...

Native language visualizations

Go through the files for your native language and see whether there are any issues. Check out the getting started document [here](https://github.com/gordicaleksa/Open-NLLB/blob/nllb_replication/GETTING_STARTED.md) for how to download public bi-text for your...

good first issue

question

Hydra pickle issue in generate_multi.py

Figure out the pickle issue mentioned here: https://github.com/facebookresearch/fairseq/issues/5315

# Conf file

[conf.zip](https://github.com/gordicaleksa/Open-NLLB/files/12575097/conf.zip)

# Current workaround:

```
@hydra.main(config_path="conf", config_name="generate_multi_full")
def main(config: DictConfig) -> None:
    launcher = hydra.utils.instantiate(config.launcher)
    module = GenerateMultiModule(config)
    asyncio.run(module.run())
    ...
```

bug

Choosing the LID model

# Available models

- Avoid lid.218 (original NLLB LID) -> it's CC-BY-NC
- https://github.com/laurieburchell/open-lid-dataset -> GPL License, 201 languages
- https://fasttext.cc/docs/en/language-identification.html -> lid.176, CC-BY-SA license, 176 languages

# Additional information...

question

LID model peak probabilities

Find the peak probability of the LID output for right-skewed (left mode) languages.

# Requirements

- LID model
- data

# Reference

- NLLB paper (https://arxiv.org/abs/2207.04672), pages 34-35
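One possible reading of "peak probability" is the mode of the histogram of LID confidence scores for a language. The sketch below follows that reading; it is an assumption, not the procedure from the NLLB paper. The function name `peak_probability` and the bin count are hypothetical, and the input is taken to be a plain list of per-sentence probabilities already produced by whichever LID model is chosen.

```python
from collections import Counter
from typing import Iterable


def peak_probability(probs: Iterable[float], num_bins: int = 50) -> float:
    """Return the centre of the most populated histogram bin over [0, 1].

    Ties are broken in favour of the lowest bin, which suits
    distributions whose mode sits on the left.
    """
    # Map each probability to a bin index; p == 1.0 falls in the last bin.
    counts = Counter(min(int(p * num_bins), num_bins - 1) for p in probs)
    if not counts:
        raise ValueError("probs must contain at least one value")
    best_bin, _ = max(counts.items(), key=lambda kv: (kv[1], -kv[0]))
    return (best_bin + 0.5) / num_bins


if __name__ == "__main__":
    # Toy scores for illustration only, not real LID output.
    scores = [0.11, 0.12, 0.13, 0.5, 0.9]
    print(peak_probability(scores, num_bins=10))
```

The result depends on the bin width, so it is worth checking several values of `num_bins` before settling on a per-language threshold.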

question