
No documentation on how to use nllb

Open Suhail opened this issue 1 year ago • 18 comments

📚 Documentation

Hey there,

I am trying to play with nllb but there isn't a basic code sample to try it.

I can download the checkpoint.pt but I am not sure what I would do afterwards.

I notice it's also not available on torch.hub

Can you provide a piece of example code to get a translation?

Suhail avatar Jul 06 '22 23:07 Suhail

hi, could you take a look at the generation command example here? https://github.com/facebookresearch/fairseq/tree/nllb/examples/nllb/modeling thanks!

huihuifan avatar Jul 07 '22 11:07 huihuifan

Hi, I took a look at the README and I see notes about training but I didn't see how one might do inference to get a translation.

In other situations, I've seen that you use torch.hub to load the model and then call translate(). Is there something similar?

Maybe I missed something?
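
(For reference, the torch.hub pattern referred to here is the one fairseq documents for its WMT19 models; NLLB itself is not published on torch.hub, so the sketch below only illustrates the kind of interface being asked about.)

import torch

# The torch.hub workflow fairseq documents for its WMT19 models; NLLB is not
# published this way, so this only shows the interface being asked about.
en2de = torch.hub.load(
    'pytorch/fairseq',
    'transformer.wmt19.en-de.single_model',
    tokenizer='moses',
    bpe='fastbpe',
)
en2de.eval()
print(en2de.translate('Hello world!'))  # e.g. 'Hallo Welt!'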

Suhail avatar Jul 07 '22 13:07 Suhail

Can you control-f "Generation/Evaluation" in the readme I linked?

We'll look into torch.hub :)

huihuifan avatar Jul 07 '22 13:07 huihuifan

It's probably not hard to find, but this link is broken in that section: https://github.com/facebookresearch/flores/flores200

At least for me, that header was a little difficult to grok. I think if it were called "Get a translation" or something similar, it would be more plainly worded.

Anyway, I'll try to follow the instructions!

Suhail avatar Jul 07 '22 13:07 Suhail

@Suhail Did you manage to get an example working?

nicholas-entis avatar Jul 08 '22 12:07 nicholas-entis

No - I decided to give up for now until someone makes something more accessible.

Suhail avatar Jul 08 '22 15:07 Suhail

I agree. There needs to be an easy way to try out the translation feature given the checkpoint. I would like to see something of this nature:

m = load_model(checkpoint_path)
m.translate("Hello World", 'en', 'de')
# Hallo Welt
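
(Fairseq does expose a from_pretrained hub interface that comes close to this shape; whether it works out of the box with the released NLLB checkpoint is not confirmed in this thread, so the following is a hypothetical sketch, with paths, file names, and preprocessing arguments as assumptions.)

from fairseq.models.transformer import TransformerModel

# Hypothetical sketch only: the directory layout, file names, and the extra
# arguments the NLLB checkpoint needs (SentencePiece model, language tokens,
# target-language selection) are assumptions, not a documented recipe.
nllb = TransformerModel.from_pretrained(
    '/path/to/nllb-model-dir',          # directory holding the checkpoint and dictionaries
    checkpoint_file='checkpoint.pt',
    data_name_or_path='/path/to/data_bin',
    bpe='sentencepiece',
    sentencepiece_model='/path/to/spm.model',
)
nllb.eval()
print(nllb.translate('Hello World'))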

314esther avatar Jul 08 '22 19:07 314esther

In addition, there are hard-coded paths to data files that aren't documented, for instance: "/data/nllb/nllb/flores200.en_xx_en.v4.4.256k/data_bin/shard000/dict.ace_Arab.txt". I and other users likely don't have these files (at least, I couldn't find them).

Please add more accessible documentation or a simple example. There will also be a learning curve for some in using Hydra and setting up the config files; additional support or pointers would be useful to users who aren't experienced with this setup.

314esther avatar Jul 08 '22 19:07 314esther

@314esther @Suhail Hi, you can check here for a convenient script to run model inference from the command line without having to deal with the config files.

pluiez avatar Jul 10 '22 03:07 pluiez

Thanks pluiez! This makes the inference capability much more usable for beginners. I know Hydra is a very capable tool, especially when training over many GPUs, but it's only familiar to a handful of people, and the config files can add a real barrier to entry. There was one modification I had to make to the fairseq code (nllb branch) in order to run the shell script: I needed to add a command-line interface for spm_encode in setup.py. I'm leaving an issue on the NLLB-inference repo with more details (https://github.com/pluiez/NLLB-inference/issues/1).
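
(The change being described is an entry_points addition in fairseq's setup.py along the lines of the sketch below; the module path for spm_encode is an assumption and differs between versions, so the linked NLLB-inference issue has the authoritative details.)

from setuptools import setup, find_packages

# Illustrative only: "scripts.spm_encode:main" is an assumed module path; adjust
# it to wherever spm_encode.py actually lives in the branch you are building.
setup(
    name='fairseq',
    packages=find_packages(),
    entry_points={
        'console_scripts': [
            'spm_encode = scripts.spm_encode:main',
        ],
    },
)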

314esther avatar Jul 11 '22 03:07 314esther

Hi, sorry, I didn't take this into consideration; I assumed these tools were all pre-installed. I will list the required setup steps for running the script.

pluiez avatar Jul 11 '22 05:07 pluiez

Thanks @pluiez for your repo. I've made a video based on it, giving code credit to you.

amrrs avatar Jul 11 '22 21:07 amrrs

@amrrs Thank you for sharing. Actually, I had hard-coded the language passed to normalize_punctuation.sh in translate.sh as zho_Hans. Although many languages share the English (en) normalization under the hood, Tamil uses the Hindi (hi) one. This has been fixed, and you might want to check out the latest version.

pluiez avatar Jul 12 '22 02:07 pluiez

Oh, my bad, I didn't notice. Thank you for sharing it @pluiez, I'll check out the code.

amrrs avatar Jul 12 '22 04:07 amrrs

Hi guys. I tested NLLB using huggingface transformers.

NOTE: You should install the latest dev version using the command below in order to use the NLLB tokenizer.

$ pip install git+https://github.com/huggingface/transformers.git

then... test it!

from transformers import AutoTokenizer, AutoModelForSeq2SeqLM, pipeline

# available models: 'facebook/nllb-200-distilled-600M', 'facebook/nllb-200-1.3B', 'facebook/nllb-200-distilled-1.3B', 'facebook/nllb-200-3.3B'
model_name = 'facebook/nllb-200-distilled-600M'

model = AutoModelForSeq2SeqLM.from_pretrained(model_name)
tokenizer = AutoTokenizer.from_pretrained(model_name)

source = 'eng_Latn' # English
target = 'kor_Hang' # Korean
translator = pipeline('translation', model=model, tokenizer=tokenizer, src_lang=source, tgt_lang=target)

text = 'Hi, nice to meet you'

output = translator(text, max_length=400)

translated_text = output[0]['translation_text']

print(translated_text) # '안녕하세요, 반가워요'

Language codes are described in FLORES-200.
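
(If you prefer to skip the pipeline wrapper, the same model can be driven with generate() directly, using forced_bos_token_id to select the target language; this follows the pattern in the transformers NLLB documentation, though tokenizer attribute names have shifted between transformers versions.)

from transformers import AutoModelForSeq2SeqLM, AutoTokenizer

model_name = 'facebook/nllb-200-distilled-600M'
tokenizer = AutoTokenizer.from_pretrained(model_name, src_lang='eng_Latn')
model = AutoModelForSeq2SeqLM.from_pretrained(model_name)

inputs = tokenizer('Hi, nice to meet you', return_tensors='pt')

# Force the decoder to start with the target-language token (kor_Hang = Korean).
generated = model.generate(
    **inputs,
    forced_bos_token_id=tokenizer.convert_tokens_to_ids('kor_Hang'),
    max_length=400,
)
print(tokenizer.batch_decode(generated, skip_special_tokens=True)[0])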

Update: I made a Hugging Face Space demo: https://huggingface.co/spaces/Geonmo/nllb-translation-demo

geonm avatar Jul 19 '22 04:07 geonm

Hi, I am trying to train an NLLB model, but I haven't found any documentation on how to obtain the data_bin or data_conf yet, and I don't know how to format the training dataset. Could you please share your training steps? Could you also give a more detailed explanation of https://github.com/facebookresearch/fairseq/tree/nllb/examples/nllb/modeling#filtering-and-preparing-the-data?

Python-37 avatar Sep 06 '22 08:09 Python-37

@Suhail in case it's still of any use I made a short tutorial on how to run this directly in fairseq: https://github.com/facebookresearch/fairseq/issues/5292

:)

gordicaleksa avatar Aug 21 '23 12:08 gordicaleksa

Hi @geonm, thank you for the available model list.

# available models: 'facebook/nllb-200-distilled-600M', 'facebook/nllb-200-1.3B', 'facebook/nllb-200-distilled-1.3B', 'facebook/nllb-200-3.3B'

Can you please tell me where you got this information? I mean the names of these models. I was looking to try different models but could not find this information.
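
(For reference, these are simply the NLLB-200 checkpoints published under the facebook organization on the Hugging Face Hub; the sketch below, assuming a recent huggingface_hub, lists them programmatically.)

from huggingface_hub import list_models

# List NLLB checkpoints published under the "facebook" organization on the Hub.
for m in list_models(author='facebook', search='nllb'):
    print(m.id)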

qaixerabbas avatar Mar 01 '24 13:03 qaixerabbas