fairseq
No documentation on how to use NLLB
Documentation
Hey there,
I am trying to play with NLLB, but there isn't a basic code sample to try it.
I can download the checkpoint.pt, but I am not sure what to do with it afterwards.
I notice it's also not available on torch.hub
Can you provide a piece of example code to get a translation?
hi, could you take a look at the generation command example here? https://github.com/facebookresearch/fairseq/tree/nllb/examples/nllb/modeling thanks!
Hi, I took a look at the README and I see notes about training, but I didn't see how one might do inference to get a translation.
In other situations I've seen that you load the model via torch.hub and call translate(). Is there something similar here?
Maybe I missed something?
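For context, the torch.hub workflow referred to above is the one fairseq documents for its earlier translation checkpoints such as WMT'19; a sketch of that pattern, which is what's being requested for NLLB (requires sacremoses and fastBPE installed; NLLB itself does not ship this way):

```python
# The torch.hub pattern fairseq supports for earlier models such as the
# WMT'19 en-de checkpoint; shown only to illustrate the requested workflow.
import torch

en2de = torch.hub.load(
    'pytorch/fairseq',
    'transformer.wmt19.en-de.single_model',
    tokenizer='moses',
    bpe='fastbpe',
)
print(en2de.translate('Hello world!'))  # 'Hallo Welt!'
```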
Can you control-f "Generation/Evaluation" in the readme I linked?
We'll look into torch.hub :)
It's probably not hard to find but this link is broken in that section: https://github.com/facebookresearch/flores/flores200
At least for me, that header was a little difficult to grok. If it were called "Get a translation" or something, it would be more plainly worded, I suppose?
Anyway, I'll try to follow the instructions!
@Suhail Did you manage to get an example working?
No - I decided to give up for now until someone makes something more accessible.
I agree. There needs to be an easy way to try out the translation feature given the checkpoint. I would like to see something of this nature:

```python
m = load_model(checkpoint_path)
m.translate("Hello World", 'en', 'de')
# => 'Hallo Welt'
```
In addition, there are hard-coded paths for data files that aren't documented. For instance: "/data/nllb/nllb/flores200.en_xx_en.v4.4.256k/data_bin/shard000/dict.ace_Arab.txt". Other users and I likely don't have these files (at least, I couldn't find them).
Please add some more accessible documentation or a simple example. There will also be a learning curve for some users with Hydra and setting up the config files; additional support or pointers would be useful to users who aren't experienced with this setup.
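To make the Hydra point concrete, this is the general shape of a Hydra-driven entry point (a generic sketch, not NLLB's actual training script; `conf/config.yaml` and the config keys are made-up examples):

```python
# Generic Hydra entry point: values come from conf/config.yaml and every one
# of them can be overridden from the command line, which is the pattern the
# NLLB training scripts follow.
import hydra
from omegaconf import DictConfig, OmegaConf

@hydra.main(config_path="conf", config_name="config")
def main(cfg: DictConfig) -> None:
    print(OmegaConf.to_yaml(cfg))

if __name__ == "__main__":
    main()
```

Run as e.g. `python train.py optimizer.lr=0.0005` to override any config value without editing the YAML files.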
@314esther @Suhail Hi, you can check https://github.com/pluiez/NLLB-inference for a convenient script to run model inference from the command line without having to deal with the config files.
Thanks @pluiez! This makes the inference capability far more approachable for beginners. I know Hydra is a very capable tool, especially when training over many GPUs, but it's only familiar to a handful of people, and the config files can really add a barrier to entry. There was one modification I had to make to the fairseq code (nllb branch) in order to run the shell script: I needed to add a command-line entry point for spm_encode in setup.py. I'm leaving an issue on the NLLB-inference repo with more details (https://github.com/pluiez/NLLB-inference/issues/1).
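For anyone hitting the same missing spm_encode entry point: the encoding can also be done directly with the sentencepiece Python package, avoiding the console script entirely. A minimal sketch; the model filename below is the SPM model distributed with NLLB/FLORES-200 and may differ in your setup:

```python
# Encode text with the NLLB SentencePiece model from Python instead of the
# spm_encode console script (pip install sentencepiece).
import sentencepiece as spm

sp = spm.SentencePieceProcessor(model_file='flores200_sacrebleu_tokenizer_spm.model')
pieces = sp.encode('Hello World', out_type=str)
print(' '.join(pieces))
```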
Hi, sorry I didn't take this into consideration. I'm assuming these tools are all pre-installed. I will list the required steps before running the script.
Thanks @pluiez for your repo. I've made a video based on it giving code credits to you.
@amrrs Thank you for sharing. Actually, I hard-coded the language passed to normalize_punctuation.sh in translate.sh as zho_Hans. Although many languages share English (en) normalization under the hood, Tamil uses Hindi (hi). This has been fixed, and you might want to check out the latest version.
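For illustration, the fix described above amounts to mapping each FLORES-200 code to the language argument that normalize_punctuation.sh expects. A hypothetical sketch (only the English default and the Tamil-to-Hindi case are taken from the comment above; the structure is illustrative, not the actual contents of translate.sh):

```python
# Hypothetical sketch of choosing the punctuation-normalization language
# for a FLORES-200 code.
OVERRIDES = {
    'tam_Taml': 'hi',  # Tamil uses Hindi normalization, per the fix above
}

def normalization_lang(flores_code: str) -> str:
    # Most languages fall back to English normalization under the hood.
    return OVERRIDES.get(flores_code, 'en')

print(normalization_lang('tam_Taml'))  # hi
print(normalization_lang('fra_Latn'))  # en
```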
Oh, my bad, I didn't notice. Thank you for sharing it @pluiez, I'll check out the code.
Hi guys. I tested NLLB using Hugging Face transformers.
NOTE: You should install the latest dev version using the command below in order to use the NLLB tokenizer.

```bash
$ pip install git+https://github.com/huggingface/transformers.git
```
then... test it!
```python
from transformers import AutoTokenizer, AutoModelForSeq2SeqLM, pipeline

# available models: 'facebook/nllb-200-distilled-600M', 'facebook/nllb-200-1.3B',
#                   'facebook/nllb-200-distilled-1.3B', 'facebook/nllb-200-3.3B'
model_name = 'facebook/nllb-200-distilled-600M'

model = AutoModelForSeq2SeqLM.from_pretrained(model_name)
tokenizer = AutoTokenizer.from_pretrained(model_name)

source = 'eng_Latn'  # English
target = 'kor_Hang'  # Korean

translator = pipeline('translation', model=model, tokenizer=tokenizer,
                      src_lang=source, tgt_lang=target)

text = 'Hi, nice to meet you'
output = translator(text, max_length=400)
translated_text = output[0]['translation_text']
print(translated_text)  # '안녕하세요, 반가워요'
```
The language codes are described in FLORES-200.
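If you'd rather not go through the pipeline helper, the same translation can be done by calling model.generate directly and forcing the first generated token to be the target-language code; a minimal sketch using the same model and tokenizer as above:

```python
# Translate without the pipeline helper: force the decoder to start with the
# target-language token.
from transformers import AutoModelForSeq2SeqLM, AutoTokenizer

model_name = 'facebook/nllb-200-distilled-600M'
model = AutoModelForSeq2SeqLM.from_pretrained(model_name)
tokenizer = AutoTokenizer.from_pretrained(model_name, src_lang='eng_Latn')

inputs = tokenizer('Hi, nice to meet you', return_tensors='pt')
generated = model.generate(
    **inputs,
    forced_bos_token_id=tokenizer.convert_tokens_to_ids('kor_Hang'),
    max_length=400,
)
print(tokenizer.batch_decode(generated, skip_special_tokens=True)[0])
```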
Update: I made the huggingface space demo: https://huggingface.co/spaces/Geonmo/nllb-translation-demo
Hi, I am trying to train an NLLB model, but I still haven't found any documentation on how to produce the data_bin or data_conf files, and I don't know how to format the training dataset. Could you please share your training steps? Could you also give a more detailed explanation of https://github.com/facebookresearch/fairseq/tree/nllb/examples/nllb/modeling#filtering-and-preparing-the-data?
@Suhail in case it's still of any use, I made a short tutorial on how to run this directly in fairseq: https://github.com/facebookresearch/fairseq/issues/5292 :)
Hi @geonm, thank you for the available model list:

> # available models: 'facebook/nllb-200-distilled-600M', 'facebook/nllb-200-1.3B', 'facebook/nllb-200-distilled-1.3B', 'facebook/nllb-200-3.3B'

Can you please tell me where you got this information, i.e. the names of these models? I was looking to try different models but could not find it documented anywhere.
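For what it's worth, these checkpoints are all published on the Hugging Face Hub under the facebook organization, so they can be discovered programmatically; a sketch using huggingface_hub (attribute names can vary slightly between library versions):

```python
# List public NLLB checkpoints on the Hugging Face Hub
# (pip install huggingface_hub).
from huggingface_hub import list_models

for m in list_models(author='facebook', search='nllb'):
    print(m.id)  # e.g. facebook/nllb-200-distilled-600M
```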