
AttributeError: 'NllbTokenizerFast' object has no attribute 'lang_code_to_id'

Open rajanish4 opened this issue 1 year ago • 7 comments

System Info

  • transformers version: 4.42.0.dev0
  • Platform: Windows-10-10.0.20348-SP0
  • Python version: 3.9.7
  • Huggingface_hub version: 0.23.3
  • Safetensors version: 0.4.3
  • Accelerate version: not installed
  • Accelerate config: not found
  • PyTorch version (GPU?): 1.13.0 (True)
  • Tensorflow version (GPU?): not installed (NA)
  • Flax version (CPU?/GPU?/TPU?): not installed (NA)
  • Jax version: not installed
  • JaxLib version: not installed
  • Using distributed or parallel set-up in script?:
  • Using GPU in script?:
  • GPU type: NVIDIA RTX A6000

Who can help?

@ArthurZucker

Information

  • [X] The official example scripts
  • [ ] My own modified scripts

Tasks

  • [ ] An officially supported task in the examples folder (such as GLUE/SQuAD, ...)
  • [ ] My own task or dataset (give details below)

Reproduction

from transformers import AutoModelForSeq2SeqLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("facebook/nllb-200-distilled-600M", src_lang="ron_Latn", token=token)
model = AutoModelForSeq2SeqLM.from_pretrained("facebook/nllb-200-distilled-600M", token=token)

article = "Şeful ONU spune că nu există o soluţie militară în Siria"
inputs = tokenizer(article, return_tensors="pt")
translated_tokens = model.generate(**inputs, forced_bos_token_id=tokenizer.lang_code_to_id["deu_Latn"], max_length=30)
tokenizer.batch_decode(translated_tokens, skip_special_tokens=True)[0]

Expected behavior

It should output translated text: UN-Chef sagt, es gibt keine militärische Lösung in Syrien

Complete error:

translated_tokens = model.generate(**inputs, forced_bos_token_id=tokenizer.lang_code_to_id["deu_Latn"], max_length=30)
AttributeError: 'NllbTokenizerFast' object has no attribute 'lang_code_to_id'

rajanish4 avatar Jun 10 '24 10:06 rajanish4

Yes, we had a deprecation cycle and this attribute was removed 😉

ArthurZucker avatar Jun 10 '24 10:06 ArthurZucker

Thanks, but then how can I provide the language code for translation?

rajanish4 avatar Jun 10 '24 12:06 rajanish4

You should simply do tokenizer.encode("deu_Latn")[0]

ArthurZucker avatar Jun 10 '24 13:06 ArthurZucker

Then why does the doc say otherwise? This is v4.42.0. I also don't understand how to use tokenizer.encode("deu_Latn")[0]. What is the keyword? Is it a positional argument? @ArthurZucker

buyukakyuz avatar Jul 01 '24 17:07 buyukakyuz

It seems there is an error: whatever language code I give to the NLLB tokenizer, it always outputs the English token id. My version is v4.42.3 @ArthurZucker:

[screenshot: the tokenizer returns the same (English) token id for every language code]

fe1ixxu avatar Jul 02 '24 18:07 fe1ixxu

I think tokenizer.encode("deu_Latn")[0] is the regular BOS token, and tokenizer.encode("deu_Latn")[1] is the expected token. @ArthurZucker

ShayekhBinIslam avatar Jul 02 '24 21:07 ShayekhBinIslam
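Editor's note: the indexing pitfall discussed above comes from the NLLB tokenizer prepending the source-language token to every encoded sequence, so index 0 is always the source language (English by default), not the code you passed in. The toy class below (a hypothetical stand-in with made-up ids, not the real transformers class) sketches this behavior and why a direct convert_tokens_to_ids lookup avoids it:

```python
# Hypothetical stand-in for the NLLB tokenizer; the ids are made up.
# encode() prepends the src_lang token, so index 0 is always the
# source-language id rather than the id of the code you encoded.
class ToyNllbTokenizer:
    def __init__(self, src_lang="eng_Latn"):
        self.vocab = {"eng_Latn": 10, "deu_Latn": 11, "</s>": 2}
        self.src_lang = src_lang

    def convert_tokens_to_ids(self, token):
        # Direct vocabulary lookup: no prefix tokens involved.
        return self.vocab[token]

    def encode(self, text):
        # Simplified sequence layout: [src_lang_id, *token_ids, eos_id]
        return [self.vocab[self.src_lang], self.vocab[text], self.vocab["</s>"]]

tok = ToyNllbTokenizer()
print(tok.encode("deu_Latn")[0])              # src_lang (English) id, not deu_Latn
print(tok.encode("deu_Latn")[1])              # the intended deu_Latn id
print(tok.convert_tokens_to_ids("deu_Latn"))  # same id, without indexing pitfalls
```

This matches both observations in the thread: encode(...)[0] is always the (English) source token, and encode(...)[1] is the code actually asked for.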

Yes! You should use convert_tokens_to_ids rather than encode, sorry 😉

ArthurZucker avatar Jul 10 '24 10:07 ArthurZucker

What worked for me is:

translated_tokens = model.generate(**inputs, forced_bos_token_id=tokenizer.convert_tokens_to_ids("deu_Latn"), max_length=30)

tnitn avatar Jul 12 '24 18:07 tnitn

yep this is what we expect!

ArthurZucker avatar Jul 13 '24 09:07 ArthurZucker
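Editor's note: one subtlety with convert_tokens_to_ids is that an unknown code fails silently by returning the unk id, which is exactly the kind of wrong-but-no-error behavior reported earlier in the thread. A small hypothetical helper (lang_id is an illustration, not a transformers API) can fail loudly instead; a dict-backed stand-in tokenizer is used here so the sketch is self-contained:

```python
# Hypothetical helper (not part of transformers): resolve a language code
# to its token id, raising instead of silently returning the <unk> id.
def lang_id(tokenizer, code):
    token_id = tokenizer.convert_tokens_to_ids(code)
    if token_id == tokenizer.unk_token_id:
        raise ValueError(f"unknown language code: {code!r}")
    return token_id

# Stand-in for demonstration; real usage would pass an NllbTokenizerFast.
class FakeTokenizer:
    unk_token_id = 3
    _vocab = {"deu_Latn": 11, "fra_Latn": 12}

    def convert_tokens_to_ids(self, token):
        return self._vocab.get(token, self.unk_token_id)

tok = FakeTokenizer()
print(lang_id(tok, "deu_Latn"))  # a valid id
# lang_id(tok, "de_Latn") would raise ValueError instead of returning <unk>
```

With a real tokenizer this would be used as model.generate(**inputs, forced_bos_token_id=lang_id(tokenizer, "deu_Latn"), max_length=30).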

It works for me:

FR_CODE = tokenizer.convert_tokens_to_ids("fr_Latn")
WO_CODE = tokenizer.convert_tokens_to_ids("wol_Latn")

LahadMbacke avatar Aug 02 '24 14:08 LahadMbacke

This issue has been automatically marked as stale because it has not had recent activity. If you think this still needs to be addressed please comment on this thread.

Please note that issues that do not follow the contributing guidelines are likely to be ignored.

github-actions[bot] avatar Aug 27 '24 08:08 github-actions[bot]

I'm getting this error using the nllb-serve binary built from the latest git version:

set 30 15:23:42 xeon1 nllb-serve[10089]:  * Running on http://10.0.0.6:6060
set 30 15:23:42 xeon1 nllb-serve[10089]: INFO:werkzeug:Press CTRL+C to quit
set 30 15:23:57 xeon1 nllb-serve[10089]: INFO:root:Loading tokenizer for facebook/nllb-200-distilled-600M; src_lang=eng_Latn ...
set 30 15:23:58 xeon1 nllb-serve[10089]: ERROR:nllb_serve.app:Exception on /translate [POST]
set 30 15:23:58 xeon1 nllb-serve[10089]: Traceback (most recent call last):
set 30 15:23:58 xeon1 nllb-serve[10089]:   File "/home/nllb-serve/nllb-serve/env/lib/python3.12/site-packages/flask/app.py", line 1463, in wsgi_app
set 30 15:23:58 xeon1 nllb-serve[10089]:     response = self.full_dispatch_request()
set 30 15:23:58 xeon1 nllb-serve[10089]:                ^^^^^^^^^^^^^^^^^^^^^^^^^^^^
set 30 15:23:58 xeon1 nllb-serve[10089]:   File "/home/nllb-serve/nllb-serve/env/lib/python3.12/site-packages/flask/app.py", line 872, in full_dispatch_request
set 30 15:23:58 xeon1 nllb-serve[10089]:     rv = self.handle_user_exception(e)
set 30 15:23:58 xeon1 nllb-serve[10089]:          ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
set 30 15:23:58 xeon1 nllb-serve[10089]:   File "/home/nllb-serve/nllb-serve/env/lib/python3.12/site-packages/flask/app.py", line 870, in full_dispatch_request
set 30 15:23:58 xeon1 nllb-serve[10089]:     rv = self.dispatch_request()
set 30 15:23:58 xeon1 nllb-serve[10089]:          ^^^^^^^^^^^^^^^^^^^^^^^
set 30 15:23:58 xeon1 nllb-serve[10089]:   File "/home/nllb-serve/nllb-serve/env/lib/python3.12/site-packages/flask/app.py", line 855, in dispatch_request
set 30 15:23:58 xeon1 nllb-serve[10089]:     return self.ensure_sync(self.view_functions[rule.endpoint])(**view_args)  # type: ignore[no-any-return]
set 30 15:23:58 xeon1 nllb-serve[10089]:            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
set 30 15:23:58 xeon1 nllb-serve[10089]:   File "/home/nllb-serve/nllb-serve/nllb_serve/app.py", line 145, in translate
set 30 15:23:58 xeon1 nllb-serve[10089]:     **inputs, forced_bos_token_id=tokenizer.lang_code_to_id[tgt_lang],
set 30 15:23:58 xeon1 nllb-serve[10089]:                                   ^^^^^^^^^^^^^^^^^^^^^^^^^
set 30 15:23:58 xeon1 nllb-serve[10089]: AttributeError: 'NllbTokenizerFast' object has no attribute 'lang_code_to_id'

brauliobo avatar Sep 30 '24 18:09 brauliobo

Downgrading transformers from 4.45.1 to 4.37.0 fixed the issue for me. Found it here: https://drsuneamer.tistory.com/250

brauliobo avatar Sep 30 '24 18:09 brauliobo

I need to edit a tokenizer to add a new language code. How can I do it with NllbTokenizer? The code I'm using looks like this:

old_len = len(tokenizer) - int(new_lang in tokenizer.added_tokens_encoder)
tokenizer.lang_code_to_id[new_lang] = old_len-1
tokenizer.id_to_lang_code[old_len-1] = new_lang

thomas-ferraz avatar Dec 01 '24 22:12 thomas-ferraz

I need to edit a tokenizer to add a new language code. How can I do it with NllbTokenizer? The code I'm using looks like this:

old_len = len(tokenizer) - int(new_lang in tokenizer.added_tokens_encoder)
tokenizer.lang_code_to_id[new_lang] = old_len-1
tokenizer.id_to_lang_code[old_len-1] = new_lang

Just to answer my own question here: this code worked for me (maybe it's not the best way; we could consider creating new functions for this).

from transformers.tokenization_utils import AddedToken
old_len = len(tokenizer) - int(new_lang in tokenizer.added_tokens_encoder)
tokenizer._added_tokens_encoder[new_lang] = old_len-1
tokenizer._added_tokens_decoder[old_len-1] = AddedToken(new_lang, normalized=False, special=True)

if new_lang not in tokenizer._additional_special_tokens:
    tokenizer._additional_special_tokens.append(new_lang)

thomas-ferraz avatar Dec 01 '24 23:12 thomas-ferraz

There is a function that does exactly that 😉: tokenizer.add_tokens(AddedToken(new_lang, normalized=False, special=True)).

ArthurZucker avatar Dec 23 '24 15:12 ArthurZucker
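Editor's note: the manual bookkeeping in the earlier snippets and the add_tokens one-liner both boil down to the same idea: the new language code is appended at the end of the vocabulary and gets the next free id. A toy model of that behavior (a hypothetical ToyVocab, not the transformers implementation):

```python
# Hypothetical toy model of adding a special token: the new language
# code is appended at the end of the vocabulary with the next free id,
# and adding an already-known token is a no-op.
class ToyVocab:
    def __init__(self, tokens):
        self.token_to_id = {t: i for i, t in enumerate(tokens)}

    def add_token(self, token):
        if token not in self.token_to_id:
            self.token_to_id[token] = len(self.token_to_id)
        return self.token_to_id[token]

vocab = ToyVocab(["eng_Latn", "deu_Latn"])
new_id = vocab.add_token("xxx_Latn")  # hypothetical new code, appended at the end
print(new_id)  # 2
print(vocab.add_token("xxx_Latn"))  # 2 again: adding twice does not grow the vocab
```

After the real add_tokens call, the new code resolves normally via tokenizer.convert_tokens_to_ids(new_lang), so the generate(..., forced_bos_token_id=...) pattern from earlier in the thread works unchanged.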