Bad performance for other language

Open • JuanFF opened this issue 3 years ago • 3 comments

Hello, I'm trying to use the contextual spell checker for Spanish. I run the script in https://github.com/R1j1t/contextualSpellCheck/blob/88bbbb46252c534679b185955fd88c239ed548a7/examples/ja_example.py with the following custom configuration:

import spacy
import contextualSpellCheck

nlp = spacy.load("es_dep_news_trf")

nlp.add_pipe(
	"contextual spellchecker",
	config={
		"model_name": "bert-base-multilingual-cased",
		"max_edit_dist": 2,
	},
)

doc = nlp("La economia a crecido un dos por ciento.")
print(doc._.performed_spellCheck)
print(doc._.outcome_spellCheck)

but I don't get the desired result.

"La economia a crecido un dos por ciento" should be corrected to "La economía ha crecido un dos por ciento". Instead, I get "La economia a crecido un dos por cento".

If I use another pre-trained model (e.g. "model_name": "PlanTL-GOB-ES/roberta-large-bne"), the result is still wrong: Laeconomiaacrecidoundosporciento. I wonder whether I'm using the right script to run the spell checker in another language.

JuanFF · Aug 16 '22 09:08

Hi @JuanFF, I have the following 2 observations:

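  1. contextualSpellCheck would be unable to change "a" to "ha": "a" is itself a valid word, and the package currently only corrects out-of-vocabulary (non-word) errors. Details here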

  2. The problem with "ciento" comes from the BERT model bert-base-multilingual-cased. If the user passes no vocabulary (vocab) file, the package uses the vocab of the BERT model, and "ciento" is not in it:

     ```
     >>> from transformers import AutoTokenizer
     >>> tokenizer = AutoTokenizer.from_pretrained('bert-base-multilingual-cased')
     >>> 'ciento' in tokenizer.get_vocab()
     False
     >>> doc._.suggestions_spellCheck
     {ciento: 'cento'}
     >>> # 'cento' is hundred in Portuguese (Brazil)
     ```
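
To make the two observations concrete, here is a minimal pure-Python sketch of the general idea: only out-of-vocabulary tokens are flagged, and a replacement is accepted only if it is within `max_edit_dist` edits. It is not the library's implementation (the real package takes its candidates from the BERT model's predictions for the masked token). The helper names `edit_distance` and `flag_and_suggest` and the toy vocabulary are hypothetical and chosen only for illustration.

```python
def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance via the classic dynamic-programming recurrence."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution
        prev = curr
    return prev[-1]


def flag_and_suggest(tokens, vocab, max_edit_dist=2):
    """Return {token: closest in-vocab word} for out-of-vocabulary tokens only."""
    suggestions = {}
    for tok in tokens:
        if tok.lower() in vocab:
            continue  # in-vocabulary words are never flagged, even if wrong in context
        scored = sorted((edit_distance(tok.lower(), w), w) for w in vocab)
        if scored and scored[0][0] <= max_edit_dist:
            suggestions[tok] = scored[0][1]
    return suggestions


# Hypothetical toy vocabulary: it contains "a" and "cento" but not "ciento".
toy_vocab = {"la", "economia", "a", "ha", "crecido", "un", "dos", "por", "cento"}
tokens = "La economia a crecido un dos por ciento".split()
result = flag_and_suggest(tokens, toy_vocab)
print(result)  # {'ciento': 'cento'}
```

In this sketch "a" is left alone because it is in the vocabulary, while "ciento" is replaced by its nearest in-vocabulary neighbour, "cento", at edit distance 1.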
    

If you don't want to change the BERT model, I would suggest passing a vocab file (example) separately, like this:


>>> vocab_path = "es_vocab.txt" 
>>> 
>>> nlp.add_pipe(
...     "contextual spellchecker",
...     config={
...             "model_name": "bert-base-multilingual-cased",
...             "max_edit_dist": 2,
...             "vocab_path": vocab_path
...     },
... )
testVocab.txt
inside vocab path
file opened!
Inside [unused....]
<contextualSpellCheck.contextualSpellCheck.ContextualSpellCheck object at 0x7fa607daee80>
>>> doc = nlp("La economia a crecido un dos por ciento.")
>>> print(doc._.performed_spellCheck)
True
>>> print(doc._.outcome_spellCheck)
La economia a crecido un dos por ciento.
>>> 
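
As a rough sketch of how such a vocab file could be prepared and sanity-checked before passing it as `vocab_path`: the snippet below assumes a plain UTF-8 text file with one token per line (the layout of a BERT-style vocab.txt). The word list and the temporary location are hypothetical; a real file would hold a full Spanish vocabulary.

```python
import tempfile
from pathlib import Path

# Hypothetical word list, for illustration only.
spanish_words = ["la", "economía", "ha", "crecido", "un", "dos", "por", "ciento"]

vocab_path = Path(tempfile.mkdtemp()) / "es_vocab.txt"
# Assumed format: one token per line, UTF-8 encoded.
vocab_path.write_text("\n".join(spanish_words) + "\n", encoding="utf-8")

# Read the file back line by line and confirm the words of interest are covered.
with vocab_path.open(encoding="utf-8") as f:
    loaded = {line.strip() for line in f if line.strip()}

missing = [w for w in ("ciento", "economía") if w not in loaded]
print(missing)  # an empty list means both words are covered
```

Running a check like this first makes it easier to tell whether a surprising correction comes from a gap in the vocabulary or from the model itself.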

R1j1t · Aug 17 '22 18:08

I have a pending issue https://github.com/R1j1t/contextualSpellCheck/issues/44 on a similar topic, but lately, I have been pretty occupied. If you think you can contribute, please open a PR! The project would be glad to have your contribution!

R1j1t · Aug 17 '22 18:08

This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs.

stale[bot] · Sep 20 '22 20:09