contextualSpellCheck
Bad performance for other language
Hello, I'm trying to use the contextual spell checker for Spanish. I ran the script at https://github.com/R1j1t/contextualSpellCheck/blob/88bbbb46252c534679b185955fd88c239ed548a7/examples/ja_example.py with the following custom configuration:
import spacy
import contextualSpellCheck
nlp = spacy.load("es_dep_news_trf")
nlp.add_pipe(
"contextual spellchecker",
config={
"model_name": "bert-base-multilingual-cased",
"max_edit_dist": 2,
},
)
doc = nlp("La economia a crecido un dos por ciento.")
print(doc._.performed_spellCheck)
print(doc._.outcome_spellCheck)
but I don't get the desired result.
"La economia a crecido un dos por ciento" should be corrected to "La economía ha crecido un dos por ciento".
Instead, I get "La economia a crecido un dos por cento".
If I use another pre-trained model (e.g. "model_name": "PlanTL-GOB-ES/roberta-large-bne"), the result is still wrong:
Laeconomiaacrecidoundosporciento. ??
I wonder if I'm using the proper script to run the spellchecker in another language.
Hi @JuanFF, I have the following two observations:
- contextualSpellCheck would be unable to change "a" to "ha". Details here
- The problem with "ciento" comes from the BERT model bert-base-multilingual-cased. If the user passes no vocabulary (vocab) file, the spell checker uses the vocab of the BERT model, and "ciento" is not in it:
```
>>> from transformers import AutoTokenizer
>>> tokenizer = AutoTokenizer.from_pretrained('bert-base-multilingual-cased')
>>> 'ciento' in tokenizer.get_vocab()
False
>>> doc._.suggestions_spellCheck
{ciento: 'cento'}
>>> # 'cento' is hundred in Portuguese (Brazil)
```
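To see in advance which words of a sentence might be affected, the same membership test can be applied to every word. The snippet below is only a rough sketch: `missing_from_vocab` is a hypothetical helper (not part of contextualSpellCheck), and `toy_vocab` is a made-up stand-in so that the example runs without downloading a model.

```python
import re


def missing_from_vocab(sentence, vocab):
    """Return the words of `sentence` that are not in `vocab`, in order.

    Hypothetical helper for illustration; it is not part of contextualSpellCheck.
    """
    # \w+ matches runs of letters/digits, including accented characters.
    words = re.findall(r"\w+", sentence)
    return [word for word in words if word not in vocab]


# Toy vocabulary for illustration only (made-up contents).
toy_vocab = {"La", "a", "un", "dos", "por", "cento"}

print(missing_from_vocab("La economia a crecido un dos por ciento.", toy_vocab))
```

With transformers installed, one could pass `set(tokenizer.get_vocab())` as `vocab` instead of the toy set. Note that this checks whole words only, so it says nothing about how the tokenizer would split them into subword pieces.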
If you don't want to change the BERT model, I would suggest passing the vocab file (example) separately, like:
>>> vocab_path = "es_vocab.txt"
>>>
>>> nlp.add_pipe(
... "contextual spellchecker",
... config={
... "model_name": "bert-base-multilingual-cased",
... "max_edit_dist": 2,
... "vocab_path": vocab_path
... },
... )
testVocab.txt
inside vocab path
file opened!
Inside [unused....]
<contextualSpellCheck.contextualSpellCheck.ContextualSpellCheck object at 0x7fa607daee80>
>>> doc = nlp("La economia a crecido un dos por ciento.")
>>> print(doc._.performed_spellCheck)
True
>>> print(doc._.outcome_spellCheck)
La economia a crecido un dos por ciento.
>>>
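For reference, a vocab file such as `es_vocab.txt` could be produced from a word list along these lines. This is a minimal sketch under one assumption that should be checked against the linked example file: that the vocab file is plain UTF-8 text with one token per line. `write_vocab` is a hypothetical helper, and the word list is a placeholder rather than a real Spanish vocabulary.

```python
from pathlib import Path


def write_vocab(words, path):
    """Write unique, non-empty words to `path`, one per line, as UTF-8.

    Hypothetical helper; the one-token-per-line layout is an assumption.
    Returns the number of words written.
    """
    # dict.fromkeys drops duplicates while keeping first-seen order.
    unique = list(dict.fromkeys(w.strip() for w in words if w.strip()))
    Path(path).write_text("\n".join(unique) + "\n", encoding="utf-8")
    return len(unique)


# Placeholder word list; a real vocabulary would come from a Spanish corpus or dictionary.
count = write_vocab(["economía", "ha", "crecido", "ciento", "ciento"], "es_vocab.txt")
print(count)
```

The resulting path can then be passed as `"vocab_path"` in the config, as in the session above.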
I have a pending issue on a similar topic, https://github.com/R1j1t/contextualSpellCheck/issues/44, but lately I have been pretty busy. If you think you can contribute, please open a PR! The project would be glad to have your contribution!
This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs.