aleph
BUG: Names not extracted as mentions
Describe the bug Names/entities are not detected as mentions in the HTML documents I am uploading. Running spaCy locally also fails to detect the names in these documents. I parse the HTML the same way I see it parsed in html.py in the ingest code on GitHub, and I use the es_core_news_sm model. The documents are in Spanish and are not structured in full sentences. The flair library with the ner-spanish-large model does extract these names.
To Reproduce I have created a fake document containing the name Inés Santamaría, which spaCy does not detect. Apologies that we can't provide our real data, as that would surely be more helpful.
I saved it as a .txt file because I couldn't upload HTML here: fake_form.txt
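For anyone who wants to run the same comparison locally, here is a minimal sketch. It makes several assumptions: the HTML snippet is a hypothetical stand-in and not the contents of fake_form.txt, the text extraction uses Python's standard html.parser as a rough approximation and not ingest's html.py, and "flair/ner-spanish-large" is assumed to be the model identifier for the ner-spanish-large model mentioned above. The helper names (html_to_text, spacy_entities, flair_entities) are made up for illustration.

```python
from html.parser import HTMLParser

# Hypothetical stand-in for a form-style document (NOT the real fake_form.txt).
SAMPLE_HTML = """
<html><head><style>td { padding: 2px; }</style></head>
<body><table>
<tr><td>Nombre</td><td>Inés Santamaría</td></tr>
<tr><td>Cargo</td><td>Directora</td></tr>
</table></body></html>
"""


class TextExtractor(HTMLParser):
    """Collect visible text fragments, skipping <script> and <style> contents."""

    SKIP = {"script", "style"}

    def __init__(self):
        super().__init__()
        self.parts = []
        self._skip_depth = 0

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIP:
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in self.SKIP and self._skip_depth:
            self._skip_depth -= 1

    def handle_data(self, data):
        # Keep only non-empty text that is outside skipped elements.
        if not self._skip_depth and data.strip():
            self.parts.append(data.strip())


def html_to_text(html):
    """Return the visible text, one fragment per line (assumed approximation)."""
    parser = TextExtractor()
    parser.feed(html)
    parser.close()
    return "\n".join(parser.parts)


def spacy_entities(text, model="es_core_news_sm"):
    """Run spaCy NER; requires spaCy and the model to be installed."""
    import spacy  # imported lazily so the extraction step works without it

    nlp = spacy.load(model)
    return [(ent.text, ent.label_) for ent in nlp(text).ents]


def flair_entities(text, model="flair/ner-spanish-large"):
    """Run flair NER; the model identifier is an assumption."""
    from flair.data import Sentence
    from flair.models import SequenceTagger

    tagger = SequenceTagger.load(model)
    sentence = Sentence(text)
    tagger.predict(sentence)
    return [(s.text, s.get_label("ner").value) for s in sentence.get_spans("ner")]


text = html_to_text(SAMPLE_HTML)
print(text)
```

Calling spacy_entities(text) and flair_entities(text) on the extracted text should make it possible to compare the two outputs side by side. Note that the extracted text is a list of short fragments and not full sentences, which matches the kind of input described above.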
Expected behavior Inés Santamaría is extracted as a "PER" entity by spaCy, as it is by flair.
Aleph version 3.17.0
Additional context Aleph consistently fails to detect names in every document we upload. The same names are usually also missed when running spaCy locally.