SkillNER
IndexError: list index out of range
Some strings make the annotate function crash:
import spacy
from spacy.matcher import PhraseMatcher
# load default skills data base
from skillNer.general_params import SKILL_DB
# import skill extractor
from skillNer.skill_extractor_class import SkillExtractor
# init params of skill extractor
nlp = spacy.load("en_core_web_lg")
# init skill extractor
skill_extractor = SkillExtractor(nlp, SKILL_DB, PhraseMatcher)
skill_extractor.annotate("Learn how to become a professional wedding makeup artist")
If you run the code above you should get the following error:
---------------------------------------------------------------------------
IndexError Traceback (most recent call last)
Cell In[69], line 1
----> 1 skill_extractor.annotate("Learn how to become a professional wedding makeup artist")
File ~/anaconda3/envs/skillner/lib/python3.9/site-packages/skillNer/skill_extractor_class.py:129, in SkillExtractor.annotate(self, text, tresh)
123 skills_abv, text_obj = self.skill_getters.get_abv_match_skills(
124 text_obj, self.matchers['abv_matcher'])
126 skills_uni_full, text_obj = self.skill_getters.get_full_uni_match_skills(
127 text_obj, self.matchers['full_uni_matcher'])
--> 129 skills_low_form, text_obj = self.skill_getters.get_low_match_skills(
130 text_obj, self.matchers['low_form_matcher'])
132 skills_on_token = self.skill_getters.get_token_match_skills(
133 text_obj, self.matchers['token_matcher'])
134 full_sk = skills_full + skills_abv
File ~/anaconda3/envs/skillner/lib/python3.9/site-packages/skillNer/matcher_class.py:332, in SkillsGetter.get_low_match_skills(self, text_obj, matcher)
329 for match_id, start, end in matcher(doc):
330 id_ = matcher.vocab.strings[match_id]
--> 332 if text_obj[start].is_matchable:
333 skills.append({'skill_id': id_+'_lowSurf',
334 'doc_node_value': str(doc[start:end]),
335 'doc_node_id': list(range(start, end)),
336 'type': 'lw_surf'})
338 return skills, text_obj
File ~/anaconda3/envs/skillner/lib/python3.9/site-packages/skillNer/text_class.py:304, in Text.__getitem__(self, index)
277 def __getitem__(
278 self,
279 index: int
280 ) -> Word:
281 """To get the word at the specified position by index
282
283 Parameters
(...)
302 english
303 """
--> 304 return self.list_words[index]
IndexError: list index out of range
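Reading the traceback: the failing call is `text_obj[start]`, where `start` comes from `matcher(doc)` and is used to index `self.list_words`. The error therefore means `start` is at least as large as the number of words held by the `Text` object. The sketch below reproduces that situation with plain Python lists and shows a simple bounds guard. `in_range_matches` and the sample data are hypothetical stand-ins, not SkillNER code, and this thread does not establish why the two lengths differ.

```python
from typing import List, Tuple

# (match_id, start, end): the tuple shape returned by a spaCy matcher
Match = Tuple[int, int, int]


def in_range_matches(matches: List[Match], n_words: int) -> List[Match]:
    """Keep only matches whose start index is a valid position among n_words items."""
    return [m for m in matches if 0 <= m[1] < n_words]


# Hypothetical data: three words, but one match starts at index 3.
words = ["wedding", "makeup", "artist"]
matches = [(101, 0, 2), (102, 3, 4)]

kept = in_range_matches(matches, len(words))
print(kept)  # [(101, 0, 2)]
```

A guard like this would only hide the out-of-range match; it says nothing about whether the dropped match was a real skill.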
I'm running into the same problem. Is there a way to sanitize the string to avoid it?
Seems to be a problem with some Unicode characters. Encoding to ASCII and then decoding back to UTF-8 works.
import unicodedata
...
text = "My Random Character text"
# Decompose characters (NFKD), drop everything that is not ASCII, then decode back to str
text = unicodedata.normalize("NFKD", text).encode("ascii", "ignore").decode("utf-8")
annotations = skill_extractor.annotate(text)
I am still running into the same issue with the encoding/decoding workaround:
import spacy
from spacy.matcher import PhraseMatcher
import unicodedata
# load default skills data base
from skillNer.general_params import SKILL_DB
# import skill extractor
from skillNer.skill_extractor_class import SkillExtractor
# init params of skill extractor
nlp = spacy.load("en_core_web_lg")
# init skill extractor
skill_extractor = SkillExtractor(nlp, SKILL_DB, PhraseMatcher)
text = "Learn how to become a professional wedding makeup artist"
text = unicodedata.normalize('NFKD', text ).encode('ascii', 'ignore').decode("utf-8")
annotations = skill_extractor.annotate(text )
I still get the same error:
---------------------------------------------------------------------------
IndexError Traceback (most recent call last)
Cell In[2], line 4
2 text = "Learn how to become a professional wedding makeup artist"
3 text = unicodedata.normalize('NFKD', text ).encode('ascii', 'ignore').decode("utf-8")
----> 4 annotations = skill_extractor.annotate(text )
File ~/anaconda3/envs/skillner/lib/python3.9/site-packages/skillNer/skill_extractor_class.py:129, in SkillExtractor.annotate(self, text, tresh)
123 skills_abv, text_obj = self.skill_getters.get_abv_match_skills(
124 text_obj, self.matchers['abv_matcher'])
126 skills_uni_full, text_obj = self.skill_getters.get_full_uni_match_skills(
127 text_obj, self.matchers['full_uni_matcher'])
--> 129 skills_low_form, text_obj = self.skill_getters.get_low_match_skills(
130 text_obj, self.matchers['low_form_matcher'])
132 skills_on_token = self.skill_getters.get_token_match_skills(
133 text_obj, self.matchers['token_matcher'])
134 full_sk = skills_full + skills_abv
File ~/anaconda3/envs/skillner/lib/python3.9/site-packages/skillNer/matcher_class.py:332, in SkillsGetter.get_low_match_skills(self, text_obj, matcher)
329 for match_id, start, end in matcher(doc):
330 id_ = matcher.vocab.strings[match_id]
--> 332 if text_obj[start].is_matchable:
333 skills.append({'skill_id': id_+'_lowSurf',
334 'doc_node_value': str(doc[start:end]),
335 'doc_node_id': list(range(start, end)),
336 'type': 'lw_surf'})
338 return skills, text_obj
File ~/anaconda3/envs/skillner/lib/python3.9/site-packages/skillNer/text_class.py:304, in Text.__getitem__(self, index)
277 def __getitem__(
278 self,
279 index: int
280 ) -> Word:
281 """To get the word at the specified position by index
282
283 Parameters
(...)
302 english
303 """
--> 304 return self.list_words[index]
IndexError: list index out of range
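One likely reason the workaround changes nothing here: the example sentence already consists only of ASCII characters, so the normalize/encode/decode round trip returns it unchanged. A quick check using only the standard library (no SkillNER needed):

```python
import unicodedata

text = "Learn how to become a professional wedding makeup artist"
cleaned = unicodedata.normalize("NFKD", text).encode("ascii", "ignore").decode("utf-8")

print(text.isascii())   # True: there are no non-ASCII characters to strip
print(cleaned == text)  # True: the workaround leaves this string unchanged
```

So for this input the crash cannot be attributed to Unicode characters, even if the workaround helps with other strings.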
Facing this issue as well. Did you ever find a solution, @Jibril-Frej?
No real fix. I just wrap the call in a try/except.
try:
    skill_extractor.annotate(target_text)
except IndexError:
    pass
except ValueError:
    pass
I am also encountering this error. I would really like to use SkillNER but this issue is really preventing me from being able to do so.