malaya
list index out of range while normalizing
Upon running the following in a notebook:
import malaya

corrector = malaya.spell.probability()
normalizer = malaya.normalize.normalizer(corrector, date=False, time=False)
normalizer.normalize('Gambar ni membantu. Gambar tutorial >>. facebook. com/story. story_fbid=10206183032200965&id=1418962070', normalize_entity=False, normalize_url=True)
I get this error:
IndexError: list index out of range
---------------------------------------------------------------------------
IndexError Traceback (most recent call last)
/tmp/ipykernel_1818/46719394.py in <module>
----> 1 normalizer.normalize('Gambar ni membantu. Gambar tutorial >>. facebook. com/story. story_fbid=10206183032200965&id=1418962070',normalize_entity=False, normalize_url = False)
~/Documents/virtualenvs/mPunct/lib/python3.9/site-packages/herpetologist/__init__.py in check(*args, **kwargs)
98 nested_check(v, p)
99
--> 100 return func(*args, **kwargs)
101
102 return check
~/Documents/virtualenvs/mPunct/lib/python3.9/site-packages/malaya/normalize.py in normalize(self, string, check_english, normalize_text, normalize_entity, normalize_url, normalize_email, normalize_year, normalize_telephone)
473 splitted = word.split('-')
474 left = put_spacing_num(splitted[0])
--> 475 right = put_spacing_num(splitted[1])
476 word = f'{left}, {right}'
477 result.append(word)
IndexError: list index out of range
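Judging from the traceback, the crash comes from the unconditional `splitted[1]` access after `word.split('-')`: a token with no hyphen yields a single-element list. A minimal defensive sketch of a fix (the `put_spacing_num` stand-in below is hypothetical, not malaya's actual helper) could guard the split like this:

```python
def put_spacing_num(s):
    # Hypothetical stand-in for malaya's helper: space out the characters.
    return ' '.join(s)

def normalize_range(word):
    # Guard against tokens without a '-'. The original code assumed
    # split('-') always yields two parts, which raises IndexError for
    # tokens like '10206183032200965' that contain no hyphen.
    splitted = word.split('-')
    if len(splitted) == 2:
        left = put_spacing_num(splitted[0])
        right = put_spacing_num(splitted[1])
        return f'{left}, {right}'
    return word

print(normalize_range('1-2'))    # '1, 2'
print(normalize_range('12345'))  # unchanged, no crash
```

Falling back to the original token when the split does not produce exactly two parts matches the "left as it is" expected behavior described above.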
Expected behavior: either the URL is normalized, or it is left as it is due to the bad syntax.
PS: This data is from the IIUM-Confessions dump.
Want to try to fix it? I might be a bit late to fix it myself; busy with other stuff.
I could try! Although I might take some time to get around to it too.
Will keep you posted if I fix this.
Thanks! No pressure, take your time.
It took far too long to address, but it got on my nerves enough to try today! ^_^
If you look at the tokenizer,
import malaya
corrector = malaya.spell.probability()
normalizer = malaya.normalize.normalizer(corrector, date=False,time=False)
s = 'Gambar ni membantu. Gambar tutorial >>. facebook. com/story. story_fbid=10206183032200965&id=1418962070'
normalizer._tokenizer(s)
the output is:
['Gambar',
'ni',
'membantu',
'.',
'Gambar',
'tutorial',
'>',
'>',
'.',
'facebook',
'.',
'com',
'/',
'story',
'.',
'story_fbid',
'=',
'10206183032200965',
'&',
'id',
'=',
'1418962070']
Checking for the URL at this point is a very hard task; let me try to play around with the regex, but I can't confirm whether it will solve it or not.
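To illustrate why the tokenizer breaks the URL apart: a typical URL regex does not allow whitespace inside a match, so `facebook. com/story. story_fbid=...` can never match as a single URL, and the pieces fall through to generic punctuation splitting. A rough sketch (the pattern below is an illustration, not malaya's actual regex):

```python
import re

# Illustrative URL pattern: either an explicit scheme, or a bare
# domain followed by a path. \S+ forbids whitespace inside a match,
# so 'facebook. com/...' is cut at the first space and never matches.
url_pattern = re.compile(r'\bhttps?://\S+|\b\w+\.\w{2,}/\S+')

broken = 'facebook. com/story. story_fbid=10206183032200965&id=1418962070'
clean = 'facebook.com/story?story_fbid=10206183032200965&id=1418962070'

print(url_pattern.findall(broken))  # no full-URL match across the spaces
print(url_pattern.findall(clean))   # the intact URL matches as one token
```

So a regex-only fix would have to tolerate stray spaces around the dots, which risks gluing unrelated sentences together; that is what makes this input hard.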