malaya list index out of range while normalizing

Open AetherPrior opened this issue 2 years ago • 5 comments

Upon running the following in a notebook:

import malaya

corrector = malaya.spell.probability()
normalizer = malaya.normalize.normalizer(corrector, date=False, time=False)
normalizer.normalize('Gambar ni membantu. Gambar tutorial >>. facebook. com/story. story_fbid=10206183032200965&id=1418962070', normalize_entity=False, normalize_url=True)

I get this error:

IndexError: list index out of range
---------------------------------------------------------------------------
IndexError                                Traceback (most recent call last)
/tmp/ipykernel_1818/46719394.py in <module>
----> 1 normalizer.normalize('Gambar ni membantu. Gambar tutorial >>. facebook. com/story. story_fbid=10206183032200965&id=1418962070',normalize_entity=False, normalize_url = False)

~/Documents/virtualenvs/mPunct/lib/python3.9/site-packages/herpetologist/__init__.py in check(*args, **kwargs)
     98                 nested_check(v, p)
     99 
--> 100         return func(*args, **kwargs)
    101 
    102     return check

~/Documents/virtualenvs/mPunct/lib/python3.9/site-packages/malaya/normalize.py in normalize(self, string, check_english, normalize_text, normalize_entity, normalize_url, normalize_email, normalize_year, normalize_telephone)
    473                     splitted = word.split('-')
    474                     left = put_spacing_num(splitted[0])
--> 475                     right = put_spacing_num(splitted[1])
    476                     word = f'{left}, {right}'
    477                 result.append(word)

IndexError: list index out of range

Expected behavior: either the URL is normalized, or it is left as-is because of its bad syntax.
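
The traceback points at an unconditional `splitted[1]` access after `word.split('-')`, which fails whenever the word contains no hyphen. Below is a minimal sketch of a length check that would leave such a token as-is. `split_range` is a hypothetical name, and `put_spacing_num` here is only a stand-in for malaya's internal helper, whose real behavior is not shown in the traceback.

```python
def put_spacing_num(text: str) -> str:
    # Hypothetical stand-in for malaya's internal helper of the same name.
    # The real implementation is not visible in the traceback.
    return text.strip()


def split_range(word: str) -> str:
    # Mirrors lines 473-476 of the traceback, with a guard added.
    splitted = word.split('-')
    if len(splitted) != 2:
        # 'abc'.split('-') returns ['abc'], so splitted[1] would raise
        # IndexError. Leave the token untouched instead.
        return word
    left = put_spacing_num(splitted[0])
    right = put_spacing_num(splitted[1])
    return f'{left}, {right}'


print(split_range('10-20'))
print(split_range('1418962070'))
```

This only avoids the crash; it does not decide whether the token should have reached the hyphen branch in the first place.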

PS: This data is from the IIUM-Confessions dump.

Oct 09 '21 07:10 AetherPrior

Want to try to fix it? I might be a bit late getting to it, I'm busy with other stuff.

Oct 11 '21 11:10 huseinzol05

I could try! Although I might take some time to get around it too.
Will keep you posted if I fix this.

Oct 11 '21 12:10 AetherPrior

Thanks! No pressure, take your time.

Oct 17 '21 02:10 huseinzol05

It took far too long to address, but it got on my nerves enough to try today! ^_^

Nov 20 '21 16:11 AetherPrior

If you look at the tokenizer output:

import malaya
corrector = malaya.spell.probability()
normalizer = malaya.normalize.normalizer(corrector, date=False,time=False)
s = 'Gambar ni membantu. Gambar tutorial >>. facebook. com/story. story_fbid=10206183032200965&id=1418962070'
normalizer._tokenizer(s)

Output:

['Gambar',
 'ni',
 'membantu',
 '.',
 'Gambar',
 'tutorial',
 '>',
 '>',
 '.',
 'facebook',
 '.',
 'com',
 '/',
 'story',
 '.',
 'story_fbid',
 '=',
 '10206183032200965',
 '&',
 'id',
 '=',
 '1418962070']

Checking the URL here is a very hard task. Let me try playing around with the regex, but I can't confirm whether that will solve it.
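
One possible direction for the regex, as a rough sketch only: rejoin a host label and a known TLD that were separated by ". " before the text reaches the tokenizer. The function name `rejoin_domains` and the short TLD list are assumptions for illustration, not part of malaya, and the pattern can produce false positives on ordinary sentences.

```python
import re

# Assumption: a tiny fixed TLD list; a real fix would need a broader one.
BROKEN_DOMAIN = re.compile(r'\b([a-z0-9-]+)\.\s(com|org|net)\b')


def rejoin_domains(text: str) -> str:
    # Collapse "facebook. com" into "facebook.com"; everything else is kept.
    return BROKEN_DOMAIN.sub(r'\1.\2', text)


s = ('Gambar ni membantu. Gambar tutorial >>. facebook. com/story. '
     'story_fbid=10206183032200965&id=1418962070')
print(rejoin_domains(s))
```

This repairs only the domain part. The break between "story." and "story_fbid" is still there, so the URL would remain split across tokens.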

Nov 30 '21 07:11 huseinzol05
