
list index out of range while normalizing

Open AetherPrior opened this issue 2 years ago • 5 comments

Upon running the following in a notebook:

import malaya

corrector = malaya.spell.probability()
normalizer = malaya.normalize.normalizer(corrector, date=False, time=False)
normalizer.normalize('Gambar ni membantu. Gambar tutorial >>. facebook. com/story. story_fbid=10206183032200965&id=1418962070', normalize_entity=False, normalize_url=True)

I get this error:

IndexError: list index out of range
---------------------------------------------------------------------------
IndexError                                Traceback (most recent call last)
/tmp/ipykernel_1818/46719394.py in <module>
----> 1 normalizer.normalize('Gambar ni membantu. Gambar tutorial >>. facebook. com/story. story_fbid=10206183032200965&id=1418962070',normalize_entity=False, normalize_url = False)

~/Documents/virtualenvs/mPunct/lib/python3.9/site-packages/herpetologist/__init__.py in check(*args, **kwargs)
     98                 nested_check(v, p)
     99 
--> 100         return func(*args, **kwargs)
    101 
    102     return check

~/Documents/virtualenvs/mPunct/lib/python3.9/site-packages/malaya/normalize.py in normalize(self, string, check_english, normalize_text, normalize_entity, normalize_url, normalize_email, normalize_year, normalize_telephone)
    473                     splitted = word.split('-')
    474                     left = put_spacing_num(splitted[0])
--> 475                     right = put_spacing_num(splitted[1])
    476                     word = f'{left}, {right}'
    477                 result.append(word)

IndexError: list index out of range

Expected behavior: either the URL is normalized, or it is left as is because of its malformed syntax.
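For what it's worth, the traceback suggests the failure mode is simply that `word.split('-')` returns a one-element list when the word contains no hyphen, so `splitted[1]` raises. Below is a minimal sketch of a guard, not the library's actual fix. `split_range` is a hypothetical wrapper name, and `put_spacing_num` here is a stand-in I made up for illustration, since the real helper's body is not shown in this issue.

```python
def put_spacing_num(string: str) -> str:
    # Hypothetical stand-in for malaya's internal helper of the same name;
    # the real implementation is not shown in this issue.
    return ' '.join(string)


def split_range(word: str) -> str:
    # Mirrors the three lines from the traceback, plus a length check.
    splitted = word.split('-')
    if len(splitted) < 2:
        # No hyphen in the word: leave it untouched instead of indexing [1].
        return word
    left = put_spacing_num(splitted[0])
    right = put_spacing_num(splitted[1])
    return f'{left}, {right}'
```

With this guard, a token such as `'1418962070'` passes through unchanged, while a hyphenated token still takes the original path.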

PS: This data is from the IIUM-Confessions dump.

AetherPrior avatar Oct 09 '21 07:10 AetherPrior

Want to try fixing it? I might be a bit late getting to it myself, I'm busy with other stuff.

huseinzol05 avatar Oct 11 '21 11:10 huseinzol05

I could try! Although I might take some time to get around it too.
Will keep you posted if I fix this.

AetherPrior avatar Oct 11 '21 12:10 AetherPrior

Thanks! No pressure, take your time.

huseinzol05 avatar Oct 17 '21 02:10 huseinzol05

It took far too long to address, but it got on my nerves enough to try today! ^_^

AetherPrior avatar Nov 20 '21 16:11 AetherPrior

If you look at what the tokenizer produces:

import malaya
corrector = malaya.spell.probability()
normalizer = malaya.normalize.normalizer(corrector, date=False,time=False)
s = 'Gambar ni membantu. Gambar tutorial >>. facebook. com/story. story_fbid=10206183032200965&id=1418962070'
normalizer._tokenizer(s)

Output:

['Gambar',
 'ni',
 'membantu',
 '.',
 'Gambar',
 'tutorial',
 '>',
 '>',
 '.',
 'facebook',
 '.',
 'com',
 '/',
 'story',
 '.',
 'story_fbid',
 '=',
 '10206183032200965',
 '&',
 'id',
 '=',
 '1418962070']

Checking the URL is a very hard task. Let me try playing around with the regex, but I can't confirm whether that will solve it.
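One regex direction that could be tried, purely as a sketch: a pre-tokenization pass that rejoins a domain whose dot is followed by a stray space, such as `facebook. com`. Everything here is an assumption for illustration: the function name `rejoin_broken_domains`, the short TLD list, and the idea of running it before the tokenizer are not part of malaya.

```python
import re

# Hypothetical, deliberately short list of top-level domains for illustration.
_TLDS = r'(?:com|net|org)'

# Matches "<label>. <tld>" where whitespace follows the dot, e.g. "facebook. com".
_BROKEN_DOMAIN = re.compile(rf'\b([A-Za-z0-9-]+)\.\s+({_TLDS})\b', re.IGNORECASE)


def rejoin_broken_domains(text: str) -> str:
    # Drop the whitespace between the dot and the TLD; leave everything else as is.
    return _BROKEN_DOMAIN.sub(r'\1.\2', text)


s = 'Gambar ni membantu. Gambar tutorial >>. facebook. com/story. story_fbid=10206183032200965&id=1418962070'
print(rejoin_broken_domains(s))
```

This only repairs the domain part. The `story. story_fbid` gap is left alone, since it is not clear from the data what the original separator was, and ordinary sentence boundaries such as `membantu. Gambar` are not touched.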

huseinzol05 avatar Nov 30 '21 07:11 huseinzol05