Missing English words at the end of the text during sentence tokenization

Open BLKSerene opened this issue 2 years ago • 0 comments

Hi, when sentence-tokenizing Tibetan text that ends with English (or perhaps any non-Tibetan) words, the trailing English part is missing from the results of sentence tokenization.

Another issue is that the norm_sent property of the returned tokenized sentences does not seem to represent English text properly: it adds "་ -" between English words and between English and Tibetan text.

Example

>>> import botok
>>> tokenizer = botok.WordTokenizer()
Loading Trie... (1s.)
>>> text = 'Test this Tibetan string: དུང་དང་འོ་མར་འགྲན་པའི་ལྷག་བསམ་མཐུ། །དམན་ཡང་དཀར་པོའི་བྱས་འབྲས་ཅུང་ཟད་ཅིག །བློ་དང་འདུན་པ་བཟང་བའི་རང་རིགས་ཀུན། །རྒྱལ་ཁའི་འཕྲིན་བཟང་ལས་དོན་འགྲུབ་ཕྱིར་འབད།།. Does detokenization work as expected?'
>>> tokens = tokenizer.tokenize(text)
>>> [token.text for token in tokens] # No problem with word tokenization
['Test this Tibetan string: ', 'དུང་', 'དང་', 'འོ་མ', 'ར་', 'འགྲན་པ', 'འི་', 'ལྷག་བསམ་', 'མཐུ', '། །', 'དམན་', 'ཡང་', 'དཀར་པོ', 'འི་', 'བྱས་འབྲས་', 'ཅུང་ཟད་', 'ཅིག', ' །', 'བློ་', 'དང་', 'འདུན་པ་', 'བཟང་བ', 'འི་', 'རང་རིགས་', 'ཀུན', '། །', 'རྒྱལ་ཁ', 'འི་', 'འཕྲིན་', 'བཟང་', 'ལས་དོན་', 'འགྲུབ་', 'ཕྱིར་', 'འབད', '།།', '. Does detokenization work as expected?']

>>> for sentence_tokens in botok.sentence_tokenizer(tokens):
...     print(''.join([sentence_token.text for sentence_token in sentence_tokens['tokens']]))
...
Test this Tibetan string: དུང་དང་འོ་མར་འགྲན་པའི་ལྷག་བསམ་མཐུ། །དམན་ཡང་དཀར་པོའི་བྱས་འབྲས་ཅུང་ཟད་ཅིག །
བློ་དང་འདུན་པ་བཟང་བའི་རང་རིགས་ཀུན། །རྒྱལ་ཁའི་འཕྲིན་བཟང་ལས་དོན་འགྲུབ་ཕྱིར་འབད།།

>>> for sentence in botok.sentence_tokenizer(tokens):
...     print(sentence['norm_sent'])
...
Test་ -this་ -Tibetan་ -string:་ -དུང་ དང་ འོ་མ་ -ར་ འགྲན་པ་ -འི་ ལྷག་བསམ་ མཐུ་ ། ། དམན་ ཡང་ དཀར་པོ་ -འི་ བྱས་འབྲས་ ཅུང་ཟད་ ཅིག་ །
བློ་ དང་ འདུན་པ་ བཟང་བ་ -འི་ རང་རིགས་ ཀུན་ ། ། རྒྱལ་ཁ་ -འི་ འཕྲིན་ བཟང་ ལས་དོན་ འགྲུབ་ ཕྱིར་ འབད་ །།

Expected output (the missing part is the trailing ". Does detokenization work as expected?")

Test this Tibetan string: དུང་དང་འོ་མར་འགྲན་པའི་ལྷག་བསམ་མཐུ། །དམན་ཡང་དཀར་པོའི་བྱས་འབྲས་ཅུང་ཟད་ཅིག ། བློ་དང་འདུན་པ་བཟང་བའི་རང་རིགས་ཀུན། །རྒྱལ་ཁའི་འཕྲིན་བཟང་ལས་དོན་འགྲུབ་ཕྱིར་འབད།།. Does detokenization work as expected?
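
To check for the dropped text programmatically, one option is to compare the word-level token texts with the token texts that come back grouped into sentences. The following is only a minimal sketch: `dropped_text` is a hypothetical helper (not part of botok), and it assumes the sentences preserve the original token order, as they do in the example above.

```python
def dropped_text(token_texts, sentences):
    """Return token texts present in token_texts but absent from sentences.

    token_texts: list of str, the word-level token texts in order.
    sentences:   list of lists of str, the token texts of each sentence.
    Hypothetical helper; assumes sentences keep the original token order.
    """
    # Flatten the sentences back into one ordered list of token texts.
    covered = [text for sentence in sentences for text in sentence]
    missing = []
    i = 0
    for text in token_texts:
        if i < len(covered) and covered[i] == text:
            i += 1  # this token survived sentence tokenization
        else:
            missing.append(text)  # this token was dropped
    return missing


# Toy data shaped like the example above (not real botok output).
words = ['Test: ', 'དུང་', 'དང་', '།།', '. Does it work?']
sents = [['Test: ', 'དུང་', 'དང་', '།།']]
print(dropped_text(words, sents))
```

With the session above, the inputs would be `[token.text for token in tokens]` and `[[t.text for t in s['tokens']] for s in botok.sentence_tokenizer(tokens)]`; an empty result would mean no text was lost.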

Environment

OS: Windows 11 x64
Python: 3.10.13 x64
botok: 0.8.12

Sep 25 '23 18:09 BLKSerene