Sentencize a list of tokens that have been manually tokenized by adding spaces

Open BLKSerene opened this issue 6 years ago • 1 comment

Hi, I'm wondering whether it is possible to run sentence tokenization on a list of tokens that have already been word-tokenized, without breaking the original word tokenization.

I tried the answer in #38, but it seems that it no longer works in pybo 0.6.4.

>>> text = 'བཀུར་བ ར་ མི་འགྱུར་ ཞིང༌ ། ། བརྙས་བཅོས་ མི་ སྙན་ རྗོད་པ ར་ བྱེད ། ། དབང་ དང་ འབྱོར་པ་ ལྡན་པ་ ཡི ། ། རྒྱལ་རིགས་ ཕལ་ཆེ ར་ བག་མེད་པ ས ། ། མྱོས་པ འི་ གླང་ཆེན་ བཞིན་ དུ་ འཁྱམས ། ། དེ་ ཡི་ འཁོར་ ཀྱང་ དེ་ འདྲ ར་ འགྱུར ། ། གཞན་ ཡང་ རྒྱལ་པོ་ རྒྱལ་རིགས་ ཀྱི ། ། སྤྱོད་པ་ བཟང་ངན་ ཅི་འདྲ་བ ། ། དེ་ འདྲ འི་ ཚུལ་ ལ་ བལྟས་ ནས་ སུ ། ། འབངས་ རྣམས་ དེ་དང་དེ་ འདྲ་ སྟེ ། ། རྒྱལ་པོ་ ནོ ར་ ལ་ བརྐམས་ གྱུར་ ན ། ། ནོ ར་ གྱིས་ རྒྱལ་ཁྲིམས་ བསླུ་བ ར་ རྩོམ ། ། མི་བདག་ གཡེམ་ ལ་ དགའ་ གྱུར་ ན ། ། འཕྱོན་མ འི་ ཚོགས་ རྣམས་ མགོ་འཕང་ མཐོ ། ། ཕྲ་མ ར་ ཉན་ ན་ དབྱེན་ གྱིས་ གཏོར ། ། བརྟག་དཔྱད་ མི་ ཤེས་ རྫུན་ གྱིས་ སླུ ། ། ང་ ལོ་ ཡང་ན་ ཀུན་ གྱིས་ བསྐྱོད ། ། ངོ་དག ར་ བརྩི་ ན་ ཟོལ་ཚིག་ སྨྲ ། ། དེ་དང་དེ་ ལ་སོགས་པ་ ཡི ། ། མི་བདག་ དེ་ ལ་ གང་ གང་ གིས ། ། བསླུ་བ ར་ རུང་བ འི་ སྐབས་ མཐོང་ ན ། ། གཡོན་ཅན་ ཚོགས་ ཀྱིས་ ཐབས་ དེ་ སེམས ། ། མི་ རྣམས་ རང་འདོད་ སྣ་ཚོགས་ ལ ། ། རྒྱལ་པོ་ ཀུན་ གྱི་ ཐུན་མོང་ ཕྱིར ། ། རྒྱལ་པོས་ བསམ་ གཞིགས་ མ་ བྱས་ ན ། ། ཐ་མ ར་ རྒྱལ་སྲིད་ འཇིག་པ ར་ འགྱུར ། ། ཆེན་པོ འི་ གོ་ས ར་ གནས་པ་ ལ ། ། སྐྱོན་ ཀྱང་ ཡོན་ཏན་ ཡིན་ཚུལ་ དུ ། ། འཁོར་ ངན་ རྣམས་ ཀྱིས་ ངོ་བསྟོད་ སྨྲ ། ། དེ་ཕྱིར་ སྐྱོན་ཡོན་ ཤེས་པ་ དཀའ ། ། ལྷག་པ ར་ རྩོད་ལྡན་ སྙིགས་མ འི་ ཚེ ། ། འཁོར་ གྱི་ ནང་ ན་མ་ རབས་ མང༌ ། ། སྐྱོན་ ཡང་ ཡོན་ཏན་ ལྟར་ མཐོང་ ལ ། ། རང་འདོད་ ཆེ་ ཞིང་ རྒྱལ་པོ་ བསླུ ། ། ཆུས་ དང་ འཁོར་ གྱི་ བདེ་ ཐབས་ ལ ། ། བསམ་ གཞིགས་ བྱེད་པ་ དཀོན་པ འི་ ཕྱིར ། ། རྒྱལ་པོས་ ལེགས་པ ར་ དཔྱད་ ནས་ སུ ། ། བདེན་པ འི་ ངག་ ལས'
>>> tokens = text.split()
>>> pybo.sentence_tokenizer(tokens)
Traceback (most recent call last):
  File "<pyshell#7>", line 1, in <module>
    pybo.sentence_tokenizer(tokens)
  File "D:\Python\lib\site-packages\pybo\tokenizers\sentencetokenizer.py", line 16, in sentence_tokenizer
    sent_indices = get_sentence_indices(tokens)
  File "D:\Python\lib\site-packages\pybo\tokenizers\sentencetokenizer.py", line 43, in get_sentence_indices
    sentence_idx = extract_chunks(is_endpart_n_punct, tokens, 0, previous_end)
  File "D:\Python\lib\site-packages\pybo\tokenizers\sentencetokenizer.py", line 142, in extract_chunks
    if test(subtokens[n - 1], token):
  File "D:\Python\lib\site-packages\pybo\tokenizers\sentencetokenizer.py", line 80, in is_endpart_n_punct
    return is_ending_part(token1) and token2.chunk_type == 'PUNCT'
  File "D:\Python\lib\site-packages\pybo\tokenizers\sentencetokenizer.py", line 75, in is_ending_part
    return token and token.pos == 'PART' \
AttributeError: 'str' object has no attribute 'pos'

Aug 16 '19 02:08 BLKSerene

The error occurs because you're passing sentence_tokenizer() a list of strings, whereas it expects a list of Token objects, which have attributes such as 'pos'.

It is theoretically feasible to turn a list of words into tokens without breaking the original tokenization, but a lot of information that is dynamically derived from the trie would be lost. So I don't really know how useful that would be, especially for sentence tokenization, which runs heuristics on the content of Token objects.
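
To make the trade-off concrete, here is a minimal sketch of what wrapping pre-tokenized words could look like. It does not use pybo at all: `SimpleToken`, `wrap_tokens` and `split_sentences` are hypothetical names invented for this illustration, and the only heuristic is "close a sentence after a run of shad tokens". Anything that relies on 'pos' or other trie-derived attributes is exactly what this approach cannot reproduce.

```python
from dataclasses import dataclass
from typing import List

SHAD = "\u0f0d"  # Tibetan shad, the punctuation mark used in the example text


@dataclass
class SimpleToken:
    """Hypothetical stand-in for a token; NOT pybo's Token class."""
    text: str
    chunk_type: str  # "PUNCT" or "TEXT"; no POS information is available


def wrap_tokens(words: List[str]) -> List[SimpleToken]:
    """Wrap each pre-tokenized word as-is, so word boundaries are preserved."""
    tokens = []
    for word in words:
        is_punct = bool(word) and all(ch == SHAD for ch in word)
        tokens.append(SimpleToken(word, "PUNCT" if is_punct else "TEXT"))
    return tokens


def split_sentences(tokens: List[SimpleToken]) -> List[List[SimpleToken]]:
    """Naive splitter (an assumption, not pybo's heuristics):
    end a sentence after the last token of each run of punctuation."""
    sentences, current = [], []
    for i, token in enumerate(tokens):
        current.append(token)
        next_is_punct = (
            i + 1 < len(tokens) and tokens[i + 1].chunk_type == "PUNCT"
        )
        if token.chunk_type == "PUNCT" and not next_is_punct:
            sentences.append(current)
            current = []
    if current:  # trailing words with no closing punctuation
        sentences.append(current)
    return sentences


if __name__ == "__main__":
    text = "དབང་ དང་ འབྱོར་པ་ ལྡན་པ་ ཡི ། ། རྒྱལ་རིགས་ ཕལ་ཆེ ར་ བག་མེད་པ ས ། །"
    sentences = split_sentences(wrap_tokens(text.split()))
    print(len(sentences))
```

Because each word is wrapped unchanged, joining the `text` fields of all sentences gives back the original token list. The price is that every run of shad ends a sentence, so verse lines would each be treated as a separate sentence.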

I may give it a try, since I have received requests from others too. I'll keep you posted.

Aug 16 '19 09:08 drupchen