
understanding custom pipelines

mikkokotila opened this issue · 3 comments

In the toy example below, my expectation is to get a tokenized version of the input text. The code does produce a list of tokens, but the tokens are single syllables rather than words.

from botok import Trie, BoSyl, Tokenize, Config, TokChunks

in_str = '༈ བློ་ཆོས་སུ་འགྲོ་བར་བྱིན་གྱིས་རློབས། །ཆོས་ལམ་དུ་འགྲོ་བར་བྱིན་གྱིས་རློབས། །ལམ་འཁྲུལ་བ་ཞིག་པར་བྱིན་གྱིས་རློབས། །འཁྲུལ་པ་ཡེ་ཤེས་སུ་འཆར་བར་བྱིན་གྱིས་རློབས། །'

profile = "empty"
config = Config()
trie = Trie(BoSyl, profile, config, [])
tok = Tokenize(trie)
preproc = TokChunks(in_str)
preproc.serve_syls_to_trie()
tokens = tok.tokenize(preproc)

out = []

for token in tokens:
    out.append(token['text'])

out

How can I change the above code to get words instead of syllables?

mikkokotila · Jul 28 '20

I suggest you use the latest version of botok. We have simplified the botok config, which in turn simplifies building custom pipelines.

In the latest version we have introduced dialect packs, which are similar to the profiles in previous versions but do a bit more. Each dialect pack has two components: Dictionary and Adjustments.

The Dictionary component contains the standardized word lists and rules (to adjust segmentation) used for tokenization. The Adjustments component is for researching and testing segmentation; its content will eventually be merged into the Dictionary component. Adjustments can also be used to customize the default tokenization.
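For orientation, both components are visible on a default Config object. The attribute names below are taken from the corrected example further down, so treat this as a sketch rather than a complete picture of the Config API:

from botok import Config

# Load the default dialect pack and inspect its two components.
config = Config()
print(config.profile)            # name of the active profile
print(config.dictionary)         # Dictionary component (word lists and rules)
print(config.adjustments)        # Adjustments component (experimental overrides)
print(config.dialect_pack_path)  # location of the dialect pack on disk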

With the "empty" profile and an empty custom word list, the trie has no word data to match, so tokenization falls back to single syllables. To get the expected output for the toy example above, here is the corrected version:

from botok import BoSyl, Config, TokChunks, Tokenize, Trie

in_str = "༈ བློ་ཆོས་སུ་འགྲོ་བར་བྱིན་གྱིས་རློབས། །ཆོས་ལམ་དུ་འགྲོ་བར་བྱིན་གྱིས་རློབས། །ལམ་འཁྲུལ་བ་ཞིག་པར་བྱིན་གྱིས་རློབས། །འཁྲུལ་པ་ཡེ་ཤེས་སུ་འཆར་བར་བྱིན་གྱིས་རློབས། །"

config = Config()
trie = Trie(
    BoSyl,
    profile=config.profile,
    main_data=config.dictionary,      # Dictionary component of the dialect pack
    custom_data=config.adjustments,   # Adjustments component of the dialect pack
    pickle_path=config.dialect_pack_path.parent,
)
tok = Tokenize(trie)
preproc = TokChunks(in_str)
preproc.serve_syls_to_trie()
tokens = tok.tokenize(preproc)

out = []

for token in tokens:
    out.append(token["text"])

print(out)

Output:

['༈ ', 'བློ་', 'ཆོས་', 'སུ་', 'འགྲོ་བར་', 'བྱིན་', 'གྱིས་', 'རློབས', '། །', 'ཆོས་', 'ལམ་', 'དུ་', 'འགྲོ་བར་', 'བྱིན་', 'གྱིས་', 'རློབས', '། །', 'ལམ་', 'འཁྲུལ་བ་', 'ཞིག་པར་', 'བྱིན་', 'གྱིས་', 'རློབས', '། །', 'འཁྲུལ་', 'པ་', 'ཡེ་ཤེས་', 'སུ་', 'འཆར་བར་', 'བྱིན་', 'གྱིས་', 'རློབས', '། །']

You can also refer to the WordTokenizer class at https://github.com/Esukhia/botok/blob/7d85cbb0df62ff4c9da3c70088ad671f03472a18/botok/tokenizers/wordtokenizer.py#L28 to customize the adjustment rules.
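If you don't need to assemble the pipeline by hand, the linked WordTokenizer wraps these steps behind a single interface. A minimal sketch, assuming it can be constructed with defaults and exposes a tokenize method (check the linked source for the exact signature):

from botok import WordTokenizer

# High-level interface wrapping Config, Trie, TokChunks and Tokenize.
wt = WordTokenizer()
tokens = wt.tokenize("བློ་ཆོས་སུ་འགྲོ་བར་བྱིན་གྱིས་རློབས།")
print([token["text"] for token in tokens])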

PS: We will be releasing botok documentation soon.

10zinten · Jul 29 '20

How wonderful!

As for dialect packs, my use case is 100% Buddhadharma texts. Do you have a recommendation for which dialect pack to use?

mikkokotila · Jul 30 '20

Currently we only have a dialect pack for the general Tibetan language. Our research team is working on a dialect pack for Buddhadharma texts. Until then, you can experiment with the general dialect pack to improve the segmentation.

We will also be releasing detailed documentation on customizing any dialect pack.
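In the meantime, here is a sketch of where your own data would plug into the pipeline from the corrected example above. Note that my_adjustments is a hypothetical folder of custom adjustment files; the expected file format is not covered here:

from pathlib import Path
from botok import BoSyl, Config, Tokenize, Trie

# Hypothetical folder holding your own adjustment files.
my_adjustments = list(Path("my_adjustments").glob("*.tsv"))

config = Config()
trie = Trie(
    BoSyl,
    profile=config.profile,
    main_data=config.dictionary,   # keep the general Dictionary component
    custom_data=my_adjustments,    # swap in your experimental data
    pickle_path=config.dialect_pack_path.parent,
)
tok = Tokenize(trie)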

10zinten · Jul 31 '20