prose
Use own tokenizer?
I have a list of smiley codes, e.g. :lough:, which gets tokenized to ":lough" and ":". How can I make sure that ":lough:" stays a single token, ":lough:"?
Unfortunately, as I alluded to in #30, making the document-creation process extensible wasn't something that I was able to accomplish with v2.0.0. So, there's really no "good" way of customizing tokenization at the moment.
Of course (depending on your specific needs), you could add a preprocessing step that replaces all instances of :lough: (and the like) with a placeholder token like EMOJI (where EMOJI is a unique identifier for a particular smiley code, i.e., its name) that won't be split during tokenization.
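As a rough illustration of that preprocessing idea, here is a minimal sketch using only the Go standard library. The names MaskSmileys, UnmaskToken and the "EMOJI" prefix are hypothetical, not part of prose, and the sketch assumes that a purely alphanumeric string such as EMOJIlough survives tokenization as a single token. strings.Fields stands in for the real tokenizer here.

```go
package main

import (
	"fmt"
	"regexp"
	"strings"
)

// smileyPattern matches codes such as ":lough:" (a name between two colons).
var smileyPattern = regexp.MustCompile(`:([A-Za-z0-9_]+):`)

// placeholderPrefix is a hypothetical marker; pick any alphanumeric string
// that does not otherwise occur at the start of a word in your text.
const placeholderPrefix = "EMOJI"

// MaskSmileys replaces every ":name:" with "EMOJIname" so that the
// tokenizer sees one word-like token instead of a word plus punctuation.
func MaskSmileys(text string) string {
	return smileyPattern.ReplaceAllString(text, placeholderPrefix+"${1}")
}

// UnmaskToken turns a placeholder token back into its smiley code and
// returns every other token unchanged.
func UnmaskToken(token string) string {
	if strings.HasPrefix(token, placeholderPrefix) && len(token) > len(placeholderPrefix) {
		return ":" + strings.TrimPrefix(token, placeholderPrefix) + ":"
	}
	return token
}

func main() {
	masked := MaskSmileys("That was funny :lough: indeed")
	fmt.Println(masked) // That was funny EMOJIlough indeed

	// strings.Fields is only a stand-in for the real tokenizer.
	for _, tok := range strings.Fields(masked) {
		fmt.Println(UnmaskToken(tok))
	}
}
```

After tokenizing the masked text, each token can be passed through UnmaskToken to recover the original code. Note that an ordinary word that happens to start with the chosen prefix would also be converted, so the prefix should be something unlikely to appear in your data.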
contact me cause e-s CROW ;)