prose Use own tokenizer?

Open SUP3RIA opened this issue 6 years ago • 2 comments

I have a list of smiley codes, e.g. :lough:, which gets tokenized to ":lough" and ":". How can I make sure that ":lough:" stays a single token, ":lough:"?

Sep 22 '18 20:09 SUP3RIA

Unfortunately, as I alluded to in #30, making the document-creation process extensible wasn't something that I was able to accomplish with v2.0.0. So, there's really no "good" way of customizing tokenization at the moment.

Of course (depending on your specific needs), you could add a preprocessing step that replaces all instances of :lough: (and the like) with a placeholder token like EMOJI (where EMOJI is a unique identifier for a particular smiley code, i.e., its name) that won't be split during tokenization.
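A minimal sketch of that preprocessing idea, using only the Go standard library. The helper names `protectSmileys` and `restoreSmileys` are hypothetical, the `EMOJI0`, `EMOJI1`, ... placeholder scheme is an assumption, and `strings.Fields` only stands in for the real tokenizer. Whether a given tokenizer leaves such placeholders intact should be checked against its actual output.

```go
package main

import (
	"fmt"
	"strings"
)

// protectSmileys swaps each known smiley code for a purely alphanumeric
// placeholder (EMOJI0, EMOJI1, ...) and returns the rewritten text together
// with a placeholder-to-code lookup table for restoring the tokens later.
// Assumption: the input text does not already contain these placeholders.
func protectSmileys(text string, codes []string) (string, map[string]string) {
	lookup := make(map[string]string, len(codes))
	pairs := make([]string, 0, 2*len(codes))
	for i, code := range codes {
		placeholder := fmt.Sprintf("EMOJI%d", i)
		lookup[placeholder] = code
		pairs = append(pairs, code, placeholder)
	}
	return strings.NewReplacer(pairs...).Replace(text), lookup
}

// restoreSmileys maps placeholder tokens back to their original smiley
// codes and leaves every other token unchanged.
func restoreSmileys(tokens []string, lookup map[string]string) []string {
	restored := make([]string, len(tokens))
	for i, tok := range tokens {
		if code, ok := lookup[tok]; ok {
			restored[i] = code
		} else {
			restored[i] = tok
		}
	}
	return restored
}

func main() {
	codes := []string{":lough:", ":wink:"}
	text, lookup := protectSmileys("That was great :lough: see you :wink:", codes)

	// strings.Fields is a stand-in here; the protected text would be
	// passed to the real tokenizer instead.
	tokens := strings.Fields(text)

	fmt.Println(restoreSmileys(tokens, lookup))
}
```

Working from an explicit list of codes, rather than a pattern such as "anything between two colons", avoids accidentally rewriting text like timestamps (e.g. 10:30:45).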

Sep 23 '18 01:09 jdkato

that won't be split during tokenization.

contact me cause e-s CROW ;)

Sep 24 '18 09:09 rollins123
