tokenizers Support for UniCase encoding

I wanted to tokenise my text in a manner similar to that described in this paper - https://arxiv.org/pdf/2010.11936.pdf

In particular, I want to use Unigram, but add a CamelCase pre-tokeniser alongside Whitespace. However, I also want to lowercase all the tokens after pre-tokenisation. I don't think this is currently possible: Lowercase is a normaliser, so it must run before Split in the pipeline, and by then the case information the split relies on is gone. The actual splitting is easy using the Split pre-tokenizer with one of the expressions from https://stackoverflow.com/questions/1128305/regex-for-pascalcased-words-aka-camelcased-with-leading-uppercase-letter.
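To make the ordering I'm after concrete, here is a rough sketch in plain Python using only the standard `re` module, not the tokenizers API. The function name `camel_case_pre_tokenize` and the regex are just my own illustration (the regex is not taken from the Stack Overflow answers), and this is not the paper's exact scheme. It only shows the order I need: split on whitespace, split on case boundaries, then lowercase.

```python
import re
from typing import List

# Hypothetical pattern for illustration: matches, in order of preference,
#   - a run of capitals followed by a capitalised word ("HTTP" in "HTTPResponse")
#   - an optionally capitalised lowercase word ("parse", "Response")
#   - any remaining run of capitals, digits, or other non-space symbols
_CAMEL_PIECES = re.compile(
    r"[A-Z]+(?=[A-Z][a-z])|[A-Z]?[a-z]+|[A-Z]+|\d+|[^\sA-Za-z\d]+"
)


def camel_case_pre_tokenize(text: str) -> List[str]:
    """Split on whitespace, then on CamelCase boundaries, then lowercase."""
    pieces: List[str] = []
    for word in text.split():
        # Case is still intact here, so the boundaries can be found.
        for piece in _CAMEL_PIECES.findall(word):
            # Lowercasing happens only after the split.
            pieces.append(piece.lower())
    return pieces


if __name__ == "__main__":
    print(camel_case_pre_tokenize("parseHTTPResponse fooBar"))
```

If the lowercasing ran first, as a normaliser does, `parseHTTPResponse` would become `parsehttpresponse` and there would be no boundaries left to split on.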

Jan 16 '21 08:01 david-waterworth

Any specific reason for closing the issue? Did you manage to do what you wanted?

Jan 17 '21 13:01 n1t0

No, I didn't mean to close it. I don't see a way to reopen it, at least from the phone app.

Jan 17 '21 20:01 david-waterworth

This issue is stale because it has been open 30 days with no activity. Remove stale label or comment or this will be closed in 5 days.

Apr 25 '24 01:04 github-actions[bot]