Support for UniCase encoding
I wanted to tokenise my text in a manner similar to that described in this paper - https://arxiv.org/pdf/2010.11936.pdf
In particular, I want to use Unigram with a CamelCase pre-tokeniser alongside Whitespace, and then lowercase all the tokens after pre-tokenisation. I don't think this is currently possible: Lowercase is a normaliser, so it must run before Split in the pipeline, by which point the case information needed for the split is already gone. The splitting itself is easy using the Split pre-tokenizer with one of the expressions from https://stackoverflow.com/questions/1128305/regex-for-pascalcased-words-aka-camelcased-with-leading-uppercase-letter.
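To make the desired ordering concrete, here is a minimal sketch in plain Python of "split on whitespace and CamelCase boundaries first, lowercase afterwards". It uses only the standard `re` module, not the tokenizers API. The function name `split_then_lowercase` and the regex are my own illustration (assumptions, not taken from the paper or the Stack Overflow answer).

```python
import re

# Hypothetical CamelCase pattern (an assumption for illustration):
#   - a run of capitals not followed by a lowercase letter (acronyms),
#   - an optional capital followed by lowercase letters (ordinary words),
#   - a run of digits,
#   - a run of anything else that is not whitespace.
_CAMEL = re.compile(r"[A-Z]+(?![a-z])|[A-Z]?[a-z]+|\d+|[^\sA-Za-z\d]+")


def split_then_lowercase(text):
    """Split on whitespace, then on CamelCase boundaries, then lowercase."""
    pieces = []
    for word in text.split():
        # The case information is still intact here, so the split works.
        pieces.extend(_CAMEL.findall(word))
    # Lowercasing happens last, which is the step a normaliser cannot do.
    return [piece.lower() for piece in pieces]


print(split_then_lowercase("parseHTTPResponse CamelCase tokens"))
```

Lowercasing first would turn `CamelCase` into `camelcase` before the split runs, so the boundary could no longer be found. That is the ordering problem described above.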
Any specific reason for closing the issue? Did you manage to do what you wanted?
No, I didn't mean to close it. I don't see a way of reopening it (at least not in the phone app).
This issue is stale because it has been open 30 days with no activity. Remove stale label or comment or this will be closed in 5 days.