
Support for UniCase encoding

Open david-waterworth opened this issue 4 years ago • 3 comments

I wanted to tokenise my text in a manner similar to that described in this paper - https://arxiv.org/pdf/2010.11936.pdf

In particular, I want to use Unigram with a CamelCase pre-tokeniser alongside Whitespace, and then lowercase all the tokens after pre-tokenisation. I don't think this is currently possible: Lowercase is a normaliser, so it must appear before Split in the pipeline. The splitting itself is easy, using the Split pre-tokenizer with one of the expressions from https://stackoverflow.com/questions/1128305/regex-for-pascalcased-words-aka-camelcased-with-leading-uppercase-letter.
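
To make the desired ordering concrete, here is a minimal sketch in plain Python (standard library `re` only, not the tokenizers pipeline) that splits on whitespace and CamelCase boundaries first and lowercases the pieces afterwards. The function name `camel_split_lower` and the pattern `_PIECE` are hypothetical, and the regex is an illustrative one rather than any of the expressions from the Stack Overflow link.

```python
import re

# Illustrative pattern (an assumption, not taken from the linked answers):
#   1. a run of capitals that is followed by a capitalised word ("HTTP" in "HTTPResponse")
#   2. an optionally capitalised lowercase word ("parse", "Response")
#   3. any remaining run of capitals
#   4. a run of digits
#   5. a run of other non-space characters (punctuation, symbols)
_PIECE = re.compile(r"[A-Z]+(?=[A-Z][a-z])|[A-Z]?[a-z]+|[A-Z]+|\d+|[^\sA-Za-z\d]+")


def camel_split_lower(text: str) -> list[str]:
    """Split on whitespace and CamelCase boundaries, then lowercase each piece."""
    # Splitting must see the original casing, so lowercasing happens last.
    return [piece.lower() for piece in _PIECE.findall(text)]


print(camel_split_lower("parseHTTPResponse fooBar"))
# ['parse', 'http', 'response', 'foo', 'bar']
```

Lowercasing the whole string first would erase the case boundaries the split relies on, which is why the order matters here. The sketch only handles ASCII letters, and the resulting pieces would still need to be handed to the tokenizer, which is not shown.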

david-waterworth, Jan 16 '21 08:01

Any specific reason for closing the issue? Did you manage to do what you wanted?

n1t0, Jan 17 '21 13:01

No I didn't mean to close it. I don't see a way of reopening (at least using the phone app)?

david-waterworth, Jan 17 '21 20:01

This issue is stale because it has been open 30 days with no activity. Remove stale label or comment or this will be closed in 5 days.

github-actions[bot], Apr 25 '24 01:04