pragmatic_tokenizer
feature overlap with pragmatic_segmenter?
Currently there is some overlap between pragmatic_tokenizer and pragmatic_segmenter; for example, both handle abbreviations. Should rules and constants (especially language-specific ones) that are shared between the two gems be extracted into a sub-gem? Or is there too little shared code to justify this?
And/or: should constant arrays and hashes be converted from Ruby to .yml files? Maybe the app would then load them only once, even if both gems use them?
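To make the "load once" idea concrete, here is a minimal sketch of a shared loader that both gems could call. `SharedLanguageData`, the file name, and the `abbreviations` key are all hypothetical and do not exist in either gem. Whether this would actually save memory or time has not been measured here.

```ruby
require "yaml"
require "tempfile"

# Hypothetical shared module (not part of pragmatic_tokenizer or
# pragmatic_segmenter). It parses each YAML file at most once per process
# and hands every caller the same frozen object.
module SharedLanguageData
  @cache = {}
  @mutex = Mutex.new

  # Returns the parsed, frozen contents of the YAML file at +path+.
  # Later calls with the same path return the cached object.
  def self.load(path)
    @mutex.synchronize do
      @cache[path] ||= deep_freeze(YAML.safe_load(File.read(path)))
    end
  end

  # Recursively freezes hashes, arrays, and their elements so that one gem
  # cannot accidentally mutate data the other gem relies on.
  def self.deep_freeze(obj)
    case obj
    when Hash  then obj.each { |k, v| deep_freeze(k); deep_freeze(v) }
    when Array then obj.each { |v| deep_freeze(v) }
    end
    obj.freeze
  end
end

# Usage with a throwaway file standing in for a per-language data file.
file = Tempfile.new(["en", ".yml"])
file.write("abbreviations:\n  - dr\n  - mr\n  - etc\n")
file.close

first  = SharedLanguageData.load(file.path) # parses the file
second = SharedLanguageData.load(file.path) # served from the cache
```

Note that plain Ruby constants in a file loaded with `require` are also evaluated only once per process, so the single-copy benefit comes from both gems depending on the same shared module rather than from YAML itself. YAML mainly makes the data easier to edit without touching code.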
I'd definitely be open to this if it reduced memory use, improved speed, or made the gems easier to maintain. It's not high on my priority list right now, but I would of course be open to pull requests.