Possible Feature: Use lambda function for out_of_vocabulary_token_option

Open keshprad opened this issue 3 years ago • 3 comments

Let me know what you think of allowing users to specify their own lambda function if they aren't satisfied with the built-in out-of-vocabulary options.

I can work on this in my fork and create a PR.

keshprad, Jul 03 '21 16:07

Yes, we can do that.

I would prefer extracting the logic into a member function, out_of_vocabulary_handler, and adding instructions to the README on how users can override it with their own custom implementation.
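A minimal sketch of how such an overridable handler might look. Only the name out_of_vocabulary_handler comes from this thread. The class SketchTrueCaser, its dict-based vocabulary, and the whitespace tokenization are hypothetical stand-ins for illustration, not the library's actual implementation.

```python
class SketchTrueCaser:
    """Hypothetical stand-in for illustration; not the real truecase class."""

    def __init__(self, vocabulary):
        # Assumption: the vocabulary maps a lowercased token to its preferred casing.
        self.vocabulary = dict(vocabulary)

    def out_of_vocabulary_handler(self, token):
        # Default policy in this sketch: title-case unknown tokens.
        return token.title()

    def get_true_case(self, sentence):
        cased = []
        for token in sentence.split():
            key = token.lower()
            if key in self.vocabulary:
                cased.append(self.vocabulary[key])
            else:
                # All out-of-vocabulary decisions go through one overridable method.
                cased.append(self.out_of_vocabulary_handler(token))
        return " ".join(cased)


class LowercaseOOVTrueCaser(SketchTrueCaser):
    """A user-defined subclass that lowercases unknown tokens instead."""

    def out_of_vocabulary_handler(self, token):
        return token.lower()
```

With this shape, a user changes the out-of-vocabulary behavior by subclassing and overriding one method, without touching the rest of the casing logic.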

What do you think?

daltonfury42, Jul 03 '21 16:07

Yes, that's good. I'll work up an implementation.

In addition to out_of_vocabulary, an out_of_dictionary option could also be useful in a later update. This could be an early-stage way to differentiate between names and words that are simply not in the vocabulary.

For example: "hip-hop" is not in vocabulary, but is certainly a word. I would want it in lowercase.

However, my name (Keshav) is not in the vocabulary and won't be found in a dictionary. I'd want to capitalize "Keshav."

This certainly won't work for all names, since some names are also words in the dictionary, e.g. "Trump".
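The idea above could be sketched roughly as follows. The function case_token, the dict-based vocabulary, and the plain set used as a dictionary are all hypothetical and chosen only for illustration. Nothing here reflects an existing out_of_dictionary option in the library.

```python
def case_token(token, vocabulary, dictionary):
    """Pick a casing for one token (hypothetical helper, not a library API).

    vocabulary: dict mapping lowercased tokens to their preferred casing.
    dictionary: set of lowercased ordinary words.
    """
    key = token.lower()
    if key in vocabulary:
        # Known to the model: use its preferred casing.
        return vocabulary[key]
    if key in dictionary:
        # Out of vocabulary but an ordinary word: keep it lowercase.
        return key
    # Out of vocabulary and out of dictionary: treat it as a probable name.
    return token.capitalize()


vocabulary = {"i": "I", "like": "like"}
dictionary = {"hip-hop", "trump"}

print(case_token("hip-hop", vocabulary, dictionary))
print(case_token("keshav", vocabulary, dictionary))
print(case_token("trump", vocabulary, dictionary))
```

The last call shows the limitation mentioned above: because "trump" is a dictionary word, this heuristic leaves it lowercase even when it is meant as a name.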

keshprad, Jul 03 '21 17:07

Another thing to consider: if the first word is classified as "out_of_vocabulary", should we capitalize it, or just go along with the user's out_of_vocabulary_token_option?

Currently, it is the latter; however, I think we should capitalize it.
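A rough sketch of the proposed behavior, assuming a token list, a dict-based vocabulary, and a callable out-of-vocabulary handler. The function truecase_tokens and the flag capitalize_first_oov are hypothetical names for illustration, not part of the library.

```python
def truecase_tokens(tokens, vocabulary, oov_handler, capitalize_first_oov=True):
    """Case a list of tokens (hypothetical sketch, not the library's code)."""
    result = []
    for index, token in enumerate(tokens):
        key = token.lower()
        if key in vocabulary:
            result.append(vocabulary[key])
        elif index == 0 and capitalize_first_oov:
            # Proposed: a sentence-initial unknown token is always capitalized,
            # regardless of the user's out-of-vocabulary choice.
            result.append(token[:1].upper() + token[1:].lower())
        else:
            # Everywhere else, defer to the user's out-of-vocabulary policy.
            result.append(oov_handler(token))
    return result


vocabulary = {"is": "is", "great": "great"}
print(truecase_tokens(["hip-hop", "is", "great"], vocabulary, str.lower))
print(truecase_tokens(["hip-hop", "is", "great"], vocabulary, str.lower,
                      capitalize_first_oov=False))
```

Setting capitalize_first_oov to False reproduces the current behavior described above, where the first word simply follows the user's out-of-vocabulary option.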

keshprad, Jul 03 '21 17:07