keras-nlp
Add a vocabulary_size argument to WordPieceTokenizer
We should add a `vocabulary_size` argument to the `WordPieceTokenizer` layer that forces the vocabulary size by truncating the passed-in vocabulary if necessary.
Potential docstring:
```python
vocabulary_size: Force the vocabulary to be exactly `vocabulary_size`,
    by truncating the input vocabulary if necessary. This is not
    equivalent to retraining a word piece vocabulary from scratch, but
    can be useful for quick hyperparameter tuning.
```
Some other notes:
- If the `vocabulary_size` argument is passed, calling `layer.vocabulary_size()` should always match what was passed.
- If the vocabulary file is shorter than the forced vocabulary size, we can log a warning:
  > Warning: Setting vocab size to a larger value than the input vocabulary file. Some token ids will never be output from the tokenizer.
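The behavior described in the notes could be sketched as a small helper. This is a hypothetical illustration (the function name `load_vocabulary` is an assumption, not the actual keras-nlp implementation):

```python
import warnings

def load_vocabulary(vocabulary, vocabulary_size=None):
    # Hypothetical sketch of the proposed truncation behavior;
    # not the actual keras-nlp code.
    vocabulary = list(vocabulary)
    if vocabulary_size is None:
        return vocabulary
    if vocabulary_size > len(vocabulary):
        # The input vocabulary is shorter than the forced size.
        warnings.warn(
            "Setting vocab size to a larger value than the input "
            "vocabulary file. Some token ids will never be output "
            "from the tokenizer."
        )
        return vocabulary
    # Truncate to exactly `vocabulary_size` entries.
    return vocabulary[:vocabulary_size]
```

For example, `load_vocabulary(["[PAD]", "[UNK]", "the", "quick"], vocabulary_size=2)` would return `["[PAD]", "[UNK]"]`, while a `vocabulary_size` of 10 would warn and return the full four-entry list.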
hi @mattdangerw, would love to work on this.
Thank you!
@blackhat-coder Any updates on this? This would actually be a useful hyperparameter to tune in our first guide and could help reduce training time.
I'm sorry for the delay; I've been busy with school work. I'll send in a PR ASAP.
Thank you! Let me know if there are any questions I can help with.
Hi @mattdangerw, how do I run the tests? I have currently written test cases.
Check out the environment and test running sections of our contributing guide.
https://github.com/keras-team/keras-nlp/blob/master/CONTRIBUTING.md#setting-up-an-environment
If something is broken or unclear there, let us know!
Hey @blackhat-coder, are you still working on this?
Hey @mattdangerw, can I contribute to this?
Hey, I would like to take this.
I would like to contribute to this.
@mattdangerw, I would like to contribute to this.
@jbischof
Context: inside the constructor (`__init__`) of `WordPieceTokenizer` we can assign the vocabulary in one of two ways:
1. Make changes in the code below, reading from the file only up to the vocabulary size:
```python
if isinstance(vocabulary, str):
    self.vocabulary = [
        line.rstrip() for line in tf.io.gfile.GFile(vocabulary)
    ]
elif isinstance(vocabulary, Iterable):
    # Make a copy.
    self.vocabulary = list(vocabulary)
else:
    raise ValueError(
        "Vocabulary must be an file path or list of terms. "
        f"Received: vocabulary={vocabulary}"
    )
```
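Approach 1 could be sketched roughly as follows, stopping the file read after `vocabulary_size` lines so an oversized file is never fully materialized. This uses the built-in `open` instead of `tf.io.gfile.GFile` only to keep the example self-contained; the function name is hypothetical, not the actual patch:

```python
from itertools import islice

def read_vocabulary_file(path, vocabulary_size=None):
    # Lazily read lines and stop after `vocabulary_size` entries,
    # so the remainder of the file is never loaded into memory.
    with open(path) as f:
        lines = (line.rstrip() for line in f)
        if vocabulary_size is not None:
            lines = islice(lines, vocabulary_size)
        return list(lines)
```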
2. Let it do its thing here and make changes in all the functions, passing only the slice of the vocabulary list, something like below:
```python
def get_vocabulary(self) -> List[str]:
    """Get the tokenizer vocabulary as a list of strings tokens."""
    return self.vocabulary[:self.vocabulary_size]
```
Something similar would be done in all the other functions.
Question: Can you give me some suggestions on how I should move forward?
@sahusiddharth you probably want approach #1, otherwise there's no way to use this argument to load a vocabulary file too large to fit in memory.
@jbischof
Context: The task is to add an argument named `vocabulary_size`, but there is a method with the same name in `WordPieceTokenizer`:
```python
def vocabulary_size(self) -> int:
    """Get the size of the tokenizer vocabulary."""
    return len(self.vocabulary)
```
Can you suggest a name other than `vocabulary_size`?
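The collision is easy to demonstrate: in Python, an instance attribute named `vocabulary_size` shadows the method of the same name, so the existing API would break. A minimal sketch with a hypothetical `Toy` class (not the real layer):

```python
class Toy:
    def __init__(self, vocabulary, vocabulary_size=None):
        self.vocabulary = list(vocabulary)
        # Storing the argument under the method's name shadows the
        # method on this instance.
        self.vocabulary_size = vocabulary_size

    def vocabulary_size(self):
        """Get the size of the tokenizer vocabulary."""
        return len(self.vocabulary)

t = Toy(["a", "b"], vocabulary_size=2)
try:
    t.vocabulary_size()
except TypeError:
    # The attribute (an int) shadows the method, so the call fails.
    pass
```

A common resolution is to store the argument privately (e.g. as `self._vocabulary_size`) so the public `vocabulary_size()` method keeps its name.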
Hi, is this issue still open? I'd love to contribute to it.