
Add a vocabulary_size argument to WordPieceTokenizer

Open mattdangerw opened this issue 2 years ago • 18 comments

We should add a `vocabulary_size` argument to the `WordPieceTokenizer` layer that forces the vocabulary size by truncating the passed-in vocabulary if necessary.

Potential docstring:

        vocabulary_size: Force the vocabulary to be exactly `vocabulary_size`,
            by truncating the input vocabulary if necessary. This is not
            equivalent to retraining a word piece vocabulary from scratch, but
            can be useful for quick hyperparameter tuning.
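
For example, usage could look something like this (a hypothetical sketch of the proposed argument; `vocab.txt` is a placeholder path):

    import keras_nlp

    # Hypothetical: `vocabulary_size` is the proposed new argument.
    # Only the first 10,000 entries of vocab.txt would be kept.
    tokenizer = keras_nlp.tokenizers.WordPieceTokenizer(
        vocabulary="vocab.txt",
        vocabulary_size=10_000,
    )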

mattdangerw avatar Apr 30 '22 00:04 mattdangerw

Some other notes:

  • If the `vocabulary_size` argument is passed, calling `layer.vocabulary_size()` should always match what was passed.
  • If the vocabulary file is shorter than the forced vocabulary size, we can log a warning: `Warning: Setting vocab size to a larger value than the input vocabulary file. Some token ids will never be output from the tokenizer.`
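
A minimal sketch of the truncate-and-warn behavior described above (a hypothetical helper, not actual keras-nlp code):

    import warnings

    def _truncate_vocabulary(vocabulary, vocabulary_size):
        # Hypothetical helper illustrating the notes above.
        if vocabulary_size is None:
            return vocabulary
        if vocabulary_size > len(vocabulary):
            warnings.warn(
                "Setting vocab size to a larger value than the input "
                "vocabulary file. Some token ids will never be output "
                "from the tokenizer."
            )
            return vocabulary
        # Force the vocabulary to exactly `vocabulary_size` entries.
        return vocabulary[:vocabulary_size]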

mattdangerw avatar Apr 30 '22 00:04 mattdangerw

Hi @mattdangerw, would love to work on this.

blackhat-coder avatar May 03 '22 10:05 blackhat-coder

Thank you!

mattdangerw avatar May 03 '22 17:05 mattdangerw

@blackhat-coder Any updates on this? This would actually be a useful hyperparameter to tune in our first guide and could help reduce training time.

mattdangerw avatar May 18 '22 23:05 mattdangerw

I'm sorry for the delay; I've been busy with schoolwork. I'll send in a PR ASAP.

blackhat-coder avatar May 19 '22 10:05 blackhat-coder

Thank you! Let me know if there are any questions I can help with.

mattdangerw avatar May 19 '22 23:05 mattdangerw

Hi @mattdangerw, how do I run the tests? I've currently written test cases.

blackhat-coder avatar May 21 '22 15:05 blackhat-coder

Check out the environment and test running sections of our contributing guide.

https://github.com/keras-team/keras-nlp/blob/master/CONTRIBUTING.md#setting-up-an-environment

If something is broken or unclear there, let us know!

mattdangerw avatar May 24 '22 03:05 mattdangerw

Hey @blackhat-coder, are you still working on this?

aflah02 avatar Sep 05 '22 06:09 aflah02

Hey @mattdangerw, can I contribute to this?

lordgavy01 avatar Dec 28 '22 11:12 lordgavy01

Hey, I would like to take this.

jayam30 avatar Feb 23 '23 15:02 jayam30

I would like to contribute to this

atharvapurdue avatar Feb 24 '23 20:02 atharvapurdue

@mattdangerw, I would like to contribute to this.

sahusiddharth avatar Jul 29 '23 17:07 sahusiddharth

@jbischof

Context: inside the constructor (`__init__`) of `WordPieceTokenizer` we can assign the vocabulary in one of two ways:

1. Make changes in the code below, reading from the file only up to the vocabulary size:

    if isinstance(vocabulary, str):
        self.vocabulary = [
            line.rstrip() for line in tf.io.gfile.GFile(vocabulary)
        ]
    elif isinstance(vocabulary, Iterable):
        # Make a copy.
        self.vocabulary = list(vocabulary)
    else:
        raise ValueError(
            "Vocabulary must be a file path or list of terms. "
            f"Received: vocabulary={vocabulary}"
        )
2. Let it read the full vocabulary here and instead make changes in all the functions, passing only a slice of the vocabulary list, something like below:
    def get_vocabulary(self) -> List[str]:
        """Get the tokenizer vocabulary as a list of string tokens."""
        return self.vocabulary[:self.vocabulary_size]

with something similar done in all the other functions.

Question: Can you give me some suggestions on how I should move forward?

sahusiddharth avatar Aug 02 '23 07:08 sahusiddharth

@sahusiddharth you probably want approach #1, otherwise there's no way to use this argument to load a vocabulary file too large to fit in memory.

jbischof avatar Aug 03 '23 00:08 jbischof
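
For illustration, a minimal sketch of approach 1 under that constraint (a hypothetical fragment of `__init__`; assumes a `vocabulary_size` argument is in scope):

    import itertools

    import tensorflow as tf

    # Hypothetical sketch: stop reading once `vocabulary_size` lines
    # have been consumed, so an oversized vocabulary file never has to
    # be loaded fully into memory.
    if isinstance(vocabulary, str):
        lines = (line.rstrip() for line in tf.io.gfile.GFile(vocabulary))
        if vocabulary_size is not None:
            lines = itertools.islice(lines, vocabulary_size)
        self.vocabulary = list(lines)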

@jbischof

Context: the task is to add an argument named `vocabulary_size`, but there is already a method with the same name in `WordPieceTokenizer`:

    def vocabulary_size(self) -> int:
        """Get the size of the tokenizer vocabulary."""
        return len(self.vocabulary)

Can you suggest a name other than vocabulary_size?

sahusiddharth avatar Aug 17 '23 15:08 sahusiddharth
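
For illustration, one hypothetical way around the clash (not a decision from this thread) is to keep the public argument name, store it on a private attribute, and have the existing method honor it, which also satisfies the earlier note that `layer.vocabulary_size()` should match what was passed:

    # Hypothetical sketch: `self._vocabulary_size` stores the new
    # constructor argument; the public method keeps its existing name.
    def vocabulary_size(self) -> int:
        """Get the size of the tokenizer vocabulary."""
        if self._vocabulary_size is not None:
            return self._vocabulary_size
        return len(self.vocabulary)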

Hi, is this issue still open? I'd love to contribute to it.

man0045 avatar Nov 13 '23 13:11 man0045

Hi @man0045,

this issue is already closed.

sahusiddharth avatar Nov 13 '23 13:11 sahusiddharth