
Add a vocabulary_size argument to WordPieceTokenizer

Open mattdangerw opened this issue 2 years ago • 18 comments

We should add a `vocabulary_size` argument to the `WordPieceTokenizer` layer that forces the vocabulary size by truncating the passed-in vocabulary if necessary.

Potential docstring:

        vocabulary_size: Force the vocabulary to be exactly `vocabulary_size`,
            by truncating the input vocabulary if necessary. This is not
            equivalent to retraining a word piece vocabulary from scratch, but
            can be useful for quick hyperparameter tuning.
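
For example, usage could look something like this (a hypothetical sketch of the proposed argument; `vocab.txt` is a placeholder path):

    import keras_nlp

    # Hypothetical: `vocabulary_size` is the proposed new argument.
    # Only the first 10,000 entries of vocab.txt would be kept.
    tokenizer = keras_nlp.tokenizers.WordPieceTokenizer(
        vocabulary="vocab.txt",
        vocabulary_size=10_000,
    )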

mattdangerw avatar Apr 30 '22 00:04 mattdangerw

Some other notes:

  • If the `vocabulary_size` argument is passed, calling `layer.vocabulary_size()` should always match what was passed.
  • If the vocabulary file is shorter than the forced vocabulary size, we can log a warning: `Warning: Setting vocab size to a larger value than the input vocabulary file. Some token ids will never be output from the tokenizer.`
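
A minimal sketch of the truncate-and-warn behavior described above (a hypothetical helper, not actual keras-nlp code):

    import warnings

    def _truncate_vocabulary(vocabulary, vocabulary_size):
        # Hypothetical helper illustrating the notes above.
        if vocabulary_size is None:
            return vocabulary
        if vocabulary_size > len(vocabulary):
            warnings.warn(
                "Setting vocab size to a larger value than the input "
                "vocabulary file. Some token ids will never be output "
                "from the tokenizer."
            )
            return vocabulary
        # Force the vocabulary to exactly `vocabulary_size` entries.
        return vocabulary[:vocabulary_size]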

mattdangerw avatar Apr 30 '22 00:04 mattdangerw

Hi @mattdangerw, would love to work on this.

blackhat-coder avatar May 03 '22 10:05 blackhat-coder

Thank you!

mattdangerw avatar May 03 '22 17:05 mattdangerw

@blackhat-coder Any updates on this? This would actually be a useful hyperparameter to tune in our first guide and could help reduce training time.

mattdangerw avatar May 18 '22 23:05 mattdangerw

I'm sorry for the delay; I've been busy with schoolwork. I'll send in a PR ASAP.

blackhat-coder avatar May 19 '22 10:05 blackhat-coder

Thank you! Let me know if there are any questions I can help with.

mattdangerw avatar May 19 '22 23:05 mattdangerw

Hi @mattdangerw, how do I run the tests? I've currently written test cases.

blackhat-coder avatar May 21 '22 15:05 blackhat-coder

Check out the environment and test running sections of our contributing guide.

https://github.com/keras-team/keras-nlp/blob/master/CONTRIBUTING.md#setting-up-an-environment

If something is broken or unclear there, let us know!

mattdangerw avatar May 24 '22 03:05 mattdangerw

Hey @blackhat-coder, are you still working on this?

aflah02 avatar Sep 05 '22 06:09 aflah02

Hey @mattdangerw, can I contribute to this?

lordgavy01 avatar Dec 28 '22 11:12 lordgavy01

Hey, I would like to take this.

jayam30 avatar Feb 23 '23 15:02 jayam30

I would like to contribute to this

atharvapurdue avatar Feb 24 '23 20:02 atharvapurdue

@mattdangerw, I would like to contribute to this.

sahusiddharth avatar Jul 29 '23 17:07 sahusiddharth

@jbischof

Context: inside the constructor (`__init__`) of `WordPieceTokenizer` we can assign the vocabulary in one of two ways:

1. Make changes in the code below, reading from the file only up to the vocabulary size:

    if isinstance(vocabulary, str):
        self.vocabulary = [
            line.rstrip() for line in tf.io.gfile.GFile(vocabulary)
        ]
    elif isinstance(vocabulary, Iterable):
        # Make a copy.
        self.vocabulary = list(vocabulary)
    else:
        raise ValueError(
            "Vocabulary must be a file path or list of terms. "
            f"Received: vocabulary={vocabulary}"
        )
2. Let it read the full vocabulary here and instead make changes in all the functions, passing only a slice of the vocabulary list, something like below:
    def get_vocabulary(self) -> List[str]:
        """Get the tokenizer vocabulary as a list of string tokens."""
        return self.vocabulary[:self.vocabulary_size]

with something similar done in all the other functions.

Question: Can you give me some suggestions on how I should move forward?

sahusiddharth avatar Aug 02 '23 07:08 sahusiddharth

@sahusiddharth you probably want approach #1, otherwise there's no way to use this argument to load a vocabulary file too large to fit in memory.

jbischof avatar Aug 03 '23 00:08 jbischof
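
For illustration, a minimal sketch of approach 1 under that constraint (a hypothetical fragment of `__init__`; assumes a `vocabulary_size` argument is in scope):

    import itertools

    import tensorflow as tf

    # Hypothetical sketch: stop reading once `vocabulary_size` lines
    # have been consumed, so an oversized vocabulary file never has to
    # be loaded fully into memory.
    if isinstance(vocabulary, str):
        lines = (line.rstrip() for line in tf.io.gfile.GFile(vocabulary))
        if vocabulary_size is not None:
            lines = itertools.islice(lines, vocabulary_size)
        self.vocabulary = list(lines)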

@jbischof

Context: the task is to add an argument named `vocabulary_size`, but there is already a method with the same name in `WordPieceTokenizer`:

    def vocabulary_size(self) -> int:
        """Get the size of the tokenizer vocabulary."""
        return len(self.vocabulary)

Can you suggest a name other than vocabulary_size?

sahusiddharth avatar Aug 17 '23 15:08 sahusiddharth
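
For illustration, one hypothetical way around the clash (not a decision from this thread) is to keep the public argument name, store it on a private attribute, and have the existing method honor it, which also satisfies the earlier note that `layer.vocabulary_size()` should match what was passed:

    # Hypothetical sketch: `self._vocabulary_size` stores the new
    # constructor argument; the public method keeps its existing name.
    def vocabulary_size(self) -> int:
        """Get the size of the tokenizer vocabulary."""
        if self._vocabulary_size is not None:
            return self._vocabulary_size
        return len(self.vocabulary)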

Hi, is this issue still open? I'd love to contribute to it.

man0045 avatar Nov 13 '23 13:11 man0045

Hi @man0045,

this issue is already closed.

sahusiddharth avatar Nov 13 '23 13:11 sahusiddharth