ByteTokenizer doesn't accept floats in parameters
Describe the bug
ByteTokenizer only accepts integers for some parameters, which is fine. However, it doesn't fail until runtime, which makes it look like the input data is the problem rather than the parameters the class was constructed with. One could also argue that 5.0 is the same as 5, so the class should convert such floats to integers (a fairly common pattern in Python).
Source here: https://github.com/keras-team/keras-nlp/blob/v0.3.0/keras_nlp/tokenizers/byte_tokenizer.py#L24
To Reproduce
https://colab.research.google.com/drive/1NwfHBopBR9uyzvreTB4_l3gFG-iyLX0B?usp=sharing
import tensorflow as tf
import keras_nlp

# A float sequence_length is accepted without complaint at construction time.
tokenizer = keras_nlp.tokenizers.ByteTokenizer(sequence_length=5.0)

ds = tf.data.Dataset.from_tensor_slices(["hello", "fun", "helloer"])
ds = ds.map(tokenizer)
ds = ds.apply(tf.data.experimental.dense_to_ragged_batch(3))

# The crash only happens down here at runtime, not at construction.
ds.take(1).get_single_element()
This crashes with an error log that talks about the inputs and a float value, and it isn't obvious where that float came from.
Expected behavior
Either
- Failure when creating the tokenizer with illegal types
- Casting floats to ints when they are numerically equal, meaning
if int(sequence_length) == sequence_length:
    sequence_length = int(sequence_length)
or something similar.
Additional context
Would you like to help us fix it?
I guess something like this in the __init__?
# Check integers.
if not isinstance(sequence_length, (int, float)):
    raise ValueError(
        "`sequence_length` must be an integer. "
        f"Received: sequence_length={sequence_length}"
    )
if int(sequence_length) == sequence_length:
    sequence_length = int(sequence_length)
else:
    raise ValueError(
        "`sequence_length` must be an integer. "
        f"Received: sequence_length={sequence_length}"
    )
Repeat for replacement_char.
You probably also want them to be greater than zero? Well, zero might be a legitimate value for replacement_char. Isn't this the kind of thing Pydantic was created for? But I get why you wouldn't want to depend on an external library for something as straightforward as this. Maybe something like the helper sketched below.
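A rough sketch of what I mean, factored into a helper that covers both parameters (the name _check_int and the min_value bounds are just illustrative, nothing that exists in keras-nlp; the real layer also defaults sequence_length to None, which an actual check would have to allow):

def _check_int(value, name, min_value=0):
    # Accept ints, and floats that carry no fractional part.
    if not isinstance(value, (int, float)) or int(value) != value:
        raise ValueError(
            f"`{name}` must be an integer. Received: {name}={value}"
        )
    value = int(value)
    if value < min_value:
        raise ValueError(
            f"`{name}` must be >= {min_value}. Received: {name}={value}"
        )
    return value

# In __init__, something like:
# sequence_length = _check_int(sequence_length, "sequence_length", min_value=1)
# replacement_char = _check_int(replacement_char, "replacement_char", min_value=0)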
Thanks! I can bring this up with the team.
In terms of desired behavior, I would be inclined not to allow floats, because we probably don't want keras_nlp.tokenizers.ByteTokenizer(sequence_length=5.5) to actually work.
In terms of how much type checking we want to do, I am less sure. There are two conflicting needs here: we want our layer code to be as browsable as possible, and a lot of type validation could interfere with that, but we also want to give good user feedback on mistakes we think will come up frequently. For now we are just documenting the expected type in the docstring. Will get back to you on this!
Lastly, re Pydantic: Keras as a whole will not be attempting to use type annotations everywhere (though we may do it in some places for simple primitive types), so I'm not sure it's the perfect fit for us.
If you look at my suggestion, it also fails for sequence_length=5.5, since int(5.5) isn't equal to 5.5.
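Concretely, the comparison the check relies on:

int(5.0) == 5.0  # True  -> 5.0 is cast to 5
int(5.5) == 5.5  # False -> a ValueError is raised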
I think it makes sense. I got the sequence length by computing the third quartile of the sequence lengths in my data, which I think is a reasonable way to choose it, but it is also fair that the function computing it returns a float. As long as you check that int(sequence_length) == sequence_length, I don't see a problem with casting it to an integer, and it would reduce friction for users.
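For context, roughly how that float arises (a sketch assuming numpy; not my exact code, and the variable names are illustrative):

import numpy as np

texts = ["hello", "fun", "helloer"]
lengths = [len(t.encode("utf-8")) for t in texts]  # byte length per example

# Third quartile (75th percentile); np.percentile returns a float.
sequence_length = np.percentile(lengths, 75)  # 6.0 here

# ByteTokenizer(sequence_length=sequence_length) currently accepts this float
# and only fails later at runtime; with the proposed check, 6.0 would be cast to 6.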