ByteTokenizer doesn't accept floats in parameters
Describe the bug
ByteTokenizer only accepts integers for some parameters, which is fine. However, it doesn't fail until runtime, which makes it look like the input data is the problem rather than the parameters the class was constructed with. One could also argue that 5.0 is the same as 5, so the class should convert such floats to integers (a fairly common pattern in Python).
Source here: https://github.com/keras-team/keras-nlp/blob/v0.3.0/keras_nlp/tokenizers/byte_tokenizer.py#L24
To Reproduce
https://colab.research.google.com/drive/1NwfHBopBR9uyzvreTB4_l3gFG-iyLX0B?usp=sharing
import tensorflow as tf
import keras_nlp

# A float sequence_length is accepted without complaint at construction time.
tokenizer = keras_nlp.tokenizers.ByteTokenizer(sequence_length=5.0)

ds = tf.data.Dataset.from_tensor_slices(["hello", "fun", "helloer"])
ds = ds.map(tokenizer)
ds = ds.apply(tf.data.experimental.dense_to_ragged_batch(3))

# The crash only happens down here at runtime, not at construction.
ds.take(1).get_single_element()
This crashes with an error log that talks about the inputs and a float value, and it isn't obvious where that float came from.
Expected behavior
Either
- Failure when creating the tokenizer with illegal types
- Casting floats to ints when they are numerically equal, meaning
if int(sequence_length) == sequence_length:
    sequence_length = int(sequence_length)
or something similar.
Additional context
Would you like to help us fix it?
I guess something like this in the __init__?
# Check integers.
if not isinstance(sequence_length, (int, float)):
    raise ValueError(
        "`sequence_length` must be an integer. "
        f"Received: sequence_length={sequence_length}"
    )
if int(sequence_length) == sequence_length:
    sequence_length = int(sequence_length)
else:
    raise ValueError(
        "`sequence_length` must be an integer. "
        f"Received: sequence_length={sequence_length}"
    )
Repeat for replacement_char.
You probably also want them to be greater than zero? Well, zero might be a legitimate value for replacement_char. Isn't this the kind of thing Pydantic was created for? But I get why you wouldn't want to depend on an external library for something as straightforward as this. Maybe something like the helper sketched below.
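A rough sketch of what I mean, factored into a helper that covers both parameters (the name _check_int and the min_value bounds are just illustrative, nothing that exists in keras-nlp; the real layer also defaults sequence_length to None, which an actual check would have to allow):

def _check_int(value, name, min_value=0):
    # Accept ints, and floats that carry no fractional part.
    if not isinstance(value, (int, float)) or int(value) != value:
        raise ValueError(
            f"`{name}` must be an integer. Received: {name}={value}"
        )
    value = int(value)
    if value < min_value:
        raise ValueError(
            f"`{name}` must be >= {min_value}. Received: {name}={value}"
        )
    return value

# In __init__, something like:
# sequence_length = _check_int(sequence_length, "sequence_length", min_value=1)
# replacement_char = _check_int(replacement_char, "replacement_char", min_value=0)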
Thanks! I can bring this up with the team.
In terms of desired behavior, I would be inclined not to allow floats, because we probably don't want keras_nlp.tokenizers.ByteTokenizer(sequence_length=5.5) to actually work.
In terms of how much type checking we want to do, I am less sure. There are two conflicting needs here: we want our layer code to be as browsable as possible, and a lot of type validation could interfere with that, but we also want to give good user feedback on mistakes we think will come up frequently. For now we are just documenting the expected type in the docstring. Will get back to you on this!
Lastly, re Pydantic: Keras as a whole will not be attempting to use type annotations everywhere (though we may do it in some places for simple primitive types), so I'm not sure it's the perfect fit for us.
If you look at my suggestion, it also fails for sequence_length=5.5, since int(5.5) isn't equal to 5.5.
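Concretely, the comparison the check relies on:

int(5.0) == 5.0  # True  -> 5.0 is cast to 5
int(5.5) == 5.5  # False -> a ValueError is raised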
I think it makes sense. I got the sequence length by computing the third quartile of the sequence lengths in my data, which I think is a reasonable way to choose it, but it is also fair that the function computing it returns a float. As long as you check that int(sequence_length) == sequence_length, I don't see a problem with casting it to an integer, and it would reduce friction for users.
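For context, roughly how that float arises (a sketch assuming numpy; not my exact code, and the variable names are illustrative):

import numpy as np

texts = ["hello", "fun", "helloer"]
lengths = [len(t.encode("utf-8")) for t in texts]  # byte length per example

# Third quartile (75th percentile); np.percentile returns a float.
sequence_length = np.percentile(lengths, 75)  # 6.0 here

# ByteTokenizer(sequence_length=sequence_length) currently accepts this float
# and only fails later at runtime; with the proposed check, 6.0 would be cast to 6.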