tfds.as_numpy(...) handling of RaggedTensors
Hello, I was recently bitten by tfds.as_numpy(...) handling of RaggedTensors,
Note that because TensorFlow has support for ragged tensors and NumPy has no equivalent representation,
[tf.RaggedTensors](https://www.tensorflow.org/api_docs/python/tf/RaggedTensor)
are left as-is for the user to deal with them (e.g. using to_list()).
In TF 1 (i.e. graph mode), tf.RaggedTensors are returned as tf.ragged.RaggedTensorValues.
however the ragged tensor documentation indicates
Ragged tensors can be converted to nested Python lists and NumPy arrays: [...] digits.numpy().
It seems then that _elem_to_numpy_eager(...) would need updating.
Yes, however, the ragged_tensor.numpy() remove information on the existing ragged tensor. Depending on your use-case might, it cam make it more complicated to use. Because you would now have to manually parse the nested list.
For example, accessing flat_values or row_limits is not possible anymore on ragged_tensor.numpy(), so keeping tf allow things like:
ragged_tensor.flat_values.numpy()
ragged_tensor.row_limits().numpy()
The proper solution would be to have some custom numpy ragged tensor support, but would require more important engineering efforts.
Another option would be to have an argument in tfds.as_numpy() that indicates how to treat RaggedTensors (with default to "return as-is").
Without it I resorted to implementing my own versions of:
def _eager_dataset_iterator(ds: tf.data.Dataset) -> t.Iterator[NumpyElem]
def _elem_to_numpy_eager(tf_el: TensorflowElem) -> t.Union[NumpyElem, t.Iterable[NumpyElem]]
def tfds_as_numpy(dataset: tf.data.Dataset) -> tf.data.Dataset
where _elem_to_numpy_eager includes
elif isinstance(tf_el, tf.RaggedTensor):
return tf_el.numpy()
As long as tfds.core.dataset_utils does not change much, this will work fine for my use case; but feels like slightly abusing the library.
Also note you can also use ds.as_numpy_iterator() which should directly returns tf.RaggedTensor as numpy.
as_numpy_iterator has some difference with tfds.as_numpy though. Like not supporting None, len(ds),... but it's been a while I haven't tried, so those might have been fixed.
Indeed that seems to fit better my use case. Then I suggest that tfds.as_numpy(...) docstring should mention something along the lines of "for eager mode only use cases, ds.as_numpy_iterator(...) might be a better fit." (specially for use cases with RaggedTensor)