tfds.as_numpy(...) handling of RaggedTensors

Open rodrigob opened this issue 3 years ago • 4 comments

Hello, I was recently bitten by how tfds.as_numpy(...) handles RaggedTensors. Its documentation states:

Note that because TensorFlow has support for ragged tensors and NumPy has no equivalent representation,
 [tf.RaggedTensors](https://www.tensorflow.org/api_docs/python/tf/RaggedTensor)
 are left as-is for the user to deal with them (e.g. using to_list()).
In TF 1 (i.e. graph mode), tf.RaggedTensors are returned as tf.ragged.RaggedTensorValues.

However, the ragged tensor documentation indicates that "Ragged tensors can be converted to nested Python lists and NumPy arrays: [...] digits.numpy()".

It seems then that _elem_to_numpy_eager(...) would need updating.

rodrigob avatar May 02 '22 09:05 rodrigob

Yes. However, ragged_tensor.numpy() discards information carried by the original ragged tensor. Depending on your use case, it can make things more complicated, because you would then have to parse the nested list manually.

For example, flat_values and row_limits are no longer accessible on the result of ragged_tensor.numpy(), so keeping the tf.RaggedTensor allows things like:

ragged_tensor.flat_values.numpy()
ragged_tensor.row_limits().numpy()
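
To make the trade-off concrete, here is a minimal sketch in eager mode. The example values are made up for illustration; only `tf.ragged.constant`, `flat_values`, `row_limits()`, `numpy()` and `to_list()` are used.

```python
import tensorflow as tf

# A small ragged tensor with rows of length 3, 0 and 2 (illustrative values).
ragged_tensor = tf.ragged.constant([[3, 1, 4], [], [5, 9]])

# Keeping the tf.RaggedTensor gives direct access to the flat layout.
flat = ragged_tensor.flat_values.numpy()      # all values, concatenated
limits = ragged_tensor.row_limits().numpy()   # end offset of each row

# Converting the whole tensor yields a nested structure instead;
# the row partition is then only implicit in the nesting.
as_numpy = ragged_tensor.numpy()
as_list = ragged_tensor.to_list()
```

After the conversion, recovering row offsets means walking the nested rows again, which is the extra parsing mentioned above.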

The proper solution would be some custom NumPy ragged tensor support, but that would require a more substantial engineering effort.
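
As a rough idea of what such support could look like, here is a minimal NumPy-only sketch for the one-ragged-dimension case. The class `NumpyRagged` and all of its methods are hypothetical and not part of tfds or TensorFlow; a real design would also need nested ragged dimensions, dtype handling and validation.

```python
from dataclasses import dataclass

import numpy as np


@dataclass(frozen=True, eq=False)
class NumpyRagged:
    """Hypothetical container: flat values plus row boundaries."""

    flat_values: np.ndarray  # all values, concatenated row after row
    row_splits: np.ndarray   # row i spans row_splits[i]:row_splits[i + 1]

    @classmethod
    def from_nested(cls, rows, dtype=np.int64):
        # Build the row boundaries from the length of each row.
        lengths = [len(row) for row in rows]
        row_splits = np.concatenate([[0], np.cumsum(lengths)]).astype(np.int64)
        flat_values = np.array([v for row in rows for v in row], dtype=dtype)
        return cls(flat_values, row_splits)

    def __len__(self):
        return len(self.row_splits) - 1

    def row_limits(self):
        # End offset of each row, mirroring the tf.RaggedTensor method name.
        return self.row_splits[1:]

    def row(self, i):
        return self.flat_values[self.row_splits[i]:self.row_splits[i + 1]]

    def to_list(self):
        return [self.row(i).tolist() for i in range(len(self))]


ragged = NumpyRagged.from_nested([[3, 1, 4], [], [5, 9]])
```

Storing the values flat keeps them in a single contiguous array, while the row boundaries stay available without re-parsing a nested list.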

Conchylicultor avatar May 03 '22 08:05 Conchylicultor

Another option would be to have an argument in tfds.as_numpy() that indicates how to treat RaggedTensors (with default to "return as-is").

Without it I resorted to implementing my own versions of:

def _eager_dataset_iterator(ds: tf.data.Dataset) -> t.Iterator[NumpyElem]
def _elem_to_numpy_eager(tf_el: TensorflowElem) -> t.Union[NumpyElem, t.Iterable[NumpyElem]]
def tfds_as_numpy(dataset: tf.data.Dataset) -> tf.data.Dataset

where _elem_to_numpy_eager includes

elif isinstance(tf_el, tf.RaggedTensor):
    return tf_el.numpy() 

As long as tfds.core.dataset_utils does not change much, this will work fine for my use case, but it feels like slightly abusing the library.
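
A minimal sketch of what the suggested argument could look like, written as a standalone helper for eager mode. The function name `elem_to_numpy`, the `ragged` parameter and its accepted values are assumptions for illustration only; none of this is the tfds API.

```python
import tensorflow as tf


def elem_to_numpy(tf_el, ragged="as_is"):
    """Hypothetical per-element conversion with a RaggedTensor policy."""
    if isinstance(tf_el, tf.RaggedTensor):
        if ragged == "as_is":    # default: leave the RaggedTensor untouched
            return tf_el
        if ragged == "numpy":    # same as the custom elif branch above
            return tf_el.numpy()
        if ragged == "list":     # nested Python lists
            return tf_el.to_list()
        raise ValueError(f"Unknown ragged mode: {ragged!r}")
    if isinstance(tf_el, tf.Tensor):
        return tf_el.numpy()
    return tf_el  # anything else is passed through unchanged


rt = tf.ragged.constant([[1, 2], [3]])
unchanged = elem_to_numpy(rt)
nested = elem_to_numpy(rt, ragged="list")
converted = elem_to_numpy(rt, ragged="numpy")
```

Defaulting to "as_is" would keep existing callers unaffected while letting others opt in to the conversion.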

rodrigob avatar May 04 '22 09:05 rodrigob

Also note that you can use ds.as_numpy_iterator(), which should directly return tf.RaggedTensor as NumPy.

as_numpy_iterator has some differences from tfds.as_numpy though, such as not supporting None, len(ds), ... but it's been a while since I last tried, so those might have been fixed.

Conchylicultor avatar May 04 '22 11:05 Conchylicultor

Indeed, that seems to fit my use case better. I then suggest that the tfds.as_numpy(...) docstring mention something along the lines of "for eager-mode-only use cases, ds.as_numpy_iterator(...) might be a better fit" (especially for use cases with RaggedTensor).

rodrigob avatar May 04 '22 11:05 rodrigob