TensorFlow RaggedTensor Support (batch-level)
Feature request
Hi,
Currently, datasets does not support RaggedTensor output at the batch level. When building an object detection dataset (with TensorFlow), I need RaggedTensors, because that is how the Keras model expects bounding boxes and classes.
Currently, an error is thrown saying that "Nested Data is not supported".
It'd be very helpful if this was fixed! :)
Motivation
Enabling Object Detection pipelines for TensorFlow.
Your contribution
With guidance, I'd happily help make the PR.
The problematic part is the current implementation, which uses a DataCollator and then enforces np.array (at the end of np_get_batch in tf_utils.py). NumPy arrays cannot represent ragged (variable-length) data.
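To illustrate the NumPy limitation described above, here is a minimal sketch. The box arrays are made up, and `np.stack` is only a stand-in for "build one dense batch array". It is not a claim about the exact call used inside np_get_batch.

```python
import numpy as np

# Hypothetical per-image boxes: 2 boxes and 3 boxes, each (x1, y1, x2, y2).
boxes_a = np.zeros((2, 4), dtype=np.float32)
boxes_b = np.zeros((3, 4), dtype=np.float32)

# A dense batch needs every element to share one shape, so this fails.
try:
    batch = np.stack([boxes_a, boxes_b])
    stack_error = None
except ValueError as exc:
    stack_error = exc

print(type(stack_error).__name__, "-", stack_error)
```

Any batching step that ends in a single dense array runs into the same constraint whenever the number of boxes differs between images.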
Doesn't Keras support any inputs other than tf.data.Dataset objects? It's a bit painful to have to support and maintain this kind of integration.
Is there a way to use a datasets.Dataset with outputs formatted as tensors / ragged tensors instead? Like in https://huggingface.co/docs/datasets/use_with_tensorflow#dataset-format
I'll give it a try when I get the time, but I'm quite sure I already tested the with_format approach.
When using TF as the backend, Keras converts datasets into tf.data.Dataset, much like you do.
Hi @Lundez! Thanks for raising this. It's a very valid point, especially for object detection use cases.
You're right that np_get_batch currently enforces NumPy batching, which breaks RaggedTensor support because NumPy arrays cannot hold variable-length (nested) rows. Supporting this likely needs a redesign that allows TensorFlow-native batching for specific formats.
Before diving into a code change though, could you confirm:
Does an unbatched tf.data.Dataset (built from .with_format("tensorflow") via to_tf_dataset with batch_size=None) work if batching is deferred until just before model.fit()?
Have you tried something like:
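tf_dataset = dataset.with_format("tensorflow").to_tf_dataset(
    columns=["image", "labels"],
    label_cols=None,
    batch_size=None,  # no batching here
)
# Batch afterwards. Plain .batch() fails on variable-length elements;
# ragged_batch (recent TensorFlow releases) yields RaggedTensors instead.
model.fit(tf_dataset.ragged_batch(BATCH_SIZE))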
If this works, it might be worth updating the documentation rather than changing batching logic inside datasets itself.
That said, happy to explore changes if batching needs to be supported natively for RaggedTensor. Just flagging that it’d require some careful design due to existing numpy assumptions.
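For reference, deferred ragged batching can be sketched in isolation with plain tf.data, without going through datasets at all. Everything here is illustrative: the box values and the generator are made up, and whether to_tf_dataset yields elements of this shape for a given dataset is an assumption that would need checking. The fallback to `tf.data.experimental.dense_to_ragged_batch` is for older TensorFlow versions that lack `Dataset.ragged_batch`.

```python
import numpy as np
import tensorflow as tf

# Hypothetical per-image boxes: 2, 1 and 3 boxes, each (x1, y1, x2, y2).
boxes_per_image = [
    np.zeros((2, 4), dtype=np.float32),
    np.zeros((1, 4), dtype=np.float32),
    np.zeros((3, 4), dtype=np.float32),
]

def gen():
    # Yield one unbatched, variable-length element per image.
    for boxes in boxes_per_image:
        yield boxes

ds = tf.data.Dataset.from_generator(
    gen,
    output_signature=tf.TensorSpec(shape=(None, 4), dtype=tf.float32),
)

# Batch after the fact so that variable-length rows become a RaggedTensor.
if hasattr(ds, "ragged_batch"):
    batched = ds.ragged_batch(2)
else:
    batched = ds.apply(tf.data.experimental.dense_to_ragged_batch(2))

first_batch = next(iter(batched))
print(type(first_batch).__name__, first_batch.row_lengths().numpy())
```

If the model accepts ragged inputs, a dataset batched this way can be passed to model.fit() directly.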
Hi, we've had to move on for now.
We have actually also moved to dense tensors to make it possible to XLA-compile the training.
But I'll check when I'm back from vacation, which is far into the future.
Thanks
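For anyone landing here later, the dense-tensor route mentioned above can be sketched roughly as follows. This is not the code used in that project. `MAX_BOXES`, the `-1.0` padding value and the `pad_boxes` helper are all hypothetical choices made for illustration.

```python
import numpy as np

MAX_BOXES = 4  # assumed upper bound on boxes per image

def pad_boxes(boxes, max_boxes=MAX_BOXES, pad_value=-1.0):
    """Pad (or truncate) an (n, 4) box list to (max_boxes, 4) plus a validity mask."""
    boxes = np.asarray(boxes, dtype=np.float32).reshape(-1, 4)
    n = min(len(boxes), max_boxes)
    padded = np.full((max_boxes, 4), pad_value, dtype=np.float32)
    padded[:n] = boxes[:n]
    mask = np.arange(max_boxes) < n  # True for real boxes, False for padding
    return padded, mask

# Hypothetical per-image boxes: 2 boxes and 1 box.
per_image = [
    [[0.0, 0.0, 1.0, 1.0], [0.1, 0.1, 0.5, 0.5]],
    [[0.2, 0.2, 0.8, 0.8]],
]

padded, masks = zip(*(pad_boxes(b) for b in per_image))
batch_boxes = np.stack(padded)  # shape (2, MAX_BOXES, 4)
batch_mask = np.stack(masks)    # shape (2, MAX_BOXES)
print(batch_boxes.shape, batch_mask.sum(axis=1))
```

Fixed shapes like these are what make NumPy batching (and XLA compilation) straightforward, at the cost of carrying a mask through the loss.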