Attempting to return a rank 3 grayscale image from dataset.map results in extreme slowdown
Describe the bug
Background: Digital images are often represented as a (Height, Width, Channel) tensor. The same holds for Hugging Face datasets that contain images. These images are loaded as Pillow image objects, which offer, for example, the .convert method.
I can convert an image from a (H,W,3) shape to a grayscale (H,W) image without any problems. But when I attempt to return a (H,W,1) shaped array from a map function, the map call never completes and sometimes even ends with the OS killing the process for running out of memory (OOM).
I've used various methods to expand a (H,W) shaped array to a (H,W,1) array. But they all resulted in extremely long map operations consuming a lot of CPU and RAM.
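For reference, here is a minimal NumPy-only sketch of the kind of (H,W) to (H,W,1) expansions meant above. The 4x6 synthetic array is a hypothetical stand-in for a real grayscale image, not data from the dataset, and the variable names are only for illustration. Outside of map, all three variants are cheap and give the same result.

```python
import numpy as np

# Hypothetical stand-in for a grayscale image of shape (H, W); not taken from the dataset.
height, width = 4, 6
gray = np.arange(height * width, dtype=np.uint8).reshape(height, width)

# Three equivalent ways to append a trailing channel axis.
expanded_a = np.expand_dims(gray, axis=-1)    # explicit helper
expanded_b = gray[..., np.newaxis]            # indexing with a new axis
expanded_c = gray.reshape(height, width, 1)   # explicit reshape

for arr in (expanded_a, expanded_b, expanded_c):
    assert arr.shape == (height, width, 1)
```

So the expansion itself does not appear to be the expensive part; the slowdown only shows up once the (H,W,1) array is returned from the map function.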
Steps to reproduce the bug
Below is a minimal example using two methods to get the desired output. Neither of them works.
import tensorflow as tf
import datasets
import numpy as np
ds = datasets.load_dataset("project-sloth/captcha-images")
to_gray_pillow = lambda sample: {'image': np.expand_dims(sample['image'].convert("L"), axis=-1)}
ds_gray = ds.map(to_gray_pillow)
# Alternatively
ds = datasets.load_dataset("project-sloth/captcha-images").with_format("tensorflow")
# tf.image.rgb_to_grayscale already returns shape (H, W, 1), so no extra expand_dims is needed
to_gray_tf = lambda sample: {'image': tf.image.rgb_to_grayscale(sample['image'])}
ds_gray = ds.map(to_gray_tf)
Expected behavior
I expect the map operation to complete and return a new dataset containing grayscale images in a (H,W,1) shape.
Environment info
datasets: 2.21.0
Python: tested with both 3.11 and 3.12
Host OS: Linux