Generating split is very slow when Image format is PNG

Open Tramac opened this issue 1 year ago • 1 comments

Describe the bug

When I create a dataset, it gets stuck while generating the cached data. This happens when the images are PNG; with JPEG images it does not get stuck.

After debugging, I found that the time is spent in the pa.array operation in arrow_writer, but I don't know why.

Steps to reproduce the bug

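from datasets import Dataset
from PIL import Image  # import missing from the original snippet

def generator(lines):
    for line in lines:
        # line["url"] is the local path of one image file
        img = Image.open(open(line["url"], "rb"))
        # print(img.format)  # "PNG"
        yield {
            "image": img,
        }

# dataset_path: path to the file that lists the images (not defined in this snippet)
lines = open(dataset_path, "r")
dataset = Dataset.from_generator(
    generator,
    gen_kwargs={"lines": lines}
)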

Expected behavior

Generating split done.

Environment info

datasets 2.13.0

Tramac avatar Apr 03 '24 07:04 Tramac

I think this comes down to how much slower Pillow is at reading a PNG image than a JPG image. The format matters for TIFF as well: in my case TIFF is even faster than JPG.
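
A rough way to check this is to time Pillow on the same pixels in each format. The sketch below is only an illustration: the `time_format` helper and the random 512x512 test image are made up for this example and are not part of `datasets`. It measures both encoding and decoding, since either could contribute to the slowdown.

```python
import io
import os
import time

from PIL import Image


def time_format(img, fmt, repeats=5):
    """Return (encode_seconds, decode_seconds), averaged over `repeats` runs."""
    encode_total = 0.0
    decode_total = 0.0
    for _ in range(repeats):
        buf = io.BytesIO()
        start = time.perf_counter()
        img.save(buf, format=fmt)  # encode the image into memory
        encode_total += time.perf_counter() - start

        buf.seek(0)
        start = time.perf_counter()
        with Image.open(buf) as reopened:
            reopened.load()  # Image.open is lazy; load() forces the decode
        decode_total += time.perf_counter() - start
    return encode_total / repeats, decode_total / repeats


# Hypothetical stand-in for a real sample: random RGB pixels.
# Real photos compress differently, so absolute numbers will vary.
size = (512, 512)
sample = Image.frombytes("RGB", size, os.urandom(size[0] * size[1] * 3))

timings = {fmt: time_format(sample, fmt) for fmt in ("PNG", "JPEG", "TIFF")}
for fmt, (enc, dec) in timings.items():
    print(f"{fmt}: encode {enc * 1000:.1f} ms, decode {dec * 1000:.1f} ms")
```

Swapping `sample` for one of the actual dataset images (for example `Image.open(path).convert("RGB")`) should give numbers closer to what the generator sees.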

Modexus avatar Apr 10 '24 17:04 Modexus