Generating split is very slow when Image format is PNG
Describe the bug
When I create a dataset, it gets stuck while generating the cached data. This happens when the images are PNG; it does not get stuck when they are JPEG.
After debugging, I found that the time is spent in the pa.array operation in arrow_writer, but I don't know why.
Steps to reproduce the bug
from datasets import Dataset
from PIL import Image

def generator(lines):
    for line in lines:
        img = Image.open(open(line["url"], "rb"))
        # print(img.format)  # "PNG"
        yield {
            "image": img,
        }

lines = open(dataset_path, "r")
dataset = Dataset.from_generator(
    generator,
    gen_kwargs={"lines": lines}
)
Expected behavior
Generating split done.
Environment info
datasets 2.13.0
I think this is because Pillow reads a PNG image more slowly than a JPG image.
Notably, the same is true with TIFF; it is even faster than JPG in my case.
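One way to check the hypothesis in the comments above is to time Pillow on its own, outside of datasets. The sketch below is only an illustration: it uses a synthetic gradient image rather than real data, the helper names `make_test_image` and `time_roundtrip` are made up for this example, and it measures both encoding to an in-memory buffer and decoding from it, since it is not established in this thread which of the two dominates. Timings depend heavily on image content and size, so treat the numbers as a rough comparison.

```python
import io
import time

from PIL import Image


def make_test_image(size=(256, 256)):
    """Build a synthetic RGB gradient image (a stand-in for real data)."""
    width, height = size
    pixels = bytearray()
    for y in range(height):
        for x in range(width):
            pixels += bytes((x % 256, y % 256, (x + y) % 256))
    return Image.frombytes("RGB", size, bytes(pixels))


def time_roundtrip(img, fmt, repeats=5):
    """Return (encode_seconds, decode_seconds, n_bytes), averaged over `repeats`."""
    encode_total = 0.0
    decode_total = 0.0
    payload = b""
    for _ in range(repeats):
        # Encode the image into an in-memory buffer in the given format.
        buffer = io.BytesIO()
        start = time.perf_counter()
        img.save(buffer, format=fmt)
        encode_total += time.perf_counter() - start
        payload = buffer.getvalue()

        # Decode it again. Image.open is lazy, so load() forces the full decode.
        start = time.perf_counter()
        with Image.open(io.BytesIO(payload)) as decoded:
            decoded.load()
        decode_total += time.perf_counter() - start
    return encode_total / repeats, decode_total / repeats, len(payload)


if __name__ == "__main__":
    image = make_test_image()
    for fmt in ("PNG", "JPEG", "TIFF"):
        enc, dec, n_bytes = time_roundtrip(image, fmt)
        print(f"{fmt:5s} encode={enc * 1e3:8.2f} ms  decode={dec * 1e3:8.2f} ms  bytes={n_bytes}")
```

If PNG stands out here as well, the slowdown is probably in Pillow's PNG handling rather than in Arrow itself. Running the same loop on a few of the actual files from the dataset would give a more representative picture than the synthetic image.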