# ArrowInvalid: Could not convert <PIL.Image.Image image mode=RGB when adding image to Dataset
## Describe the bug
When adding a Pillow image to an existing Dataset on the Hub, `add_item` fails because the Pillow image is not automatically converted into the `Image` feature.
## Steps to reproduce the bug
```python
from datasets import load_dataset
from PIL import Image

dataset = load_dataset("hf-internal-testing/example-documents")

# load any random Pillow image
image = Image.open("/content/cord_example.png").convert("RGB")
new_image = {'image': image}

dataset['test'] = dataset['test'].add_item(new_image)
```
## Expected results
The image should be automatically cast to the `Image` feature when using `add_item`. For now, this can be fixed by using `encode_example`:
```python
import datasets

feature = datasets.Image(decode=False)
new_image = {'image': feature.encode_example(image)}
dataset['test'] = dataset['test'].add_item(new_image)
```
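For context on what `encode_example` does here: it converts the PIL image into a plain dict that Arrow can store. A minimal sketch of this (using a synthetic in-memory image rather than a real file):

```python
import datasets
from PIL import Image

# synthetic stand-in for a real image
image = Image.new("RGB", (64, 64))

feature = datasets.Image(decode=False)
encoded = feature.encode_example(image)

# the encoded value is a plain dict holding the serialized image bytes,
# which Arrow can store, unlike a raw PIL.Image.Image object
print(sorted(encoded.keys()))  # ['bytes', 'path']
```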
## Actual results
```
ArrowInvalid: Could not convert <PIL.Image.Image image mode=RGB size=576x864 at 0x7F7CCC4589D0> with type Image: did not recognize Python value type when inferring an Arrow data type
```
@mariosasko I'm getting a similar issue when creating a Dataset from a Pandas dataframe, like so:
```python
from datasets import Dataset, Features, Image, Value
import pandas as pd
import requests
import PIL

# we need to define the features ourselves
features = Features({
    'a': Value(dtype='int32'),
    'b': Image(),
})

url = "http://images.cocodataset.org/val2017/000000039769.jpg"
image = PIL.Image.open(requests.get(url, stream=True).raw)

df = pd.DataFrame({"a": [1, 2],
                   "b": [image, image]})

dataset = Dataset.from_pandas(df, features=features)
```
results in
```
ArrowInvalid: ('Could not convert <PIL.JpegImagePlugin.JpegImageFile image mode=RGB size=640x480 at 0x7F7991A15C10> with type JpegImageFile: did not recognize Python value type when inferring an Arrow data type', 'Conversion failed for column b with type object')
```
Will the PR linked above also fix that?
I would expect this to work, but it doesn't. Shouldn't be too hard to fix tho (in a subsequent PR).
Hi @mariosasko, just wanted to check in: is there a PR to follow for this? I was looking to create a demo app using this. If it's not working, I can just use byte-encoded images in the dataset, which are not displayed.
Hi @darraghdog! No PR yet, but I plan to fix this before the next release.
I was just pointed here by @mariosasko; meanwhile I found a workaround using `encode_example`, like so:
```python
from datasets import load_from_disk, Dataset

DATASET_PATH = "/hf/m4-master/data/cm4/cm4-10000-v0.1"
ds1 = load_from_disk(DATASET_PATH)
ds2 = Dataset.from_dict(mapping={k: [] for k in ds1[99].keys()},
                        features=ds1.features)
for i in range(2):
    # could add several representative items here
    row = ds1[99]
    row_encoded = ds2.features.encode_example(row)
    ds2 = ds2.add_item(row_encoded)
```
Hmm, interesting. If I create the dataset on the fly:
```python
from datasets import load_from_disk, Dataset

DATASET_PATH = "/hf/m4-master/data/cm4/cm4-10000-v0.1"
ds1 = load_from_disk(DATASET_PATH)
ds2 = Dataset.from_dict(mapping={k: [v]*2 for k, v in ds1[99].items()},
                        features=ds1.features)
```
it doesn't fail with the error in the OP, as `from_dict` performs `encode_batch`.
However, if I try to use this dataset, it now fails with:
```
Traceback (most recent call last):
  File "/home/stas/anaconda3/envs/py38-pt112/lib/python3.8/site-packages/multiprocess/pool.py", line 125, in worker
    result = (True, func(*args, **kwds))
  File "/home/stas/anaconda3/envs/py38-pt112/lib/python3.8/site-packages/datasets/arrow_dataset.py", line 557, in wrapper
    out: Union["Dataset", "DatasetDict"] = func(self, *args, **kwargs)
  File "/home/stas/anaconda3/envs/py38-pt112/lib/python3.8/site-packages/datasets/arrow_dataset.py", line 524, in wrapper
    out: Union["Dataset", "DatasetDict"] = func(self, *args, **kwargs)
  File "/home/stas/anaconda3/envs/py38-pt112/lib/python3.8/site-packages/datasets/fingerprint.py", line 480, in wrapper
    out = func(self, *args, **kwargs)
  File "/home/stas/anaconda3/envs/py38-pt112/lib/python3.8/site-packages/datasets/arrow_dataset.py", line 2775, in _map_single
    batch = apply_function_on_filtered_inputs(
  File "/home/stas/anaconda3/envs/py38-pt112/lib/python3.8/site-packages/datasets/arrow_dataset.py", line 2655, in apply_function_on_filtered_inputs
    processed_inputs = function(*fn_args, *additional_args, **fn_kwargs)
  File "/home/stas/anaconda3/envs/py38-pt112/lib/python3.8/site-packages/datasets/arrow_dataset.py", line 2347, in decorated
    result = f(decorated_item, *args, **kwargs)
  File "debug_leak2.py", line 235, in split_pack_and_pad
    images.append(image_transform(image.convert("RGB")))
AttributeError: 'dict' object has no attribute 'convert'
"""

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "debug_leak2.py", line 418, in <module>
    train_loader, val_loader = get_dataloaders()
  File "debug_leak2.py", line 348, in get_dataloaders
    dataset = dataset.map(mapper, batch_size=32, batched=True, remove_columns=dataset.column_names, num_proc=4)
  File "/home/stas/anaconda3/envs/py38-pt112/lib/python3.8/site-packages/datasets/arrow_dataset.py", line 2500, in map
    transformed_shards[index] = async_result.get()
  File "/home/stas/anaconda3/envs/py38-pt112/lib/python3.8/site-packages/multiprocess/pool.py", line 771, in get
    raise self._value
AttributeError: 'dict' object has no attribute 'convert'
```
but if I create that same dataset one item at a time, as in the previous comment's code snippet, it doesn't fail.
The features of this dataset are set to:
```python
{'texts': Sequence(feature=Value(dtype='string', id=None), length=-1, id=None),
 'images': Sequence(feature=Image(decode=True, id=None), length=-1, id=None)}
```
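A plausible reading of that traceback (my interpretation, not confirmed above): the `images` column is coming back in its encoded `{'bytes': ..., 'path': ...}` dict form instead of being decoded to PIL images, so `image.convert("RGB")` blows up. Until the bug is fixed, a defensive helper can accept both shapes; `image_transform` below is just a stand-in for whatever transform the original script applies:

```python
import io

from PIL import Image


def to_pil(image):
    """Accept a decoded PIL image or the encoded
    {'bytes': ..., 'path': ...} dict; return a PIL image."""
    if isinstance(image, Image.Image):
        return image
    if isinstance(image, dict):
        if image.get("bytes") is not None:
            return Image.open(io.BytesIO(image["bytes"]))
        if image.get("path") is not None:
            return Image.open(image["path"])
    raise TypeError(f"unsupported image value: {type(image)}")


# in the mapper, instead of calling image.convert("RGB") directly:
# images.append(image_transform(to_pil(image).convert("RGB")))
```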
It looks like the problem still exists. Any news? Any good workaround?
Thank you
There is a workaround: create a loader Python script and upload the dataset to the Hugging Face Hub.
Here is an example of how to do that:
https://huggingface.co/datasets/jamescalam/image-text-demo/tree/main
and here are videos with explanations:
https://www.youtube.com/watch?v=lqK4ocAKveE and https://www.youtube.com/watch?v=ODdKC30dT8c
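The linked repo has the full example; as a rough orientation, a stripped-down loading script might look like the sketch below. The `metadata.csv` with `file_name` and `text` columns is my own assumption for illustration, not something taken from that repo:

```python
import csv

import datasets


class ImageTextDemo(datasets.GeneratorBasedBuilder):
    """Stripped-down sketch of an image/text loading script."""

    def _info(self):
        return datasets.DatasetInfo(
            features=datasets.Features(
                {"image": datasets.Image(), "text": datasets.Value("string")}
            )
        )

    def _split_generators(self, dl_manager):
        # assumes a metadata.csv shipped alongside the script and images
        return [
            datasets.SplitGenerator(
                name=datasets.Split.TRAIN,
                gen_kwargs={"metadata_path": "metadata.csv"},
            )
        ]

    def _generate_examples(self, metadata_path):
        with open(metadata_path, encoding="utf-8") as f:
            for idx, row in enumerate(csv.DictReader(f)):
                # yielding a file path lets the Image feature read
                # and encode the bytes itself
                yield idx, {"image": row["file_name"], "text": row["text"]}
```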
cc @mariosasko gentle ping for a fix :)
Any update on this? I'm still facing this issue. Any workaround?
I was facing the same issue. Downgrading datasets from 2.11.0 to 2.4.0 solved it.
> Any update on this? I'm still facing this issue. Any workaround?
I was able to resolve my issue with a quick workaround:
```python
from collections import defaultdict

from datasets import Dataset
from tqdm import tqdm

# `dataloader` is assumed to be defined elsewhere and to yield PIL images
data = defaultdict(list)
for idx in tqdm(range(len(dataloader)), desc="Captioning..."):
    img = dataloader[idx]
    data['image'].append(img)
    data['text'].append(f"img_{idx}")

dataset = Dataset.from_dict(data)
dataset = dataset.filter(lambda example: example['image'] is not None)
dataset = dataset.filter(lambda example: example['text'] is not None)
dataset.push_to_hub('path-to-repo', private=False)
```
Hope it helps! Happy coding
> I was able to resolve my issue with a quick workaround: […]
It works!!
How did this work? How do I use this script, and where do I paste it?
I had a similar issue to @NielsRogge where I was unable to create a dataset from a Pandas DataFrame containing PIL.Images.
I found another workaround that works in this case, which involves converting the DataFrame to a Python dictionary and then creating a dataset from that dictionary.
This is a generic example of my workaround. It assumes that you have your data in a Pandas DataFrame variable called "dataframe", plus a dictionary of your data's features in a variable called "features".
```python
import datasets

dictionary = dataframe.to_dict(orient='list')
dataset = datasets.Dataset.from_dict(dictionary, features=features)
```
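To make that concrete, here is a self-contained version applied to the failing `from_pandas` example from earlier in the thread; it relies on the point made above that `from_dict` runs the values through `encode_batch`:

```python
import pandas as pd
import PIL.Image
import requests

from datasets import Dataset, Features, Image, Value

features = Features({'a': Value(dtype='int32'), 'b': Image()})

url = "http://images.cocodataset.org/val2017/000000039769.jpg"
image = PIL.Image.open(requests.get(url, stream=True).raw)
df = pd.DataFrame({"a": [1, 2], "b": [image, image]})

# routing the PIL images through from_dict encodes them via the
# declared features, which from_pandas does not do for object columns
dataset = Dataset.from_dict(df.to_dict(orient='list'), features=features)
```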
cc @mariosasko this issue has been open for 2 years, would be great to resolve it :)
I have the same issue; my current workaround is saving the dataframe to a CSV and then loading the dataset from the CSV. Would also appreciate a fix :)
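For anyone copying the CSV route: a CSV can't hold the PIL objects themselves, so I'm assuming the dataframe stores image file paths. A minimal sketch under that assumption (file and column names are illustrative):

```python
import pandas as pd

from datasets import Image, load_dataset

# assumes the dataframe holds image *paths*, not PIL objects
df = pd.DataFrame({"a": [1, 2], "b": ["img1.png", "img2.png"]})
df.to_csv("data.csv", index=False)

dataset = load_dataset("csv", data_files="data.csv", split="train")
# cast the string path column to an Image feature so that rows
# decode the files to PIL images on access
dataset = dataset.cast_column("b", Image())
```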