Error when saving to disk a dataset of images
Describe the bug
Hello!
I have an issue when I try to save my dataset of images to disk. The error I get is:
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
File "/home/jplu/miniconda3/envs/image-xp/lib/python3.10/site-packages/datasets/arrow_dataset.py", line 1442, in save_to_disk
for job_id, done, content in Dataset._save_to_disk_single(**kwargs):
File "/home/jplu/miniconda3/envs/image-xp/lib/python3.10/site-packages/datasets/arrow_dataset.py", line 1473, in _save_to_disk_single
writer.write_table(pa_table)
File "/home/jplu/miniconda3/envs/image-xp/lib/python3.10/site-packages/datasets/arrow_writer.py", line 570, in write_table
pa_table = embed_table_storage(pa_table)
File "/home/jplu/miniconda3/envs/image-xp/lib/python3.10/site-packages/datasets/table.py", line 2268, in embed_table_storage
arrays = [
File "/home/jplu/miniconda3/envs/image-xp/lib/python3.10/site-packages/datasets/table.py", line 2269, in <listcomp>
embed_array_storage(table[name], feature) if require_storage_embed(feature) else table[name]
File "/home/jplu/miniconda3/envs/image-xp/lib/python3.10/site-packages/datasets/table.py", line 1817, in wrapper
return pa.chunked_array([func(chunk, *args, **kwargs) for chunk in array.chunks])
File "/home/jplu/miniconda3/envs/image-xp/lib/python3.10/site-packages/datasets/table.py", line 1817, in <listcomp>
return pa.chunked_array([func(chunk, *args, **kwargs) for chunk in array.chunks])
File "/home/jplu/miniconda3/envs/image-xp/lib/python3.10/site-packages/datasets/table.py", line 2142, in embed_array_storage
return feature.embed_storage(array)
File "/home/jplu/miniconda3/envs/image-xp/lib/python3.10/site-packages/datasets/features/image.py", line 269, in embed_storage
storage = pa.StructArray.from_arrays([bytes_array, path_array], ["bytes", "path"], mask=bytes_array.is_null())
File "pyarrow/array.pxi", line 2766, in pyarrow.lib.StructArray.from_arrays
File "pyarrow/array.pxi", line 2961, in pyarrow.lib.c_mask_inverted_from_obj
TypeError: Mask must be a pyarrow.Array of type boolean
My dataset is around 50K images; could this error be due to a bad image?
Thanks for the help.
Steps to reproduce the bug
from datasets import load_dataset
dataset = load_dataset("imagefolder", data_dir="/path/to/dataset")
dataset["train"].save_to_disk("./myds", num_shards=40)
Expected behavior
Having my dataset properly saved to disk.
Environment info
- datasets version: 2.11.0
- Platform: Linux-5.15.90.1-microsoft-standard-WSL2-x86_64-with-glibc2.35
- Python version: 3.10.10
- Huggingface_hub version: 0.13.3
- PyArrow version: 11.0.0
- Pandas version: 2.0.0
Looks like as long as the number of shards makes a batch lower than 1000 images, it works. In my training set I have 40K images. If I use num_shards=40 (batches of 1000 images) I get the error, but if I update it to num_shards=50 (batches of 800 images) it works.
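For illustration, continuing from the repro snippet above, the only change is the shard count:
dataset["train"].save_to_disk("./myds", num_shards=50)  # ~800 images per shard: works
# dataset["train"].save_to_disk("./myds", num_shards=40)  # ~1000 images per shard: raises the TypeError above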
I will be happy to share my dataset privately if it can help to better debug.
Hi! I didn't manage to reproduce this behavior, so sharing the dataset with us would help a lot.
"My dataset is around 50K images; could this error be due to a bad image?"
This shouldn't be the case as we save raw data to disk without decoding it.
OK, thanks! The dataset is currently hosted on a gcs bucket. How would you like to proceed for sharing the link?
You could follow this procedure or upload the dataset to Google Drive (50K images is not that much unless high-res) and send me an email with the link.
Thanks @mariosasko. I just sent you the GDrive link.
Thanks @jplu! I managed to reproduce the TypeError: it stems from this line, which can return a ChunkedArray (its mask is then also chunked, which Arrow does not allow) when the embedded data is too big to fit in a standard Array.
I'm working on a fix.
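To illustrate the failure with toy data (hypothetical values, not the actual dataset): once the bytes column comes back as a ChunkedArray, its is_null() result is also chunked, and StructArray.from_arrays only accepts a plain boolean Array as a mask:
import pyarrow as pa

# toy stand-in for the overflowed bytes column: already a ChunkedArray
bytes_array = pa.chunked_array([pa.array([b"abc", None], type=pa.binary()), pa.array([b"def"], type=pa.binary())])
path_array = pa.array(["a.png", "b.png", "c.png"])

mask = bytes_array.is_null()  # ChunkedArray of booleans, not a plain pyarrow.Array
try:
    pa.StructArray.from_arrays([bytes_array.combine_chunks(), path_array], ["bytes", "path"], mask=mask)
except TypeError as err:
    print(err)  # Mask must be a pyarrow.Array of type boolean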
@yairl-dn You should be able to bypass this issue by reducing datasets.config.DEFAULT_MAX_BATCH_SIZE (1000 by default).
In Datasets 3.0, the Image storage format will be simplified, so this should be easier to fix then.
The same error occurs with my save_to_disk() of Audio() items. I still get it with:
import datasets
datasets.config.DEFAULT_MAX_BATCH_SIZE = 35
from datasets import Features, Array2D, Value, Dataset, Sequence, Audio
Saving the dataset (41/47 shards): 88%|██████████████████████████████████████████▉ | 297/339 [01:21<00:11, 3.65 examples/s]
Traceback (most recent call last):
File "/mnt/ddrive/prj/voice/voice-training-dataset-create/./dataset.py", line 155, in <module>
create_dataset(args)
File "/mnt/ddrive/prj/voice/voice-training-dataset-create/./dataset.py", line 137, in create_dataset
hf_dataset.save_to_disk(args.outds)
File "/home/j/src/py/datasets/src/datasets/arrow_dataset.py", line 1532, in save_to_disk
for job_id, done, content in Dataset._save_to_disk_single(**kwargs):
File "/home/j/src/py/datasets/src/datasets/arrow_dataset.py", line 1563, in _save_to_disk_single
writer.write_table(pa_table)
File "/home/j/src/py/datasets/src/datasets/arrow_writer.py", line 574, in write_table
pa_table = embed_table_storage(pa_table)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/home/j/src/py/datasets/src/datasets/table.py", line 2307, in embed_table_storage
arrays = [
^
File "/home/j/src/py/datasets/src/datasets/table.py", line 2308, in <listcomp>
embed_array_storage(table[name], feature) if require_storage_embed(feature) else table[name]
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/home/j/src/py/datasets/src/datasets/table.py", line 1831, in wrapper
return pa.chunked_array([func(chunk, *args, **kwargs) for chunk in array.chunks])
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/home/j/src/py/datasets/src/datasets/table.py", line 1831, in <listcomp>
return pa.chunked_array([func(chunk, *args, **kwargs) for chunk in array.chunks])
^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/home/j/src/py/datasets/src/datasets/table.py", line 2177, in embed_array_storage
return feature.embed_storage(array)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/home/j/src/py/datasets/src/datasets/features/audio.py", line 276, in embed_storage
storage = pa.StructArray.from_arrays([bytes_array, path_array], ["bytes", "path"], mask=bytes_array.is_null())
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "pyarrow/array.pxi", line 2850, in pyarrow.lib.StructArray.from_arrays
File "pyarrow/array.pxi", line 3290, in pyarrow.lib.c_mask_inverted_from_obj
TypeError: Mask must be a pyarrow.Array of type boolean
Similar to @jaggzh, setting datasets.config.DEFAULT_MAX_BATCH_SIZE did not help in my case (same error here, but for a different dataset: https://github.com/Stanford-AIMI/RRG24/issues/2).
This is also reproducible with this open dataset: https://huggingface.co/datasets/nlphuji/winogavil/discussions/1
Here's some code to do so:
import datasets
datasets.config.DEFAULT_MAX_BATCH_SIZE = 1
from datasets import load_dataset
ds = load_dataset("nlphuji/winogavil")
ds.save_to_disk("temp")
I've done some more debugging with datasets==2.18.0 (which incorporates PR #6283, as suggested by @lhoestq in the above dataset discussion), and it seems like the culprit might now be these lines: https://github.com/huggingface/datasets/blob/ca8409a8bec4508255b9c3e808d0751eb1005260/src/datasets/table.py#L2111-L2115
From what I understand (and apologies, I'm new to pyarrow), for an Image or Audio feature these lines recursively call embed_array_storage for a list of either feature, ending up in the feature's embed_storage function. For all values in the list, embed_storage reads the bytes if they're not already loaded.
The issue is that the list passed to the first recursive call is array.values, which is the underlying values of array regardless of array's slicing (as influenced by parameters such as datasets.config.DEFAULT_MAX_BATCH_SIZE). This produces the same overflowing list of bytes that results in the ChunkedArray being returned from embed_storage. Even if the array didn't overflow and this code ran without throwing an exception, it still seems incorrect to load all values when you ultimately only want a subset via ListArray.from_arrays(offsets, values): values thrown out now get loaded again in the next batch, and the current batch's values get loaded again during later batches.
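A toy example of the array.values vs. slicing distinction (plain pyarrow, not the datasets code; flatten() shown for contrast):
import pyarrow as pa

# toy ListArray: slicing it does not slice the underlying values buffer
arr = pa.array([[1, 2], [3, 4], [5, 6]])
sliced = arr.slice(0, 1)
print(len(sliced))            # 1 -> only the first list is visible
print(len(sliced.values))     # 6 -> .values still exposes every underlying element
print(len(sliced.flatten()))  # 2 -> flatten() respects the slice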
Maybe there's a fix where you could pass a mask to embed_storage such that it only loads the values you ultimately want for the current batch? Curious to see if you agree with this diagnosis of the problem and if you think this fix is viable, @mariosasko?
Would be nice if they had something similar to Dagshub's S3 sync; it worked like a charm for my bigger datasets.
I guess the proposed masking solution also just makes datasets.config.DEFAULT_MAX_BATCH_SIZE take effect by reducing the number of elements loaded; it does not address the underlying problem of trying to load all the images as bytes into a pyarrow array.
I'm happy to turn this into an actual PR, but here's what I've implemented locally in table.py:embed_array_storage to fix the above test case (nlphuji/winogavil) and my own use case:
elif pa.types.is_list(array.type):
    # feature must be either [subfeature] or Sequence(subfeature)
    # Merge offsets with the null bitmap to avoid the "Null bitmap with offsets slice not supported" ArrowNotImplementedError
    array_offsets = _combine_list_array_offsets_with_mask(array)
    # mask the underlying struct array so array_values.to_pylist()
    # fills None (see feature.embed_storage)
    idxs = np.arange(len(array.values))
    idxs = pa.ListArray.from_arrays(array_offsets, idxs).flatten()
    mask = np.ones(len(array.values), dtype=bool)
    mask[idxs] = False
    mask = pa.array(mask)
    # indexing 0 might be problematic, but not sure
    # how else to get arbitrary keys from a struct array
    array_keys = array.values[0].keys()
    # is array.values always a struct array?
    array_values = pa.StructArray.from_arrays(
        arrays=[array.values.field(k) for k in array_keys],
        names=array_keys,
        mask=mask,
    )
    # _e recursively embeds storage for the sub-feature (alias of embed_array_storage)
    if isinstance(feature, list):
        return pa.ListArray.from_arrays(array_offsets, _e(array_values, feature[0]))
    if isinstance(feature, Sequence) and feature.length == -1:
        return pa.ListArray.from_arrays(array_offsets, _e(array_values, feature.feature))
Again, though, I'm new to pyarrow, so this might not be the cleanest implementation, and I'm really not sure whether there are other cases where this solution doesn't work. Would love to get some feedback from the HF folks!
I have the same issue, with an audio dataset where file sizes vary significantly (~0.2-200 MB). Reducing datasets.config.DEFAULT_MAX_BATCH_SIZE doesn't help.
The problem still occurs. Huggingface sucks 🤮🤮🤮🤮
Came across this issue myself, with the same symptoms and reasons as everyone else; pa.array is returning a ChunkedArray in features.audio.Audio.embed_storage for my audio, which varies between ~1 MB and ~10 MB in size.
I would rather remove a troublesome file from my dataset than have to switch away from this library, but it would be difficult to identify which file(s) caused the issue, and it may just shift the issue down to another shard or another file anyway. So, I took the path of least resistance and simply dropped anything beyond the first chunk when this issue occurred, and added a warning to indicate what was dropped.
In the end I lost one file out of 105,024 samples and was able to complete the 1,479-shard dataset after only the one issue, on shard 228.
While this is certainly not an ideal solution, it does represent a much better user experience, and was acceptable for my use case. I'm going to test the Image portion and then open a pull request to propose this "lossy" behavior become the way these edge cases are handled (maybe behind an environment flag?) until someone like @mariosasko or others can formulate a more holistic solution.
My work-in-progress "fix": https://github.com/huggingface/datasets/compare/main...painebenjamin:datasets:main (https://github.com/painebenjamin/datasets)
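Roughly, the lossy fallback amounts to something like this (a hypothetical standalone helper for illustration, not the actual code from that branch):
import warnings
import pyarrow as pa

def keep_first_chunk(arr, column="bytes"):
    # If pa.array() spilled into a ChunkedArray, keep only the first chunk and
    # warn about what was dropped; otherwise pass the Array through untouched.
    # (Inside embed_storage, the other columns would also need trimming to match.)
    if isinstance(arr, pa.ChunkedArray):
        kept = arr.chunk(0)
        warnings.warn(
            f"'{column}' column overflowed into {arr.num_chunks} chunks; "
            f"keeping the first {len(kept)} of {len(arr)} rows."
        )
        return kept
    return arr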
Another option could be to use pa.large_binary instead of pa.binary in certain cases?