Error when saving to disk a dataset of images
Describe the bug
Hello!
I have an issue when I try to save my dataset of images to disk. The error I get is:
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
File "/home/jplu/miniconda3/envs/image-xp/lib/python3.10/site-packages/datasets/arrow_dataset.py", line 1442, in save_to_disk
for job_id, done, content in Dataset._save_to_disk_single(**kwargs):
File "/home/jplu/miniconda3/envs/image-xp/lib/python3.10/site-packages/datasets/arrow_dataset.py", line 1473, in _save_to_disk_single
writer.write_table(pa_table)
File "/home/jplu/miniconda3/envs/image-xp/lib/python3.10/site-packages/datasets/arrow_writer.py", line 570, in write_table
pa_table = embed_table_storage(pa_table)
File "/home/jplu/miniconda3/envs/image-xp/lib/python3.10/site-packages/datasets/table.py", line 2268, in embed_table_storage
arrays = [
File "/home/jplu/miniconda3/envs/image-xp/lib/python3.10/site-packages/datasets/table.py", line 2269, in <listcomp>
embed_array_storage(table[name], feature) if require_storage_embed(feature) else table[name]
File "/home/jplu/miniconda3/envs/image-xp/lib/python3.10/site-packages/datasets/table.py", line 1817, in wrapper
return pa.chunked_array([func(chunk, *args, **kwargs) for chunk in array.chunks])
File "/home/jplu/miniconda3/envs/image-xp/lib/python3.10/site-packages/datasets/table.py", line 1817, in <listcomp>
return pa.chunked_array([func(chunk, *args, **kwargs) for chunk in array.chunks])
File "/home/jplu/miniconda3/envs/image-xp/lib/python3.10/site-packages/datasets/table.py", line 2142, in embed_array_storage
return feature.embed_storage(array)
File "/home/jplu/miniconda3/envs/image-xp/lib/python3.10/site-packages/datasets/features/image.py", line 269, in embed_storage
storage = pa.StructArray.from_arrays([bytes_array, path_array], ["bytes", "path"], mask=bytes_array.is_null())
File "pyarrow/array.pxi", line 2766, in pyarrow.lib.StructArray.from_arrays
File "pyarrow/array.pxi", line 2961, in pyarrow.lib.c_mask_inverted_from_obj
TypeError: Mask must be a pyarrow.Array of type boolean
My dataset is around 50K images; could this error be due to a bad image?
Thanks for the help.
Steps to reproduce the bug
from datasets import load_dataset
dataset = load_dataset("imagefolder", data_dir="/path/to/dataset")
dataset["train"].save_to_disk("./myds", num_shards=40)
Expected behavior
Having my dataset properly saved to disk.
Environment info
- datasets version: 2.11.0
- Platform: Linux-5.15.90.1-microsoft-standard-WSL2-x86_64-with-glibc2.35
- Python version: 3.10.10
- Huggingface_hub version: 0.13.3
- PyArrow version: 11.0.0
- Pandas version: 2.0.0
Looks like as long as the number of shards makes a batch lower than 1000 images, it works. In my training set I have 40K images. If I use num_shards=40 (batches of 1000 images) I get the error, but if I update it to num_shards=50 (batches of 800 images) it works.
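For illustration, continuing from the repro snippet above, the only change is the shard count:
dataset["train"].save_to_disk("./myds", num_shards=50)  # ~800 images per shard: works
# dataset["train"].save_to_disk("./myds", num_shards=40)  # ~1000 images per shard: raises the TypeError above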
I will be happy to share my dataset privately if it can help to better debug.
Hi! I didn't manage to reproduce this behavior, so sharing the dataset with us would help a lot.
"My dataset is around 50K images; could this error be due to a bad image?"
This shouldn't be the case as we save raw data to disk without decoding it.
OK, thanks! The dataset is currently hosted on a gcs bucket. How would you like to proceed for sharing the link?
You could follow this procedure or upload the dataset to Google Drive (50K images is not that much unless high-res) and send me an email with the link.
Thanks @mariosasko. I just sent you the GDrive link.
Thanks @jplu! I managed to reproduce the TypeError: it stems from this line, which can return a ChunkedArray (its mask is then also chunked, which Arrow does not allow) when the embedded data is too big to fit in a standard Array.
I'm working on a fix.
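To illustrate the failure with toy data (hypothetical values, not the actual dataset): once the bytes column comes back as a ChunkedArray, its is_null() result is also chunked, and StructArray.from_arrays only accepts a plain boolean Array as a mask:
import pyarrow as pa

# toy stand-in for the overflowed bytes column: already a ChunkedArray
bytes_array = pa.chunked_array([pa.array([b"abc", None], type=pa.binary()), pa.array([b"def"], type=pa.binary())])
path_array = pa.array(["a.png", "b.png", "c.png"])

mask = bytes_array.is_null()  # ChunkedArray of booleans, not a plain pyarrow.Array
try:
    pa.StructArray.from_arrays([bytes_array.combine_chunks(), path_array], ["bytes", "path"], mask=mask)
except TypeError as err:
    print(err)  # Mask must be a pyarrow.Array of type boolean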
@yairl-dn You should be able to bypass this issue by reducing datasets.config.DEFAULT_MAX_BATCH_SIZE (1000 by default).
In Datasets 3.0, the Image storage format will be simplified, so this should be easier to fix then.
The same error occurs with my save_to_disk() of Audio() items. I still get it with:
import datasets
datasets.config.DEFAULT_MAX_BATCH_SIZE = 35
from datasets import Features, Array2D, Value, Dataset, Sequence, Audio
Saving the dataset (41/47 shards): 88%|██████████████████████████████████████████▉ | 297/339 [01:21<00:11, 3.65 examples/s]
Traceback (most recent call last):
File "/mnt/ddrive/prj/voice/voice-training-dataset-create/./dataset.py", line 155, in <module>
create_dataset(args)
File "/mnt/ddrive/prj/voice/voice-training-dataset-create/./dataset.py", line 137, in create_dataset
hf_dataset.save_to_disk(args.outds)
File "/home/j/src/py/datasets/src/datasets/arrow_dataset.py", line 1532, in save_to_disk
for job_id, done, content in Dataset._save_to_disk_single(**kwargs):
File "/home/j/src/py/datasets/src/datasets/arrow_dataset.py", line 1563, in _save_to_disk_single
writer.write_table(pa_table)
File "/home/j/src/py/datasets/src/datasets/arrow_writer.py", line 574, in write_table
pa_table = embed_table_storage(pa_table)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/home/j/src/py/datasets/src/datasets/table.py", line 2307, in embed_table_storage
arrays = [
^
File "/home/j/src/py/datasets/src/datasets/table.py", line 2308, in <listcomp>
embed_array_storage(table[name], feature) if require_storage_embed(feature) else table[name]
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/home/j/src/py/datasets/src/datasets/table.py", line 1831, in wrapper
return pa.chunked_array([func(chunk, *args, **kwargs) for chunk in array.chunks])
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/home/j/src/py/datasets/src/datasets/table.py", line 1831, in <listcomp>
return pa.chunked_array([func(chunk, *args, **kwargs) for chunk in array.chunks])
^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/home/j/src/py/datasets/src/datasets/table.py", line 2177, in embed_array_storage
return feature.embed_storage(array)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/home/j/src/py/datasets/src/datasets/features/audio.py", line 276, in embed_storage
storage = pa.StructArray.from_arrays([bytes_array, path_array], ["bytes", "path"], mask=bytes_array.is_null())
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "pyarrow/array.pxi", line 2850, in pyarrow.lib.StructArray.from_arrays
File "pyarrow/array.pxi", line 3290, in pyarrow.lib.c_mask_inverted_from_obj
TypeError: Mask must be a pyarrow.Array of type boolean
Similar to @jaggzh, setting datasets.config.DEFAULT_MAX_BATCH_SIZE did not help in my case (same error here, but for a different dataset: https://github.com/Stanford-AIMI/RRG24/issues/2).
This is also reproducible with this open dataset: https://huggingface.co/datasets/nlphuji/winogavil/discussions/1
Here's some code to do so:
import datasets
datasets.config.DEFAULT_MAX_BATCH_SIZE = 1
from datasets import load_dataset
ds = load_dataset("nlphuji/winogavil")
ds.save_to_disk("temp")
I've done some more debugging with datasets==2.18.0 (which incorporates PR #6283, as suggested by @lhoestq in the above dataset discussion), and it seems like the culprit might now be these lines: https://github.com/huggingface/datasets/blob/ca8409a8bec4508255b9c3e808d0751eb1005260/src/datasets/table.py#L2111-L2115
From what I understand (and apologies, I'm new to pyarrow), for an Image or Audio feature these lines recursively call embed_array_storage for a list of either feature, ending up in the feature's embed_storage function. For all values in the list, embed_storage reads the bytes if they're not already loaded.
The issue is that the list passed to the first recursive call is array.values, which is the underlying values of array regardless of array's slicing (as influenced by parameters such as datasets.config.DEFAULT_MAX_BATCH_SIZE). This produces the same overflowing list of bytes that results in the ChunkedArray being returned from embed_storage. Even if the array didn't overflow and this code ran without throwing an exception, it still seems incorrect to load all values when you ultimately only want a subset via ListArray.from_arrays(offsets, values): values thrown out now get loaded again in the next batch, and the current batch's values get loaded again during later batches.
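A toy example of the array.values vs. slicing distinction (plain pyarrow, not the datasets code; flatten() shown for contrast):
import pyarrow as pa

# toy ListArray: slicing it does not slice the underlying values buffer
arr = pa.array([[1, 2], [3, 4], [5, 6]])
sliced = arr.slice(0, 1)
print(len(sliced))            # 1 -> only the first list is visible
print(len(sliced.values))     # 6 -> .values still exposes every underlying element
print(len(sliced.flatten()))  # 2 -> flatten() respects the slice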
Maybe there's a fix where you could pass a mask to embed_storage such that it only loads the values you ultimately want for the current batch? Curious to see if you agree with this diagnosis of the problem and if you think this fix is viable, @mariosasko?
Would be nice if they had something similar to Dagshub's S3 sync; it worked like a charm for my bigger datasets.
I guess the proposed masking solution also just makes datasets.config.DEFAULT_MAX_BATCH_SIZE take effect by reducing the number of elements loaded; it does not address the underlying problem of trying to load all the images as bytes into a pyarrow array.
I'm happy to turn this into an actual PR, but here's what I've implemented locally in table.py:embed_array_storage to fix the above test case (nlphuji/winogavil) and my own use case:
elif pa.types.is_list(array.type):
    # feature must be either [subfeature] or Sequence(subfeature)
    # Merge offsets with the null bitmap to avoid the "Null bitmap with offsets slice not supported" ArrowNotImplementedError
    array_offsets = _combine_list_array_offsets_with_mask(array)
    # mask the underlying struct array so array_values.to_pylist()
    # fills None (see feature.embed_storage)
    idxs = np.arange(len(array.values))
    idxs = pa.ListArray.from_arrays(array_offsets, idxs).flatten()
    mask = np.ones(len(array.values), dtype=bool)
    mask[idxs] = False
    mask = pa.array(mask)
    # indexing 0 might be problematic, but not sure
    # how else to get arbitrary keys from a struct array
    array_keys = array.values[0].keys()
    # is array.values always a struct array?
    array_values = pa.StructArray.from_arrays(
        arrays=[array.values.field(k) for k in array_keys],
        names=array_keys,
        mask=mask,
    )
    # _e recursively embeds storage for the sub-feature (alias of embed_array_storage)
    if isinstance(feature, list):
        return pa.ListArray.from_arrays(array_offsets, _e(array_values, feature[0]))
    if isinstance(feature, Sequence) and feature.length == -1:
        return pa.ListArray.from_arrays(array_offsets, _e(array_values, feature.feature))
Again, though, I'm new to pyarrow, so this might not be the cleanest implementation, and I'm really not sure whether there are other cases where this solution doesn't work. Would love to get some feedback from the HF folks!
I have the same issue, with an audio dataset where file sizes vary significantly (~0.2-200 MB). Reducing datasets.config.DEFAULT_MAX_BATCH_SIZE doesn't help.
The problem still occurs. Huggingface sucks 🤮🤮🤮🤮
Came across this issue myself, with the same symptoms and reasons as everyone else; pa.array is returning a ChunkedArray in features.audio.Audio.embed_storage for my audio, which varies between ~1 MB and ~10 MB in size.
I would rather remove a troublesome file from my dataset than have to switch away from this library, but it would be difficult to identify which file(s) caused the issue, and it may just shift the issue down to another shard or another file anyway. So, I took the path of least resistance and simply dropped anything beyond the first chunk when this issue occurred, and added a warning to indicate what was dropped.
In the end I lost one file out of 105,024 samples and was able to complete the 1,479-shard dataset after only the one issue, on shard 228.
While this is certainly not an ideal solution, it does represent a much better user experience, and was acceptable for my use case. I'm going to test the Image portion and then open a pull request to propose this "lossy" behavior become the way these edge cases are handled (maybe behind an environment flag?) until someone like @mariosasko or others can formulate a more holistic solution.
My work-in-progress "fix": https://github.com/huggingface/datasets/compare/main...painebenjamin:datasets:main (https://github.com/painebenjamin/datasets)
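Roughly, the lossy fallback amounts to something like this (a hypothetical standalone helper for illustration, not the actual code from that branch):
import warnings
import pyarrow as pa

def keep_first_chunk(arr, column="bytes"):
    # If pa.array() spilled into a ChunkedArray, keep only the first chunk and
    # warn about what was dropped; otherwise pass the Array through untouched.
    # (Inside embed_storage, the other columns would also need trimming to match.)
    if isinstance(arr, pa.ChunkedArray):
        kept = arr.chunk(0)
        warnings.warn(
            f"'{column}' column overflowed into {arr.num_chunks} chunks; "
            f"keeping the first {len(kept)} of {len(arr)} rows."
        )
        return kept
    return arr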
Another option could be to use pa.large_binary instead of pa.binary in certain cases?