datasets: `remove_columns` on a streaming dataset raises a LibsndfileError on multichannel audio

Open jhauret opened this issue 11 months ago • 2 comments

Describe the bug

When a HF dataset is loaded in streaming mode and some columns are removed, loading a sample fails if the audio contains more than one channel. I suspect that the time and channel axes are swapped or concatenated somewhere along the way.
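
To illustrate the suspicion about swapped axes, here is a small sketch. The soundfile documentation describes 2-D input to `sf.write` as (frames, channels), with 1-D input treated as mono. If the decoded multichannel array is held as (channels, samples), which is an assumption I have not verified in the `datasets` code, the writer would see a file with very few frames and a huge number of channels. `soundfile_layout` is a hypothetical helper written only for this illustration, and the shapes are made-up values.

```python
def soundfile_layout(shape):
    """Return (frames, channels) as soundfile.write would read `shape`.

    Hypothetical helper for illustration only; it is not part of
    soundfile or datasets.
    """
    if len(shape) == 1:
        # 1-D data is treated as mono.
        return shape[0], 1
    if len(shape) == 2:
        # 2-D data is read as (frames, channels).
        return shape[0], shape[1]
    raise ValueError(f"expected 1-D or 2-D audio data, got shape {shape}")


# Illustrative values: a 1-second stereo clip at 16 kHz.
channels_first = (2, 16000)  # (channels, samples), the assumed in-memory layout
frames_first = (16000, 2)    # (samples, channels), the layout soundfile expects

print(soundfile_layout(channels_first))  # (2, 16000): 2 frames, 16000 channels
print(soundfile_layout(frames_first))    # (16000, 2): 16000 frames, 2 channels
print(soundfile_layout((16000,)))        # (16000, 1): mono is unaffected
```

A request for a WAV file with thousands of channels would be consistent with the "Format not recognised" error, and the mono case being unaffected would match the behaviour of the monochannel subset. If this is indeed the cause, transposing the array before it is written should avoid the error.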

Steps to reproduce the bug

Minimal code to reproduce the error:

from datasets import load_dataset

dataset_name = "zinc75/Vibravox_dummy"
config_name = "BWE_Larynx_microphone"
# With the "ASR_Larynx_microphone" subset, which contains monochannel audio, no error is raised.

dataset = load_dataset(
    path=dataset_name, name=config_name, split="train", streaming=True
)


dataset = dataset.remove_columns(["sensor_id"])
# dataset = dataset.map(lambda x: x, remove_columns=["sensor_id"])
# The commented-out version does not raise an error, but it loses the dataset features.
sample = next(iter(dataset))

Error:

Traceback (most recent call last):
  File "/home/julien/Bureau/github/vibravox/tmp.py", line 15, in <module>
    sample = next(iter(dataset))
             ^^^^^^^^^^^^^^^^^^^
  File "/home/julien/.pyenv/versions/vibravox/lib/python3.11/site-packages/datasets/iterable_dataset.py", line 1392, in __iter__
    example = _apply_feature_types_on_example(
              ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/julien/.pyenv/versions/vibravox/lib/python3.11/site-packages/datasets/iterable_dataset.py", line 1080, in _apply_feature_types_on_example
    encoded_example = features.encode_example(example)
                      ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/julien/.pyenv/versions/vibravox/lib/python3.11/site-packages/datasets/features/features.py", line 1889, in encode_example
    return encode_nested_example(self, example)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/julien/.pyenv/versions/vibravox/lib/python3.11/site-packages/datasets/features/features.py", line 1244, in encode_nested_example
    {k: encode_nested_example(schema[k], obj.get(k), level=level + 1) for k in schema}
  File "/home/julien/.pyenv/versions/vibravox/lib/python3.11/site-packages/datasets/features/features.py", line 1244, in <dictcomp>
    {k: encode_nested_example(schema[k], obj.get(k), level=level + 1) for k in schema}
        ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/julien/.pyenv/versions/vibravox/lib/python3.11/site-packages/datasets/features/features.py", line 1300, in encode_nested_example
    return schema.encode_example(obj) if obj is not None else None
           ^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/julien/.pyenv/versions/vibravox/lib/python3.11/site-packages/datasets/features/audio.py", line 98, in encode_example
    sf.write(buffer, value["array"], value["sampling_rate"], format="wav")
  File "/home/julien/.pyenv/versions/vibravox/lib/python3.11/site-packages/soundfile.py", line 343, in write
    with SoundFile(file, 'w', samplerate, channels,
         ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/julien/.pyenv/versions/vibravox/lib/python3.11/site-packages/soundfile.py", line 658, in __init__
    self._file = self._open(file, mode_int, closefd)
                 ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/julien/.pyenv/versions/vibravox/lib/python3.11/site-packages/soundfile.py", line 1216, in _open
    raise LibsndfileError(err, prefix="Error opening {0!r}: ".format(self.name))
soundfile.LibsndfileError: Error opening <_io.BytesIO object at 0x7fd795d24680>: Format not recognised.

Process finished with exit code 1

Expected behavior

I would expect this code to run without error.

Environment info

datasets version: 2.18.0
Platform: Linux-6.5.0-21-generic-x86_64-with-glibc2.35
Python version: 3.11.0
huggingface_hub version: 0.21.3
PyArrow version: 15.0.0
Pandas version: 2.2.1
fsspec version: 2023.10.0

Mar 05 '24 09:03 jhauret
