`remove_columns` on a streaming dataset raises a LibsndfileError on multichannel audio
Describe the bug
When a Hugging Face dataset is loaded in streaming mode and some columns are removed, fetching a sample fails if the audio has more than one channel. I suspect that the time and channel axes are being swapped or concatenated.
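To make the suspected axis swap concrete, here is a minimal sketch of how `soundfile` reads the shape of the array it is asked to write. It assumes, without confirmation from the `datasets` source, that the decoded multichannel array is stored channels-first. `layout_seen_by_soundfile` is a hypothetical helper written for this illustration. It only mirrors the documented `soundfile` convention that 1-D data is mono and 2-D data is `(frames, channels)`.

```python
# Hypothetical helper, not part of datasets or soundfile: it only mirrors the
# documented convention that soundfile treats 2-D data as (frames, channels).
def layout_seen_by_soundfile(shape):
    """Return (frames, channels) as soundfile would read an array of this shape."""
    if len(shape) == 1:
        return shape[0], 1  # 1-D data is treated as mono
    frames, channels = shape
    return frames, channels


# Assumed example: one second of two-channel audio at 16 kHz, stored channels-first.
channels_first = (2, 16000)

# Passed as-is, the array is read as 2 frames of 16000 channels.
print(layout_seen_by_soundfile(channels_first))

# Transposed to (frames, channels), it is read as 16000 frames of 2 channels.
print(layout_seen_by_soundfile(channels_first[::-1]))
```

If that assumption holds, the channel count passed to libsndfile would be the number of samples, which could plausibly explain the "Format not recognised" error. Transposing the array before `sf.write` would then be the natural thing to try.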
Steps to reproduce the bug
Minimal code to reproduce the error:
from datasets import load_dataset
dataset_name = "zinc75/Vibravox_dummy"
config_name = "BWE_Larynx_microphone"
# if we use "ASR_Larynx_microphone" subset which is a monochannel audio, no error is thrown.
dataset = load_dataset(
    path=dataset_name, name=config_name, split="train", streaming=True
)
dataset = dataset.remove_columns(["sensor_id"])
# dataset = dataset.map(lambda x:x, remove_columns=["sensor_id"])
# The commented version does not produce an error, but loses the dataset features.
sample = next(iter(dataset))
Error:
Traceback (most recent call last):
File "/home/julien/Bureau/github/vibravox/tmp.py", line 15, in <module>
sample = next(iter(dataset))
^^^^^^^^^^^^^^^^^^^
File "/home/julien/.pyenv/versions/vibravox/lib/python3.11/site-packages/datasets/iterable_dataset.py", line 1392, in __iter__
example = _apply_feature_types_on_example(
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/home/julien/.pyenv/versions/vibravox/lib/python3.11/site-packages/datasets/iterable_dataset.py", line 1080, in _apply_feature_types_on_example
encoded_example = features.encode_example(example)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/home/julien/.pyenv/versions/vibravox/lib/python3.11/site-packages/datasets/features/features.py", line 1889, in encode_example
return encode_nested_example(self, example)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/home/julien/.pyenv/versions/vibravox/lib/python3.11/site-packages/datasets/features/features.py", line 1244, in encode_nested_example
{k: encode_nested_example(schema[k], obj.get(k), level=level + 1) for k in schema}
File "/home/julien/.pyenv/versions/vibravox/lib/python3.11/site-packages/datasets/features/features.py", line 1244, in <dictcomp>
{k: encode_nested_example(schema[k], obj.get(k), level=level + 1) for k in schema}
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/home/julien/.pyenv/versions/vibravox/lib/python3.11/site-packages/datasets/features/features.py", line 1300, in encode_nested_example
return schema.encode_example(obj) if obj is not None else None
^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/home/julien/.pyenv/versions/vibravox/lib/python3.11/site-packages/datasets/features/audio.py", line 98, in encode_example
sf.write(buffer, value["array"], value["sampling_rate"], format="wav")
File "/home/julien/.pyenv/versions/vibravox/lib/python3.11/site-packages/soundfile.py", line 343, in write
with SoundFile(file, 'w', samplerate, channels,
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/home/julien/.pyenv/versions/vibravox/lib/python3.11/site-packages/soundfile.py", line 658, in __init__
self._file = self._open(file, mode_int, closefd)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/home/julien/.pyenv/versions/vibravox/lib/python3.11/site-packages/soundfile.py", line 1216, in _open
raise LibsndfileError(err, prefix="Error opening {0!r}: ".format(self.name))
soundfile.LibsndfileError: Error opening <_io.BytesIO object at 0x7fd795d24680>: Format not recognised.
Process finished with exit code 1
Expected behavior
I would expect this code to run without error.
Environment info
- `datasets` version: 2.18.0
- Platform: Linux-6.5.0-21-generic-x86_64-with-glibc2.35
- Python version: 3.11.0
- `huggingface_hub` version: 0.21.3
- PyArrow version: 15.0.0
- Pandas version: 2.2.1
- `fsspec` version: 2023.10.0