[Data] `ray.data.from_huggingface` does not work as expected
What happened + What you expected to happen
cc: @amogkam
HuggingFace datasets seems to do postprocessing on top of their datasets which does not get copied over when we create a Ray dataset using `ray.data.from_huggingface`. This is because their postprocessing isn't part of the underlying pyarrow table, but part of the dataset features.
This results in unexpected behavior when trying to migrate HuggingFace batch inference code to Ray, because we instantiate the Ray Dataset using the underlying `pa.Table` object.
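For reference, a small sketch (using the same PolyAI/minds14 dataset as Example 1 below) showing that the decoding logic lives in the dataset features rather than in the arrow schema; the printed reprs are expectations, not verified output:
from datasets import load_dataset
dataset = load_dataset("PolyAI/minds14", "en-US", split="train[:10]")
# The feature type carries the decoding logic that turns the raw {bytes, path}
# struct into {path, array, sampling_rate} on access.
print(dataset.features["audio"])
# e.g. Audio(sampling_rate=8000, mono=True, decode=True, id=None)
# The underlying pyarrow table only stores the raw struct.
print(dataset.data.schema.field("audio").type)
# struct<bytes: binary, path: string>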
A possible hacky workaround (see the examples) is to instead convert the HF dataset to e.g. pandas or numpy first, then to a Ray Dataset; currently we only support in-memory HF datasets anyway. Otherwise we could maybe call `dataset.features.batch_decode` ourselves inside `ray.data.from_huggingface` or something (see the sketch below).
This issue likely applies to most image and audio datasets on HF.
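A rough sketch of what that could look like (illustration only, untested; it mirrors the workaround in the comments at the end of this issue and is not current `from_huggingface` behavior):
import ray.data
from datasets import load_dataset
dataset = load_dataset("PolyAI/minds14", "en-US", split="train[:10]")
# Hypothetical: decode each row with the HF feature decoders after conversion,
# so 'audio' comes back decoded instead of as the raw {bytes, path} struct.
ds = ray.data.from_huggingface(dataset).map(dataset.features.decode_example)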
Example 1: Audio
from datasets import load_dataset
dataset = load_dataset("PolyAI/minds14", "en-US", split="train[:10]")
print(dataset)
# Dataset({
# features: ['path', 'audio', 'transcription', 'english_transcription', 'intent_class', 'lang_id'],
# num_rows: 10
# })
print(dataset['audio'][0])
# {
# 'path': '/mnt/shared_storage/rohan/huggingface/datasets/downloads/extracted/efdc32f0cf0171c560b244bfa7be6c76a7d7e26d8f0434d9122b20d881a479ff/en-US~JOINT_ACCOUNT/602ba55abb1e6d0fbce92065.wav',
# 'array': array([ 0. , 0.00024414, -0.00024414, ..., -0.00024414, 0. , 0. ])
# 'sampling_rate': 8000
# }
#
# Users expect dataset['audio'] to be a dict with keys [`path`, `array`, `sampling_rate`]
print(dataset.data.schema)
# path: string
# audio: struct<bytes: binary, path: string>
# child 0, bytes: binary
# child 1, path: string
# transcription: string
# english_transcription: string
# intent_class: int64
# lang_id: int64
# -- schema metadata --
# huggingface: '{"info": {"features": {"path": {"dtype": "string", "_type":' + 615
#
# Underlying table has dict with keys `bytes`, `path`
hf_pa_ds = dataset.with_format("arrow")
print(hf_pa_ds["audio"][0])
# <pyarrow.StructScalar: [('bytes', None), ('path', '/mnt/shared_storage/rohan/huggingface/datasets/downloads/extracted/efdc32f0cf0171c560b244bfa7be6c76a7d7e26d8f0434d9122b20d881a479ff/en-US~JOINT_ACCOUNT/602ba55abb1e6d0fbce92065.wav')]>
hf_df = dataset.with_format("pandas")
print(hf_df["audio"][0])
# {'path': '/mnt/shared_storage/rohan/huggingface/datasets/downloads/extracted/efdc32f0cf0171c560b244bfa7be6c76a7d7e26d8f0434d9122b20d881a479ff/en-US~JOINT_ACCOUNT/602ba55abb1e6d0fbce92065.wav', 'array': array([ 0. , 0.00024414, -0.00024414, ..., -0.00024414,
# 0. , 0. ]), 'sampling_rate': 8000}
# HuggingFace implements most of these conversions using `dataset.features.batch_decode`,
# which is called every time you access an HF dataset, and converts the underlying pyarrow
# row to the expected output format.
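# For illustration, a sketch (untested) of triggering the same decoding manually
# on a raw arrow row via `dataset.features.decode_example`:
raw_row = {k: v[0] for k, v in hf_pa_ds[:1].to_pydict().items()}
decoded = dataset.features.decode_example(raw_row)
print(decoded['audio'])
# Expected to match dataset['audio'][0] above (keys 'path', 'array', 'sampling_rate').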
# Possible Workaround: Convert the HF dataset to pandas, then convert it to a Ray Dataset
# Currently we only support in-memory HF datasets,
# not memory-mapped or streaming, so this would work for now
import ray.data
ds1 = ray.data.from_huggingface(dataset)
ds2 = ray.data.from_arrow(hf_pa_ds)
ds3 = ray.data.from_pandas(hf_df)
print(ds1.take(limit=1)[0]['audio'])
# {'bytes': None, 'path': '/mnt/shared_storage/rohan/huggingface/datasets/downloads/extracted/efdc32f0cf0171c560b244bfa7be6c76a7d7e26d8f0434d9122b20d881a479ff/en-US~JOINT_ACCOUNT/602ba55abb1e6d0fbce92065.wav'}
print(ds2.take(limit=1)[0]['audio'])
# {'bytes': None, 'path': '/mnt/shared_storage/rohan/huggingface/datasets/downloads/extracted/efdc32f0cf0171c560b244bfa7be6c76a7d7e26d8f0434d9122b20d881a479ff/en-US~JOINT_ACCOUNT/602ba55abb1e6d0fbce92065.wav'}
print(ds3.take(limit=1)[0]['audio'])
# {'path': '/mnt/shared_storage/rohan/huggingface/datasets/downloads/extracted/efdc32f0cf0171c560b244bfa7be6c76a7d7e26d8f0434d9122b20d881a479ff/en-US~JOINT_ACCOUNT/602ba55abb1e6d0fbce92065.wav', 'array': array([ 0. , 0.00024414, -0.00024414, ..., -0.00024414,
# 0. , 0. ]), 'sampling_rate': 8000}
Example 2: Image
from datasets import load_dataset
dataset = load_dataset("frgfm/imagenette", '160px', split="validation")
print(dataset)
# Dataset({
# features: ['image', 'label'],
# num_rows: 3925
# })
print(dataset['image'][0])
# <PIL.JpegImagePlugin.JpegImageFile image mode=RGB size=164x160 at 0x7F4AE766E400>
print(dataset.data.schema)
# image: struct<bytes: binary, path: string>
# child 0, bytes: binary
# child 1, path: string
# label: int64
# -- schema metadata --
# huggingface: '{"info": {"features": {"image": {"_type": "Image"}, "label"' + 180
#
hf_pa_ds = dataset.with_format("arrow")
print(hf_pa_ds["image"][0])
hf_df = dataset.with_format("pandas")
print(hf_df["image"][0])
import ray.data
ds1 = ray.data.from_huggingface(dataset)
ds2 = ray.data.from_arrow(hf_pa_ds)
ds3 = ray.data.from_pandas(hf_df)
print(ds1.take(limit=1)[0]['image'])
print(ds2.take(limit=1)[0]['image'])
print(ds3.take(limit=1)[0]['image'])
Versions / Dependencies
Ray 2.5.0
Reproduction script
See above
Issue Severity
Medium: It is a significant difficulty but I can work around it.
Can you try again on master?
Sorry, should have clarified: this is on current master. Also, `ds2` in the example is doing the same thing as master.
Actually, found a better solution:
ds2 = ray.data.from_arrow(hf_pa_ds).map(hf_pa_ds.features.decode_example)
This lazily applies the feature transforms used by HuggingFace to the Ray dataset.
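For completeness, a minimal end-to-end sketch of that workaround on the audio dataset from Example 1 (a sketch; the expected output is inferred from the examples above, not re-verified):
from datasets import load_dataset
import ray.data
dataset = load_dataset("PolyAI/minds14", "en-US", split="train[:10]")
hf_pa_ds = dataset.with_format("arrow")
# Build the Ray dataset from the raw arrow data, then lazily apply the HF
# feature decoders on each row.
ds = ray.data.from_arrow(hf_pa_ds).map(hf_pa_ds.features.decode_example)
print(ds.take(limit=1)[0]['audio'])
# Should now contain 'path', 'array', and 'sampling_rate', matching ds3 above.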