
Question: Is there any way to upload a large image dataset?

zhjohnchan opened this issue 1 year ago · 1 comment

I am uploading an image dataset like this:

from datasets import load_dataset, Sequence, Image

# Load the raw JSON annotations
dataset = load_dataset(
    "json",
    data_files={"train": "data/custom_dataset/train.json", "validation": "data/custom_dataset/val.json"},
)
# Cast the "images" column (lists of image paths) to actual image features
dataset = dataset.cast_column("images", Sequence(Image()))
dataset.push_to_hub("StanfordAIMI/custom_dataset", max_shard_size="1GB")

Most of the time is spent in the Map step. Do you think I can use multiprocessing to map all the image data into memory first? Map() accepts num_proc, but I can't find an equivalent option for push_to_hub or cast_column.

Thanks in advance!

Best,

zhjohnchan · Feb 21 '24
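One possible workaround is to parallelize the cast itself: Dataset.cast (unlike cast_column) accepts a num_proc argument. Below is a minimal sketch, assuming the "images" column holds lists of image paths; the extra "text" column in the schema is a hypothetical placeholder and should be adjusted to whatever fields the JSON actually contains:

from datasets import load_dataset, Features, Sequence, Image, Value

dataset = load_dataset(
    "json",
    data_files={"train": "data/custom_dataset/train.json", "validation": "data/custom_dataset/val.json"},
)

# cast() needs the full schema spelled out; "text" is a hypothetical
# placeholder for the other columns in the JSON files.
features = Features({
    "images": Sequence(Image()),
    "text": Value("string"),
})

# Dataset.cast exposes num_proc, so cast each split with multiple workers
# before pushing.
for split in dataset:
    dataset[split] = dataset[split].cast(features, num_proc=8)

dataset.push_to_hub("StanfordAIMI/custom_dataset", max_shard_size="1GB")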

import pandas as pd
from datasets import Dataset, Image

# Read the CSV file
data = pd.read_csv("XXXX.csv")

# Create a Hugging Face Dataset
dataset = Dataset.from_pandas(data)
dataset = dataset.cast_column("file_name", Image())

# Upload to Hugging Face Hub (make sure authentication is set up)
dataset.push_to_hub("XXXXX"")

The process gets stuck at "Casting the dataset". [Screenshot attached: 2024-05-02 11:44:50]

dirtycomputer · May 02 '24
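If the cast step hangs, one alternative worth trying (a sketch, not something confirmed in this thread) is to skip the pandas round trip entirely and use the documented "imagefolder" loader, which builds the image column directly from a metadata.csv, so no separate cast is needed. The folder name and repo id below are placeholders:

from datasets import load_dataset

# Expected layout (the documented "imagefolder" convention):
#   my_image_folder/
#     metadata.csv   <- must have a "file_name" column plus any labels
#     img001.png
#     img002.png
#     ...
dataset = load_dataset("imagefolder", data_dir="my_image_folder")

# The "image" column is already an Image() feature, so push directly.
dataset.push_to_hub("username/my_dataset")  # placeholder repo id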