Question: Is there any way to upload a large image dataset?
I am uploading an image dataset like this:
from datasets import Image, Sequence, load_dataset

dataset = load_dataset(
    "json",
    data_files={"train": "data/custom_dataset/train.json", "validation": "data/custom_dataset/val.json"},
)
dataset = dataset.cast_column("images", Sequence(Image()))
dataset.push_to_hub("StanfordAIMI/custom_dataset", max_shard_size="1GB")
where it takes a long time in the Map process. Do you think I can use multiprocessing to map all the image data into memory first? For the map() function I can set num_proc, but I cannot find an equivalent option for push_to_hub or cast_column.
Thanks in advance!
Best,
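One thing worth trying (a minimal sketch, not from the original post): apply the cast per split with Dataset.cast(), which in recent datasets versions accepts a num_proc argument, instead of cast_column(). The cast_images helper and the num_proc=8 value below are illustrative; double-check that your installed datasets version exposes num_proc on cast().

from datasets import DatasetDict, Image, Sequence, load_dataset

dataset = load_dataset(
    "json",
    data_files={
        "train": "data/custom_dataset/train.json",
        "validation": "data/custom_dataset/val.json",
    },
)

def cast_images(split, num_proc=8):
    # Build the target schema by swapping in the Image feature for "images".
    features = split.features.copy()
    features["images"] = Sequence(Image())
    # cast() re-encodes the split with the new schema; num_proc (if supported
    # by your datasets version) spreads that work over several processes.
    return split.cast(features, num_proc=num_proc)

dataset = DatasetDict({name: cast_images(split) for name, split in dataset.items()})
dataset.push_to_hub("StanfordAIMI/custom_dataset", max_shard_size="1GB")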
import pandas as pd
from datasets import Dataset, Image
# Read the CSV file
data = pd.read_csv("XXXX.csv")
# Create a Hugging Face Dataset
dataset = Dataset.from_pandas(data)
dataset = dataset.cast_column("file_name", Image())
# Upload to Hugging Face Hub (make sure authentication is set up)
dataset.push_to_hub("XXXXX")
It gets stuck at the "Casting the dataset" step.