.Jsonl metadata not detected
Describe the bug
Hi, I have the following directory structure:
|-- dataset
|   |-- images
|   |-- metadata1000.csv
|   |-- metadata1000.jsonl
|   |-- padded_images
Example of the metadata1000.jsonl file:
{"caption": "a drawing depicts a full shot of a black t-shirt with a triangular pattern on the front there is a white label on the left side of the triangle", "image": "images/212734.png", "gaussian_padded_image": "padded_images/p_212734.png"}
{"caption": "an eye-level full shot of a large elephant and a baby elephant standing in a watering hole on the left side is a small elephant with its head turned to the right of dry land, trees, and bushes", "image": "images/212735.png", "gaussian_padded_image": "padded_images/p_212735.png"}
...
I'm trying to use dataset = load_dataset("imagefolder", data_dir='/dataset/', split='train') to load the dataset, but it does not pick up the fields defined in metadata1000.jsonl. Please help me load the data properly.
I am also getting the following error:
File "/workspace/train_trans_vae.py", line 1089, in <module>
print(get_metadata_patterns('/dataset/'))
File "/opt/conda/lib/python3.10/site-packages/datasets/data_files.py", line 499, in get_metadata_patterns
raise FileNotFoundError(f"The directory at {base_path} doesn't contain any metadata file") from None
FileNotFoundError: The directory at /dataset/ doesn't contain any metadata file
when running:
from datasets.data_files import get_metadata_patterns
print(get_metadata_patterns('/dataset/'))
Steps to reproduce the bug
datasets version: 2.18.0. Create a similar JSONL file and a similar directory layout.
Expected behavior
A dataset object is created with the columns caption, image, and gaussian_padded_image.
Environment info
datasets version: 2.18.0
Hi! metadata.jsonl (or metadata.csv) is the only allowed name for the imagefolder's metadata files.
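As a minimal sketch of getting a file like metadata1000.jsonl into a shape imagefolder can pick up, the snippet below rewrites it as metadata.jsonl and renames one path column to file_name, which is the column imagefolder uses to link a metadata row to its image. The helper name write_imagefolder_metadata and its image_key parameter are hypothetical, not part of the datasets library, and the sketch assumes every row carries the chosen key.

```python
import json
from pathlib import Path


def write_imagefolder_metadata(src_jsonl, data_dir, image_key="image"):
    """Hypothetical helper (not part of `datasets`).

    Copies the rows of `src_jsonl` into `data_dir/metadata.jsonl`,
    renaming `image_key` to `file_name`. All other keys are kept as-is.
    """
    data_dir = Path(data_dir)
    out_path = data_dir / "metadata.jsonl"

    # Read everything first so the source may live in the same directory.
    with open(src_jsonl, encoding="utf-8") as src:
        rows = [json.loads(line) for line in src if line.strip()]

    with open(out_path, "w", encoding="utf-8") as dst:
        for row in rows:
            row = dict(row)
            # Assumption: every row has `image_key`; a KeyError is raised otherwise.
            row["file_name"] = row.pop(image_key)
            dst.write(json.dumps(row) + "\n")

    return out_path
```

After a conversion like this, load_dataset("imagefolder", data_dir="/dataset", split="train") should be able to find the metadata file. Only the file_name column is decoded as an image; other path columns such as gaussian_padded_image would remain plain strings.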
@mariosasko Hey, I also tried with metadata.jsonl and it still doesn't produce the right columns.
@mariosasko It says metadata.csv was not found when running:
dataset = load_dataset('/dataset',metadata.csv)
workspace
|-- source code
|-- dataset
|   |-- images
|   |-- metadata.csv
|   |-- metadata.jsonl
|   |-- padded_images
Example of the metadata.jsonl file:
{"caption": "a drawing depicts a full shot of a black t-shirt with a triangular pattern on the front there is a white label on the left side of the triangle", "image": "images/212734.png", "gaussian_padded_image": "padded_images/p_212734.png"}
{"caption": "an eye-level full shot of a large elephant and a baby elephant standing in a watering hole on the left side is a small elephant with its head turned to the right of dry land, trees, and bushes", "image": "images/212735.png", "gaussian_padded_image": "padded_images/p_212735.png"}
Loading more than one image per row with imagefolder is not supported currently. You can subscribe to https://github.com/huggingface/datasets/issues/5760 to see when it will be.
Instead, you can load the dataset with Dataset.from_generator:
import json
import os

from datasets import Dataset, Features, Image, Value

def gen():
    # Yield one example per line of the JSONL metadata file.
    with open("./dataset/metadata.jsonl") as f:
        for line in f:
            row = json.loads(line)
            yield {
                "caption": row["caption"],
                # Image paths in the metadata are relative to ./dataset.
                "image": os.path.join("./dataset", row["image"]),
                "gaussian_padded_image": os.path.join("./dataset", row["gaussian_padded_image"]),
            }

# Declaring both path columns as Image() makes them decode as images.
features = Features({"caption": Value("string"), "image": Image(), "gaussian_padded_image": Image()})
dataset = Dataset.from_generator(gen, features=features)
(E.g., if you want to share this dataset on the Hub, you can call dataset.push_to_hub(...) afterward)
Hi, thanks for sharing this. I was also trying a WebDataset format of the data, and it didn't work. Could you share how I can create a Dataset object from the WebDataset format of this data?