datasets: .jsonl metadata not detected

Describe the bug

Example of metadata1000.jsonl file:

    {"caption": "a drawing depicts a full shot of a black t-shirt with a triangular pattern on the front there is a white label on the left side of the triangle", "image": "images/212734.png", "gaussian_padded_image": "padded_images/p_212734.png"}
    {"caption": "an eye-level full shot of a large elephant and a baby elephant standing in a watering hole on the left side is a small elephant with its head turned to the right of dry land, trees, and bushes", "image": "images/212735.png", "gaussian_padded_image": "padded_images/p_212735.png"}
    . . .

I'm trying to use dataset = load_dataset("imagefolder", data_dir='/dataset/', split='train') to load the dataset, but it does not pick up the fields in metadata1000.jsonl. Please help me load the data properly.

also getting

  File "/workspace/train_trans_vae.py", line 1089, in <module>
    print(get_metadata_patterns('/dataset/'))
  File "/opt/conda/lib/python3.10/site-packages/datasets/data_files.py", line 499, in get_metadata_patterns
    raise FileNotFoundError(f"The directory at {base_path} doesn't contain any metadata file") from None
FileNotFoundError: The directory at /dataset/ doesn't contain any metadata file

when trying

    from datasets.data_files import get_metadata_patterns
    print(get_metadata_patterns('/dataset/'))

Steps to reproduce the bug

datasets version: 2.18.0. Create a similar JSONL file and a similar directory layout.

Expected behavior

A dataset object is created with the columns caption, image, and gaussian_padded_image.

Environment info

datasets version: 2.18.0

Apr 04 '24 06:04 nighting0le01

Hi! metadata.jsonl (or metadata.csv) is the only allowed name for the imagefolder's metadata files.
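For the single-image case, preparing such a file could look like the sketch below. It assumes the documented imagefolder convention that each metadata row links to its image through a file_name column. The helper name write_imagefolder_metadata and its parameters are hypothetical and not part of datasets.

```python
import json
from pathlib import Path


def write_imagefolder_metadata(src, dst_dir, image_key="image"):
    """Write <dst_dir>/metadata.jsonl from a JSONL file at `src`,
    renaming `image_key` to `file_name` in every row.

    Hypothetical helper for illustration; returns the number of rows written.
    """
    # Read everything first so that src and the destination may be the same file.
    rows = []
    with open(src, encoding="utf-8") as fin:
        for line in fin:
            if not line.strip():
                continue  # skip blank lines
            row = json.loads(line)
            row["file_name"] = row.pop(image_key)
            rows.append(row)

    dst = Path(dst_dir) / "metadata.jsonl"
    with open(dst, "w", encoding="utf-8") as fout:
        for row in rows:
            fout.write(json.dumps(row) + "\n")
    return len(rows)
```

Calling write_imagefolder_metadata("/dataset/metadata1000.jsonl", "/dataset") would produce /dataset/metadata.jsonl. Any other path-valued column, such as gaussian_padded_image, is carried over unchanged as a plain string.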

Apr 04 '24 13:04 mariosasko

@mariosasko Hey, I tried with metadata.jsonl as well and it still doesn't get the right columns.

Apr 04 '24 14:04 nighting0le01

@mariosasko it says metadata.csv not found

dataset = load_dataset('/dataset',metadata.csv)

    | workspace
    || source code
    | dataset
    | |-- images
    | |-- metadata.csv
    | |-- metadata.jsonl
    | |-- padded_images

Example of metadata.jsonl file:

    {"caption": "a drawing depicts a full shot of a black t-shirt with a triangular pattern on the front there is a white label on the left side of the triangle", "image": "images/212734.png", "gaussian_padded_image": "padded_images/p_212734.png"}
    {"caption": "an eye-level full shot of a large elephant and a baby elephant standing in a watering hole on the left side is a small elephant with its head turned to the right of dry land, trees, and bushes", "image": "images/212735.png", "gaussian_padded_image": "padded_images/p_212735.png"}

Apr 04 '24 15:04 nighting0le01

Loading more than one image per row with imagefolder is not supported currently. You can subscribe to https://github.com/huggingface/datasets/issues/5760 to see when it will be.

Instead, you can load the dataset with Dataset.from_generator:

import json
import os
from datasets import Dataset, Value, Image, Features

def gen():
    with open("./dataset/metadata.jsonl") as f:
        for line in f:
            row = json.loads(line)
            # Yield one example per JSONL row, resolving both image paths
            # relative to the dataset directory.
            yield {
                "caption": row["caption"],
                "image": os.path.join("./dataset", row["image"]),
                "gaussian_padded_image": os.path.join("./dataset", row["gaussian_padded_image"]),
            }

features = Features({"caption": Value("string"), "image": Image(), "gaussian_padded_image": Image()})
dataset = Dataset.from_generator(gen, features=features)

(E.g., if you want to share this dataset on the Hub, you can call dataset.push_to_hub(...) afterward)
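Because the generator only passes file paths along, a wrong path is likely to surface later, when an image is actually read. A small pre-flight check can catch that earlier. The sketch below uses only the standard library; the function name find_missing_files and its parameters are hypothetical, and the default keys are the two path columns from the metadata.jsonl example in this thread.

```python
import json
import os


def find_missing_files(metadata_path, base_dir,
                       path_keys=("image", "gaussian_padded_image")):
    """Return (line_number, key, path) for every referenced file that is missing.

    Hypothetical helper for illustration; paths in the metadata are taken
    to be relative to `base_dir`.
    """
    missing = []
    with open(metadata_path, encoding="utf-8") as f:
        for line_number, line in enumerate(f, start=1):
            if not line.strip():
                continue  # skip blank lines
            row = json.loads(line)
            for key in path_keys:
                path = os.path.join(base_dir, row[key])
                if not os.path.isfile(path):
                    missing.append((line_number, key, path))
    return missing
```

An empty list from find_missing_files("./dataset/metadata.jsonl", "./dataset") means every referenced image exists on disk.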

Apr 05 '24 12:04 mariosasko

Hi, thanks for sharing this. I was actually also trying a WebDataset version of the data, and it didn't work. Could you share how I can create a Dataset object from the WebDataset format of this data?

Apr 05 '24 21:04 nighting0le01