load_dataset with multiple jsonlines files interprets datastructure too early
Describe the bug
Likely related to #6460.
Using datasets.load_dataset("json", data_dir=...) with multiple .jsonl files raises an error if one of the files (maybe the first file?) has a column that is empty in every row.
Steps to reproduce the bug
real world example:
The data is available in this PR branch. Because my files are chunked by month, some months happen to contain only empty values ([]) for some columns. Otherwise the structure is the same in every file.
from datasets import load_dataset
ds = load_dataset("json", data_dir="./data/annotated/api")
You get a long error trace; somewhere in the middle it says something like:
TypeError: Couldn't cast array of type struct<id: int64, src: string, ctype: string, channel: int64, sampler: struct<filter: string, wrap: string, vflip: string, srgb: string, internal: string>, published: int64> to null
toy example: (on request)
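A toy reproduction might look like the sketch below. It is an assumption, not the reporter's data: the file names, the `inputs` column, and the helpers `write_toy_files` and `try_load` are made up for illustration, and whether the load fails in the same way may depend on the installed datasets version.

```python
import json
from pathlib import Path


def write_toy_files(data_dir):
    """Write two JSON Lines files; only the second has non-empty `inputs`."""
    data_dir = Path(data_dir)
    data_dir.mkdir(parents=True, exist_ok=True)
    months = {
        # Hypothetical month in which the column happens to be empty everywhere.
        "2023-01.jsonl": [{"id": "a", "inputs": []}, {"id": "b", "inputs": []}],
        # Same structure, but the list now holds structs.
        "2023-02.jsonl": [{"id": "c", "inputs": [{"id": 1, "src": "x.png"}]}],
    }
    for name, rows in months.items():
        with (data_dir / name).open("w", encoding="utf-8") as handle:
            for row in rows:
                handle.write(json.dumps(row) + "\n")
    return sorted(path.name for path in data_dir.glob("*.jsonl"))


def try_load(data_dir):
    """Attempt the load from the report; requires the `datasets` package."""
    from datasets import load_dataset

    return load_dataset("json", data_dir=str(data_dir))
```

Calling `write_toy_files("./toy")` followed by `try_load("./toy")` should exercise the same code path as the real data, since the file with the empty column sorts first.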
Expected behavior
Some suggestions
- give a better error message to the user
- consider all files before deciding on a data structure for a given column.
- if you encounter a new structure, and can't cast that to null, replace the null-hypothesis. (maybe something for pyarrow)
As a workaround, I have lazily implemented the following (essentially suggestion 2):
import os
import jsonlines
import datasets

api_files = os.listdir("./data/annotated/api")
api_files = [f"./data/annotated/api/{f}" for f in api_files]
api_file_contents = []
for f in api_files:
    with jsonlines.open(f) as reader:
        for obj in reader:
            api_file_contents.append(obj)
ds = datasets.Dataset.from_list(api_file_contents)
This works fine for my use case, but it loads every record into memory first, so it is potentially slower and less memory efficient for really large datasets (where an entirely empty column is unlikely to occur in the first place).
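To find out which files are affected before loading anything, a small standard-library check can list the columns that are empty in every row of a file. This is only a sketch: the helper names `is_empty` and `empty_columns_by_file` are hypothetical, and "empty" is assumed to mean JSON null, `[]`, or `{}`.

```python
import json
from pathlib import Path


def is_empty(value):
    """True for JSON null and for empty lists or objects."""
    return value is None or value == [] or value == {}


def empty_columns_by_file(data_dir, pattern="*.jsonl"):
    """Map each file name to the columns that are empty in every row of that file."""
    report = {}
    for path in sorted(Path(data_dir).glob(pattern)):
        always_empty = None  # columns that were empty in all rows seen so far
        with path.open(encoding="utf-8") as handle:
            for line in handle:
                if not line.strip():
                    continue  # skip blank lines
                row = json.loads(line)
                empty_here = {key for key, value in row.items() if is_empty(value)}
                if always_empty is None:
                    always_empty = empty_here
                else:
                    always_empty &= empty_here
        if always_empty:
            report[path.name] = sorted(always_empty)
    return report
```

Files that show up in the report are the ones for which type inference has nothing to work with in those columns.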
Environment info
- datasets version: 2.20.0
- Platform: Windows-10-10.0.19041-SP0
- Python version: 3.9.4
- huggingface_hub version: 0.23.4
- PyArrow version: 16.1.0
- Pandas version: 2.2.2
- fsspec version: 2023.10.0
I’ll take a look
Possible definitions of done for this issue:
- A fix so you can load your dataset specifically
- A general fix for datasets similar to this in the datasets library
Option 1 is trivial. I think option 2 requires significant changes to the library.
Since you outlined something akin to option 2 in Expected behavior I'm assuming that's what you'd like to see done. Is that right?
In the meantime, here's a solution for option 1:
import datasets
data_dir = './data/annotated/api'
features = datasets.Features({
    'id': datasets.Value(dtype='string'),
    'name': datasets.Value(dtype='string'),
    'author': datasets.Value(dtype='string'),
    'description': datasets.Value(dtype='string'),
    'tags': datasets.Sequence(feature=datasets.Value(dtype='string'), length=-1),
    'likes': datasets.Value(dtype='int64'),
    'viewed': datasets.Value(dtype='int64'),
    'published': datasets.Value(dtype='int64'),
    'date': datasets.Value(dtype='string'),
    'time_retrieved': datasets.Value(dtype='string'),
    'image_code': datasets.Value(dtype='string'),
    'image_inputs': [{
        'channel': datasets.Value(dtype='int64'),
        'ctype': datasets.Value(dtype='string'),
        'id': datasets.Value(dtype='int64'),
        'published': datasets.Value(dtype='int64'),
        'sampler': {
            'filter': datasets.Value(dtype='string'),
            'internal': datasets.Value(dtype='string'),
            'srgb': datasets.Value(dtype='string'),
            'vflip': datasets.Value(dtype='string'),
            'wrap': datasets.Value(dtype='string'),
        },
        'src': datasets.Value(dtype='string'),
    }],
    'common_code': datasets.Value(dtype='string'),
    'sound_code': datasets.Value(dtype='string'),
    'sound_inputs': [{
        'channel': datasets.Value(dtype='int64'),
        'ctype': datasets.Value(dtype='string'),
        'id': datasets.Value(dtype='int64'),
        'published': datasets.Value(dtype='int64'),
        'sampler': {
            'filter': datasets.Value(dtype='string'),
            'internal': datasets.Value(dtype='string'),
            'srgb': datasets.Value(dtype='string'),
            'vflip': datasets.Value(dtype='string'),
            'wrap': datasets.Value(dtype='string'),
        },
        'src': datasets.Value(dtype='string'),
    }],
    'buffer_a_code': datasets.Value(dtype='string'),
    'buffer_a_inputs': [{
        'channel': datasets.Value(dtype='int64'),
        'ctype': datasets.Value(dtype='string'),
        'id': datasets.Value(dtype='int64'),
        'published': datasets.Value(dtype='int64'),
        'sampler': {
            'filter': datasets.Value(dtype='string'),
            'internal': datasets.Value(dtype='string'),
            'srgb': datasets.Value(dtype='string'),
            'vflip': datasets.Value(dtype='string'),
            'wrap': datasets.Value(dtype='string'),
        },
        'src': datasets.Value(dtype='string'),
    }],
    'buffer_b_code': datasets.Value(dtype='string'),
    'buffer_b_inputs': [{
        'channel': datasets.Value(dtype='int64'),
        'ctype': datasets.Value(dtype='string'),
        'id': datasets.Value(dtype='int64'),
        'published': datasets.Value(dtype='int64'),
        'sampler': {
            'filter': datasets.Value(dtype='string'),
            'internal': datasets.Value(dtype='string'),
            'srgb': datasets.Value(dtype='string'),
            'vflip': datasets.Value(dtype='string'),
            'wrap': datasets.Value(dtype='string'),
        },
        'src': datasets.Value(dtype='string'),
    }],
    'buffer_c_code': datasets.Value(dtype='string'),
    'buffer_c_inputs': [{
        'channel': datasets.Value(dtype='int64'),
        'ctype': datasets.Value(dtype='string'),
        'id': datasets.Value(dtype='int64'),
        'published': datasets.Value(dtype='int64'),
        'sampler': {
            'filter': datasets.Value(dtype='string'),
            'internal': datasets.Value(dtype='string'),
            'srgb': datasets.Value(dtype='string'),
            'vflip': datasets.Value(dtype='string'),
            'wrap': datasets.Value(dtype='string'),
        },
        'src': datasets.Value(dtype='string'),
    }],
    'buffer_d_code': datasets.Value(dtype='string'),
    'buffer_d_inputs': [{
        'channel': datasets.Value(dtype='int64'),
        'ctype': datasets.Value(dtype='string'),
        'id': datasets.Value(dtype='int64'),
        'published': datasets.Value(dtype='int64'),
        'sampler': {
            'filter': datasets.Value(dtype='string'),
            'internal': datasets.Value(dtype='string'),
            'srgb': datasets.Value(dtype='string'),
            'vflip': datasets.Value(dtype='string'),
            'wrap': datasets.Value(dtype='string'),
        },
        'src': datasets.Value(dtype='string'),
    }],
    'cube_a_code': datasets.Value(dtype='string'),
    'cube_a_inputs': [{
        'channel': datasets.Value(dtype='int64'),
        'ctype': datasets.Value(dtype='string'),
        'id': datasets.Value(dtype='int64'),
        'published': datasets.Value(dtype='int64'),
        'sampler': {
            'filter': datasets.Value(dtype='string'),
            'internal': datasets.Value(dtype='string'),
            'srgb': datasets.Value(dtype='string'),
            'vflip': datasets.Value(dtype='string'),
            'wrap': datasets.Value(dtype='string'),
        },
        'src': datasets.Value(dtype='string'),
    }],
    'thumbnail': datasets.Value(dtype='string'),
    'access': datasets.Value(dtype='string'),
    'license': datasets.Value(dtype='string'),
    'functions': datasets.Sequence(feature=datasets.Sequence(feature=datasets.Value(dtype='int64'), length=-1), length=-1),
    'test': datasets.Value(dtype='string'),
})
datasets.load_dataset('json', data_dir=data_dir, features=features)
As pointed out by @hvaara, you can define explicit features so that you avoid the datasets library having to infer them (from the first few samples).
Note that the feature inference is done from the first few samples of JSON-Lines on purpose, so that the entire data does not need to be parsed twice (it would be inefficient for very large datasets).
I understand this. But can there be a solution that doesn't require the end user to write this schema by hand (in my case there are some fields that contain a nested structure)?
Maybe offer an option to infer the schema automatically before loading the dataset. Or perhaps trigger such a method when this error arises?
Is this "first few files" heuristic accessible via kwargs, perhaps? Maybe an error that says `Could not cast some structure into feature schema, consider increasing schema_files to a large number or all`.
There might be efficient implementations to solve this problem for larger datasets.
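As one possible shape for such an inference pass, the sketch below streams every file once, line by line, and merges the observed types so that a null or empty-list placeholder is replaced as soon as a concrete type shows up. It is not how the datasets library works internally; the descriptor format and the names `infer`, `merge`, `infer_schema`, and `to_features` are hypothetical. The conversion step uses only `datasets.Features`, `datasets.Value`, and `datasets.Sequence` as they appear earlier in this thread, and its `null` fallback for columns that are empty everywhere is an assumption.

```python
import json
from pathlib import Path

NULL = "null"  # placeholder meaning "no evidence yet" (None or items of an empty list)


def infer(value):
    """Return a small type descriptor for one decoded JSON value."""
    if value is None:
        return NULL
    if isinstance(value, bool):  # bool must be checked before int
        return "bool"
    if isinstance(value, int):
        return "int64"
    if isinstance(value, float):
        return "float64"
    if isinstance(value, str):
        return "string"
    if isinstance(value, list):
        item = NULL
        for element in value:
            item = merge(item, infer(element))
        return [item]  # a list is described by its single item type
    if isinstance(value, dict):
        return {key: infer(val) for key, val in value.items()}
    raise TypeError(f"unsupported JSON value: {value!r}")


def merge(a, b):
    """Combine two descriptors; a concrete type always replaces the placeholder."""
    if a == NULL:
        return b
    if b == NULL or a == b:
        return a
    if isinstance(a, str) and isinstance(b, str) and {a, b} == {"int64", "float64"}:
        return "float64"
    if isinstance(a, list) and isinstance(b, list):
        return [merge(a[0], b[0])]
    if isinstance(a, dict) and isinstance(b, dict):
        keys = list(a) + [key for key in b if key not in a]
        return {key: merge(a.get(key, NULL), b.get(key, NULL)) for key in keys}
    raise TypeError(f"incompatible types: {a!r} vs {b!r}")


def infer_schema(data_dir, pattern="*.jsonl"):
    """Scan all matching files and return the merged descriptor of their rows."""
    schema = {}
    for path in sorted(Path(data_dir).glob(pattern)):
        with path.open(encoding="utf-8") as handle:
            for line in handle:
                if line.strip():
                    schema = merge(schema, infer(json.loads(line)))
    return schema


def to_features(schema):
    """Convert a descriptor into datasets.Features; requires the `datasets` package."""
    import datasets

    def convert(node):
        if isinstance(node, dict):
            return {key: convert(val) for key, val in node.items()}
        if isinstance(node, list):
            inner = convert(node[0])
            # Lists of structs use the plain list notation, as in the features above.
            if isinstance(inner, dict):
                return [inner]
            return datasets.Sequence(feature=inner)
        return datasets.Value(dtype=node)  # assumption: "null" if never observed

    return datasets.Features(convert(schema))
```

The result of `to_features(infer_schema("./data/annotated/api"))` could then be passed as `features=` to `load_dataset`. The data is still parsed twice, but only one row is held in memory at a time during the first pass.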
@Vipitis raised a good point on the HF Discord regarding the use of a dataset script to provide the schema during initialization. Using this approach requires setting trust_remote_code=True, which is not allowed in certain evaluation frameworks.
For cases where using a dataset script is acceptable, would it be helpful to add functionality to the library (not necessarily in load_dataset) that can automatically discover the feature definitions and output them, so you don't have to manually define them?
Alternatively, for situations where features need to be known at load-time without using a dataset script, another option could be loading the dataset schema from a file format that doesn't require trust_remote_code=True.
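One way this could look is a JSON file that sits next to the data and holds the schema as plain data. The sketch below assumes the file contains the dictionary form that `Features.to_dict()` produces and `Features.from_dict()` accepts in recent datasets releases; the helper names and the idea of a separate schema file are hypothetical, not an existing convention of the library.

```python
import json
from pathlib import Path


def save_schema(schema_dict, path):
    """Write a plain-dict schema to disk as JSON; no executable code is involved."""
    Path(path).write_text(json.dumps(schema_dict, indent=2), encoding="utf-8")


def load_schema(path):
    """Read the schema back as a plain dict."""
    return json.loads(Path(path).read_text(encoding="utf-8"))


def load_with_schema(data_dir, schema_path):
    """Load JSON Lines files with a schema read from a file; requires `datasets`."""
    import datasets

    # Assumption: the file holds the output of Features.to_dict().
    features = datasets.Features.from_dict(load_schema(schema_path))
    return datasets.load_dataset("json", data_dir=data_dir, features=features)
```

Because the schema is only data, reading it does not need trust_remote_code=True.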