A bug in the Dataset.to_json() function

Open LinglingGreat opened this issue 7 months ago • 2 comments

Describe the bug

When Dataset.to_json() is called with lines=False, the output file can be invalid JSON. The stored data should be a single list, but the file actually contains multiple lists, which causes an error when reading the data back. The reason is that to_json() writes to the file in several segments based on the batch size. This is not a problem when lines=True, but it is incorrect when lines=False: each segment is written as its own list, so the file ends up with multiple lists whenever len(dataset) > batch_size.
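The effect can be illustrated without the datasets library. The sketch below is only a simulation of batched writing, not the library's actual code: write_in_batches is a hypothetical helper, and the trailing newline after each batch is an assumption made for illustration.

```python
import json


def write_in_batches(records, batch_size):
    """Hypothetical stand-in for batched writing: every batch is
    serialized as its own JSON list and appended to the output."""
    chunks = []
    for start in range(0, len(records), batch_size):
        batch = records[start:start + batch_size]
        chunks.append(json.dumps(batch, indent=2) + "\n")
    return "".join(chunks)


records = [{"id": i} for i in range(5)]

# One batch covers all records: the output is a single valid JSON list.
single = write_in_batches(records, batch_size=10)
assert json.loads(single) == records

# Several batches: the output is several lists back to back,
# which json.loads rejects.
multi = write_in_batches(records, batch_size=2)
try:
    json.loads(multi)
    error_message = None
except json.JSONDecodeError as exc:
    error_message = exc.msg
```

With five records and a batch size of two, the simulated output holds three separate top-level lists, and parsing fails with the same kind of "Extra data" error reported in this issue.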

Steps to reproduce the bug

Run this code:

from datasets import load_dataset
import json

train_dataset = load_dataset("Anthropic/hh-rlhf", data_dir="harmless-base")["train"]
output_path = "./harmless-base_hftojs.json"
print(len(train_dataset))
train_dataset.to_json(output_path, lines=False, force_ascii=False, indent=2)

with open(output_path, encoding="utf-8") as f:
    data = json.loads(f.read())

It raises an error: json.decoder.JSONDecodeError: Extra data: line 4003 column 1 (char 1373709)

Extra square brackets appear in the output file: [image]
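For a file that has already been written this way, one possible workaround is to decode the concatenated lists one at a time and merge them. This is a sketch using only the standard library; load_concatenated_lists is a hypothetical helper name, and it assumes every top-level value in the file is a JSON list.

```python
import json


def load_concatenated_lists(text):
    """Parse a string holding several JSON lists written back to back
    and return their items merged into one list.
    Assumption: every top-level value is a list."""
    decoder = json.JSONDecoder()
    merged = []
    pos = 0
    while pos < len(text):
        # raw_decode does not skip leading whitespace, so do it here.
        while pos < len(text) and text[pos].isspace():
            pos += 1
        if pos >= len(text):
            break
        value, pos = decoder.raw_decode(text, pos)
        merged.extend(value)
    return merged


broken = '[\n  {"a": 1}\n]\n[\n  {"a": 2},\n  {"a": 3}\n]\n'
rows = load_concatenated_lists(broken)
```

This only recovers the data after the fact; it does not change what to_json() writes.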

Expected behavior

The output file contains a single JSON list, and the code runs without error.

Environment info

datasets=2.20.0

LinglingGreat, Jul 10 '24 09:07