Inconsistency between load_dataset and load_from_disk functionality

Open zzzzzec opened this issue 9 months ago • 1 comments

Issue Description

I've run into confusing behavior when using load_dataset and load_from_disk in the datasets library. Specifically, when working offline with the gsm8k dataset, I can load it from a local path:

import datasets
ds = datasets.load_dataset('/root/xxx/datasets/gsm8k', 'main')

output:

DatasetDict({
    train: Dataset({
        features: ['question', 'answer'],
        num_rows: 7473
    })
    test: Dataset({
        features: ['question', 'answer'],
        num_rows: 1319
    })
})

This works as expected. However, after processing the dataset (converting the answer format from #### to \boxed{}):

import re

import datasets

ds = datasets.load_dataset('/root/xxx/datasets/gsm8k', 'main')
ds_train = ds['train']
ds_test = ds['test']

def convert(sample):
    # Rewrite the final "#### <answer>" marker as "\boxed{<answer>}".
    solution = re.sub(r'####\s*(\S+)', r'\\boxed{\1}', sample['answer'])
    return {'problem': sample['question'], 'solution': solution}

# Drop the original columns so only 'problem' and 'solution' remain.
ds_train = ds_train.map(convert, remove_columns=['question', 'answer'])
ds_test = ds_test.map(convert, remove_columns=['question', 'answer'])
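To illustrate what the substitution does, here is a minimal standalone check using only the standard library. The sample text is made up for illustration and is not an actual gsm8k row, and the helper name convert_answer is hypothetical:

```python
import re


def convert_answer(answer: str) -> str:
    """Replace a '#### <value>' marker with '\\boxed{<value>}'."""
    # In the replacement template, '\\\\' (raw: \\) emits one literal backslash
    # and '\1' inserts the captured value.
    return re.sub(r'####\s*(\S+)', r'\\boxed{\1}', answer)


# Made-up sample in the gsm8k answer style (reasoning, then '#### <final answer>').
sample_answer = "Janet makes 18 dollars.\n#### 18"
converted = convert_answer(sample_answer)
print(converted)
```

Note that \S+ captures only up to the first whitespace character, so this assumes the final answer contains no spaces.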

I saved it using save_to_disk:

from datasets.dataset_dict import DatasetDict
data_dict = DatasetDict({
    'train': ds_train,
    'test': ds_test
})
data_dict.save_to_disk('/root/xxx/datasets/gsm8k-new')

But now I can only load it using load_from_disk:

from datasets import load_from_disk
new_ds = load_from_disk('/root/xxx/datasets/gsm8k-new')

output:

DatasetDict({
    train: Dataset({
        features: ['problem', 'solution'],
        num_rows: 7473
    })
    test: Dataset({
        features: ['problem', 'solution'],
        num_rows: 1319
    })
})

Attempting to use load_dataset produces unexpected results:

from datasets import load_dataset
new_ds = load_dataset('/root/xxx/datasets/gsm8k-new')

output:

DatasetDict({
    train: Dataset({
        features: ['_data_files', '_fingerprint', '_format_columns', '_format_kwargs', '_format_type', '_output_all_columns', '_split'],
        num_rows: 1
    })
    test: Dataset({
        features: ['_data_files', '_fingerprint', '_format_columns', '_format_kwargs', '_format_type', '_output_all_columns', '_split'],
        num_rows: 1
    })
})

Questions

  1. Why can't a dataset written with save_to_disk be loaded with load_dataset? In a small project it is easy enough to change every load_dataset call to load_from_disk, but in complex frameworks like TRL or lighteval, digging into the framework code to make that change is tedious and error-prone. Conversely, load_from_disk cannot load datasets downloaded directly from the Hub, so once you modify a dataset you are forced to pick one of the two functions. This creates an unnecessary dichotomy in the API and complicates workflows that involve modified datasets.
  2. What's the recommended approach for this use case? Should I manually process my gsm8k-new dataset to make it compatible with load_dataset? Is there a standard way to convert between these formats?

thanks~

zzzzzec avatar Apr 08 '25 03:04 zzzzzec

Hi ! you can find more info here: https://github.com/huggingface/datasets/issues/5044#issuecomment-1263714347

> What's the recommended approach for this use case? Should I manually process my gsm8k-new dataset to make it compatible with load_dataset? Is there a standard way to convert between these formats?

You can use push_to_hub() or to_parquet(), for example.

lhoestq avatar Apr 15 '25 12:04 lhoestq

Hi @zzzzzec & @lhoestq 👋

Thanks for raising and discussing this — I've submitted a patch that improves this exact scenario.

ArjunJagdale avatar Jun 28 '25 08:06 ArjunJagdale