Add flatten_indices option to save_to_disk method

Open ArjunJagdale opened this issue 2 months ago • 1 comments

Added flatten_indices parameter to control index flattening during dataset saving. Solves #7861

This PR introduces a new optional argument, flatten_indices, to the save_to_disk methods in both Dataset and DatasetDict.

The change allows users to skip the expensive index-flattening step when saving datasets that already use index mappings (e.g., after filter() or shuffle()), resulting in significant speed improvements for large datasets while maintaining backward compatibility.

While the absolute difference is small at 100K rows, the improvement is expected to grow with larger datasets (millions of rows).

This patch gives users control: they can disable flattening when they don’t need it, avoiding unnecessary rewrites.
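To illustrate the idea behind the option, here is a minimal toy model of a dataset that keeps an indices mapping after filtering. It is not the datasets library code: `ToyDataset` and its `save` method are hypothetical names, and what gets written when `flatten_indices=False` is an assumption made purely for this sketch.

```python
from typing import Callable, List, Optional


class ToyDataset:
    """Toy stand-in for a table plus an optional indices mapping.

    Hypothetical illustration only; this is NOT datasets.Dataset.
    """

    def __init__(self, rows: List[str], indices: Optional[List[int]] = None):
        self.rows = rows        # underlying storage, never reordered in place
        self.indices = indices  # None means the rows are already in final order

    def _positions(self):
        # Positions in the underlying storage that are currently visible.
        return range(len(self.rows)) if self.indices is None else self.indices

    def __len__(self) -> int:
        return len(self._positions())

    def __getitem__(self, i: int) -> str:
        return self.rows[self._positions()[i]]

    def filter(self, predicate: Callable[[str], bool]) -> "ToyDataset":
        # Cheap: record which positions survive instead of copying rows.
        kept = [j for j in self._positions() if predicate(self.rows[j])]
        return ToyDataset(self.rows, kept)

    def flatten_indices(self) -> "ToyDataset":
        # Expensive: materialize a new row list in mapped order.
        if self.indices is None:
            return self
        return ToyDataset([self.rows[j] for j in self.indices])

    def save(self, flatten_indices: bool = True) -> dict:
        # Returns the payload that would be written instead of touching disk.
        # Assumption: with flatten_indices=False the mapping is kept as-is.
        ds = self.flatten_indices() if flatten_indices else self
        return {"rows": ds.rows, "indices": ds.indices}


ds = ToyDataset([f"sample {i}" for i in range(6)])
evens = ds.filter(lambda s: int(s.split()[1]) % 2 == 0)

flat = evens.save(flatten_indices=True)   # rows are copied in mapped order
lazy = evens.save(flatten_indices=False)  # original rows plus the mapping
```

In this toy model the flattened save copies every surviving row, while the unflattened save reuses the original storage and only carries the mapping along. That row copy is the cost the new argument is meant to let users skip.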

@lhoestq WDYT?

Nov 12 '25 19:11 ArjunJagdale

As suggested by @KCKawalkar, I used the script below to test:

BEFORE PATCH - TEST.PY:

from datasets import Dataset
import time

dataset = Dataset.from_dict({'text': [f'sample {i}' for i in range(100000)]})

# Baseline save (no indices)
start = time.time()
dataset.save_to_disk('baseline')
baseline_time = time.time() - start

# Filtered save (creates indices)
filtered = dataset.filter(lambda x: True)
start = time.time()
filtered.save_to_disk('filtered')
filtered_time = time.time() - start

print(f"Baseline: {baseline_time:.3f}s")
print(f"Filtered: {filtered_time:.3f}s")
print(f"Slowdown: {(filtered_time/baseline_time-1)*100:.1f}%")

RESULTS:

@ArjunJagdale ➜ /workspaces/datasets (main) $ python test_arjun.py
Saving the dataset (1/1 shards): 100%|█████████████████████████████████████████████████████████████████████████████████████████████| 100000/100000 [00:00<00:00, 3030654.07 examples/s]
Filter: 100%|███████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 100000/100000 [00:00<00:00, 576296.61 examples/s]
Saving the dataset (1/1 shards): 100%|██████████████████████████████████████████████████████████████████████████████████████████████| 100000/100000 [00:00<00:00, 310565.19 examples/s]
Baseline: 0.035s
Filtered: 0.323s
Slowdown: 813.4%

AFTER PATCH - TEST.PY:

from datasets import Dataset
import time

# Create dataset
dataset = Dataset.from_dict({'text': [f'sample {i}' for i in range(100000)]})

# Baseline save (no indices)
start = time.time()
dataset.save_to_disk('baseline')
baseline_time = time.time() - start

# Filtered save (creates indices)
filtered = dataset.filter(lambda x: True)
start = time.time()
filtered.save_to_disk('filtered', flatten_indices=False)
filtered_time = time.time() - start

print(f"Baseline: {baseline_time:.3f}s")
print(f"Filtered: {filtered_time:.3f}s")
print(f"Slowdown: {(filtered_time/baseline_time-1)*100:.1f}%")

RESULTS:

@ArjunJagdale ➜ /workspaces/datasets (main) $ python test_arjun.py
Saving the dataset (1/1 shards): 100%|█████████████████████████████████████████████████████████████████████████████████████████████| 100000/100000 [00:00<00:00, 3027482.12 examples/s]
Filter: 100%|███████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 100000/100000 [00:00<00:00, 468901.89 examples/s]
Saving the dataset (1/1 shards): 100%|██████████████████████████████████████████████████████████████████████████████████████████████| 100000/100000 [00:00<00:00, 324036.36 examples/s]
Baseline: 0.036s
Filtered: 0.310s
Slowdown: 771.1%
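
The two filtered saves above differ by about 4% (0.323s vs 0.310s), which is small enough that single-run timing noise could plausibly account for it. A minimal sketch of a repeated-measurement helper follows. `time_repeated` and `slowdown_percent` are hypothetical helper names, and the lambdas are stand-in workloads; in the real benchmark they would be `save_to_disk` calls, each writing to a fresh directory.

```python
import statistics
import time
from typing import Callable, List


def time_repeated(fn: Callable[[], object], repeats: int = 5) -> List[float]:
    """Run fn `repeats` times and return each wall-clock duration in seconds."""
    durations = []
    for _ in range(repeats):
        start = time.perf_counter()
        fn()
        durations.append(time.perf_counter() - start)
    return durations


def slowdown_percent(candidate: float, baseline: float) -> float:
    """Same formula as the test script: (candidate / baseline - 1) * 100."""
    return (candidate / baseline - 1) * 100


# Stand-in workloads (assumption): replace with the save_to_disk calls to compare.
baseline_runs = time_repeated(lambda: sum(range(10_000)))
candidate_runs = time_repeated(lambda: sum(range(100_000)))

baseline_median = statistics.median(baseline_runs)
candidate_median = statistics.median(candidate_runs)
print(f"Baseline median:  {baseline_median:.6f}s")
print(f"Candidate median: {candidate_median:.6f}s")
print(f"Slowdown: {slowdown_percent(candidate_median, baseline_median):.1f}%")
```

Reporting the median over several runs, ideally on a larger dataset, would make it easier to tell whether `flatten_indices=False` changes the save time.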

Nov 12 '25 19:11 ArjunJagdale