seed2 shuffle_idx.npy corrupted

Open efittschen opened this issue 4 months ago • 0 comments

Hi, I think the shuffle_idx.npy for seed2 is corrupted.

Can someone else reproduce this? (A copy of this issue is also posted on Hugging Face: https://huggingface.co/datasets/EleutherAI/pile-preshuffled-seeds/discussions/1)

Expected: The shuffle_idx file is a permutation, so every index should appear once.

Observed: All entries from index 12354528 through index 162164735 (inclusive) are zero, which is about 92% of the entries.

```python
import numpy as np

path = "seed2/pile_20B_tokenizer_text_document_train_0_indexmap_147164160ns_2048sl_2s_shuffle_idx.npy"

shuffle = np.load(path, allow_pickle=True)
u, c = np.unique(shuffle, return_counts=True)
print(u[:10])
# out: [ 0 12 21 25 38 44 68 71 82 90]
print(c[:10])
# out: [149810208 1 1 1 ...

first, last = np.where(shuffle == 0)[0][[0, -1]]

np.unique(shuffle[first: last+1], return_counts=True)
# out: (array([0], dtype=uint32), array([149810207]))
```
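
For anyone trying to reproduce this, here is a minimal sketch of a more direct permutation check. It counts how many indices are missing, duplicated, or out of range instead of sorting the whole array. The helper name `permutation_report` is hypothetical (it is not part of pythia or NumPy), and the sketch assumes the file holds a 1-D integer array that should be a permutation of 0..n-1.

```python
import numpy as np


def permutation_report(idx: np.ndarray) -> dict:
    """Summarize how far a 1-D integer array is from a permutation of 0..n-1.

    Hypothetical helper for illustration only.
    """
    n = idx.size
    # Values outside [0, n) can never belong to a permutation of 0..n-1.
    in_range = (idx >= 0) & (idx < n)
    # Count occurrences of every valid index; minlength keeps absent ones as 0.
    counts = np.bincount(idx[in_range].astype(np.intp), minlength=n)
    missing = int(np.count_nonzero(counts == 0))
    duplicated = int(np.count_nonzero(counts > 1))
    out_of_range = int(n - np.count_nonzero(in_range))
    return {
        "n": int(n),
        "missing": missing,            # indices that never appear
        "duplicated": duplicated,      # distinct indices that appear more than once
        "out_of_range": out_of_range,  # entries outside [0, n)
        # n entries, all in range, none missing => each appears exactly once
        "is_permutation": missing == 0 and out_of_range == 0,
    }


# Small synthetic examples (not the real file):
good = np.random.default_rng(0).permutation(10).astype(np.uint32)
bad = np.array([3, 1, 0, 0, 0], dtype=np.uint32)  # zero-filled tail
print(permutation_report(good))
print(permutation_report(bad))
```

On the real file this could be called as `permutation_report(np.load(path, mmap_mode="r"))`, assuming the file is a plain (non-pickled) array. Note that the counts array takes roughly 8 bytes per entry, so this needs a bit over 1 GB of memory for an array of this size.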

efittschen • Jul 24 '25 21:07