nanoGPT
ValueError when preparing openwebtext
When I run prepare.py for openwebtext, I always get a `ValueError: could not broadcast input array from shape (7782,) into shape (5605,)` at the step where train.bin is written. What can I do?
I have a similar error: `ValueError: could not broadcast input array from shape (8665630,) into shape (4537790,)` on line 65, `arr[idx : idx + len(arr_batch)] = arr_batch`.
I've had the same problem (running on Windows 10 with Python 3.10.9). It seems to occur because the `arr_len` calculated here https://github.com/karpathy/nanoGPT/blob/master/data/openwebtext/prepare.py#L53 is smaller than the final `idx`, which is the sum of the batch lengths.
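My guess (an assumption, not something I've verified in the thread) is that on Windows with older NumPy the default platform integer is 32-bit, so `np.sum` accumulates the token counts in int32 and silently wraps around. A minimal sketch that reproduces the effect on any platform by forcing a 32-bit accumulator:

```python
import numpy as np

# Forcing a 32-bit accumulator mimics what a 32-bit default
# platform int would do implicitly: the sum wraps around silently
# once the total exceeds 2**31 - 1.
lengths = np.array([2**31 - 1, 1])
print(np.sum(lengths, dtype=np.int32))   # -2147483648 (wrapped around)
print(np.sum(lengths, dtype=np.uint64))  # 2147483648  (correct total)
```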
I solved it by running the loop twice, the first pass just to calculate `arr_len`:
```python
# concatenate all the ids in each dataset into one large file we can use for training
for split, dset in tokenized.items():
    # this number is ~4537790, but it should be ~9035582489
    # arr_len = np.sum(dset['len'])
    filename = os.path.join(os.path.dirname(__file__), f'{split}.bin')
    dtype = np.uint16  # (can do since enc.max_token_value == 50256 is < 2**16)
    total_batches = 1024

    # first pass: calculate arr_len by iterating over the shards
    arr_len = 0
    for batch_idx in tqdm(range(total_batches), desc=f'calculate size {filename}'):
        # batch together samples for a faster pass
        batch = dset.shard(num_shards=total_batches, index=batch_idx, contiguous=True).with_format('numpy')
        arr_batch = np.concatenate(batch['ids'])
        arr_len += len(arr_batch)
    print(f'calculated size {arr_len}')

    # second pass: write the tokens into the memory-mapped file
    arr = np.memmap(filename, dtype=dtype, mode='w+', shape=(arr_len,))
    idx = 0
    for batch_idx in tqdm(range(total_batches), desc=f'writing {filename}'):
        # batch together samples for faster write
        batch = dset.shard(num_shards=total_batches, index=batch_idx, contiguous=True).with_format('numpy')
        arr_batch = np.concatenate(batch['ids'])
        # write into the mmap
        arr[idx : idx + len(arr_batch)] = arr_batch
        idx += len(arr_batch)
    arr.flush()
    print(f'idx size {idx}')
```
Not super efficient, but it worked. As stated in the comments in prepare.py, I now have train.bin at ~17 GB and val.bin at ~8.5 MB. I haven't started the training yet, though.
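If the root cause really is the 32-bit accumulator, a simpler single-pass fix would be to force a 64-bit accumulator in the original size calculation (I believe this is what prepare.py on master now does):

```python
# single-pass alternative: force a 64-bit accumulator so the total
# token count can't overflow on platforms with a 32-bit default int
arr_len = np.sum(dset['len'], dtype=np.uint64)
```

That avoids iterating over all 1024 shards twice, at the cost of trusting the precomputed `len` column.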