
Value Error when preparing openwebtext

Open SinoKiwi opened this issue 1 year ago • 2 comments

When I run prepare.py for openwebtext, I always get a `ValueError: could not broadcast input array from shape (7782,) into shape (5605,)` at the same step of the process, while writing train.bin. What could I do?
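For context, this error means the preallocated destination array is smaller than the data being written into it. A minimal sketch with made-up sizes (not the actual openwebtext numbers) reproduces the same ValueError:

```python
import numpy as np

# Destination has room for 5 tokens, but the batch holds 8 (hypothetical sizes).
arr = np.zeros(5, dtype=np.uint16)
batch = np.arange(8, dtype=np.uint16)

try:
    # Mirrors the write in prepare.py: arr[idx : idx + len(arr_batch)] = arr_batch
    arr[0 : len(batch)] = batch
except ValueError as e:
    print(e)  # could not broadcast input array from shape (8,) into shape (5,)
```

The slice `arr[0:8]` silently clips to the array's actual length (5), so NumPy refuses to broadcast the 8-element batch into the 5-element slot, which is exactly the shape mismatch reported above.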

SinoKiwi avatar Apr 03 '23 09:04 SinoKiwi

I have a similar error: `ValueError: could not broadcast input array from shape (8665630,) into shape (4537790,)` on line 65, `arr[idx : idx + len(arr_batch)] = arr_batch`.

AndrewGBalaschak avatar Apr 27 '23 03:04 AndrewGBalaschak

I've had the same problem (running on Windows 10 with Python 3.10.9). It seems to occur because the `arr_len` calculated here https://github.com/karpathy/nanoGPT/blob/master/data/openwebtext/prepare.py#L53 is smaller than `idx`, which is the sum of the batch lengths.

I've solved it by running the loop twice; the first pass just calculates `arr_len`.

    # concatenate all the ids in each dataset into one large file we can use for training
    for split, dset in tokenized.items():
        # np.sum gives ~4537790 here, but it should be ~9035582489
        # arr_len = np.sum(dset['len'])
        filename = os.path.join(os.path.dirname(__file__), f'{split}.bin')
        dtype = np.uint16 # (can do since enc.max_token_value == 50256 is < 2**16)
        total_batches = 1024
        
        # calculate by iterating over the shards
        arr_len = 0
        for batch_idx in tqdm(range(total_batches), desc=f'calculating size of {filename}'):
            # Batch together samples for faster write
            batch = dset.shard(num_shards=total_batches, index=batch_idx, contiguous=True).with_format('numpy')
            arr_batch = np.concatenate(batch['ids'])
            arr_len += len(arr_batch)
        print(f'calculated size {arr_len}')
        arr = np.memmap(filename, dtype=dtype, mode='w+', shape=(arr_len,))

        idx = 0
        for batch_idx in tqdm(range(total_batches), desc=f'writing {filename}'):
            # Batch together samples for faster write
            batch = dset.shard(num_shards=total_batches, index=batch_idx, contiguous=True).with_format('numpy')
            arr_batch = np.concatenate(batch['ids'])
            # Write into mmap
            arr[idx : idx + len(arr_batch)] = arr_batch
            idx += len(arr_batch)
        arr.flush()
        print(f'idx size {idx}')

Not super efficient, but it worked. Now, as the comments suggest, I have train.bin ~17 GB and val.bin ~8.5 MB. I haven't started the training yet, though.
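If the root cause is what it looks like, a lighter fix may be possible: on Windows, NumPy's default platform integer is 32-bit (for NumPy < 2.0), so `np.sum(dset['len'])` can silently wrap around once the total token count exceeds 2**31 - 1, which would explain why the computed `arr_len` is far smaller than the real total. A sketch demonstrating the wraparound (forcing `dtype=np.int32` here only to simulate the Windows default accumulator on any platform):

```python
import numpy as np

# Three documents of 2**30 tokens each; the true total (3 * 2**30 = 3221225472)
# does not fit in a signed 32-bit integer.
lengths = np.array([2**30, 2**30, 2**30], dtype=np.int32)

wrong = np.sum(lengths, dtype=np.int32)   # wraps around: -1073741824
right = np.sum(lengths, dtype=np.uint64)  # correct: 3221225472

print(wrong, right)
```

So instead of a second loop, the one-line change in prepare.py would be `arr_len = np.sum(dset['len'], dtype=np.uint64)`.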

giannicic avatar Apr 29 '23 14:04 giannicic