IndexError in OLMo-7B pre-training dataset
❓ The question
Hello, while running code to check which sequences appear at a specific batch index of the OLMo-7B pre-training dataset, I got `IndexError: 925801835 is out of bounds for dataset of size 925201012`, so I would like to ask about it.
1. Preparation
- The `.npy` files listed under `data.paths` in `./configs/official/OLMo-7B.yaml` were saved to disk with the `wget` command.
- I changed `data.paths` in `OLMo-7B.yaml` to the local paths where I downloaded the data.
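As a first sanity check on this step, one can verify that every file referenced by the config exists locally and is non-empty. A minimal sketch, assuming the config's `data.paths` already points at the local copies (the `olmo.config.TrainConfig` import follows the OLMo repo's layout):

```python
import os

from olmo.config import TrainConfig

cfg = TrainConfig.load("OLMo_config/OLMo-7B.yaml")

# Every .npy file the dataset will be built from should exist and be non-empty.
for path in cfg.data.paths:
    if not os.path.exists(path):
        print(f"MISSING: {path}")
    elif os.path.getsize(path) == 0:
        print(f"EMPTY:   {path}")
```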
2. Executing the code
- Here is the code I used:
```python
import os

import numpy as np
import torch
from cached_path import cached_path
from transformers import AutoTokenizer

from olmo.config import TrainConfig
from olmo.data import build_memmap_dataset

FILE_PATH = __file__  # defined elsewhere in my original script; points at this file

data_order_file_path = cached_path(
    "https://olmo-checkpoints.org/ai2-llm/olmo-medium/wvc30anm/train_data/global_indices.npy"
)
train_config_path = os.path.join(os.path.dirname(FILE_PATH), "OLMo_config/OLMo-7B.yaml")
cfg = TrainConfig.load(train_config_path)
batch_size = cfg.global_train_batch_size
global_indices = np.memmap(data_order_file_path, mode="r+", dtype=np.uint32)
dataset = build_memmap_dataset(cfg, cfg.data)


def get_batch_instances(batch_idx: int) -> list[list[int]]:
    batch_start = batch_idx * batch_size
    batch_end = (batch_idx + 1) * batch_size
    batch_indices = global_indices[batch_start:batch_end]
    batch_instances = []
    for index in batch_indices:
        token_ids = dataset[index]["input_ids"].tolist()
        batch_instances.append(token_ids)
    return batch_instances


def main():
    steps = [1]
    results = [False for _ in range(len(steps))]
    tokenizer = AutoTokenizer.from_pretrained("allenai/OLMo-7B", trust_remote_code=True)
    for i, step in enumerate(steps):
        # list of batch_size (2048) token-ID sequences
        batch = torch.tensor(get_batch_instances(batch_idx=step))
        batch_in_text = tokenizer.batch_decode(batch, skip_special_tokens=True)
        for sequence in batch_in_text:
            if "apple" in sequence.lower():
                results[i] = True
                break  # found a match; no need to scan the rest of the batch
    print(results)


if __name__ == "__main__":
    main()
```
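Before decoding anything, a quick consistency check between the training-order file and the locally built dataset already predicts the failure: if the largest value in `global_indices` is not smaller than `len(dataset)`, some batch must eventually raise this `IndexError`. A minimal sketch, reusing the `dataset` and `global_indices` objects built above:

```python
# The training-order file should never reference an instance beyond the
# end of the locally built dataset. (max() over ~1B entries takes a while.)
print(f"dataset size:     {len(dataset)}")
print(f"max global index: {int(global_indices.max())}")
if int(global_indices.max()) >= len(dataset):
    print("Mismatch: global_indices references instances the local dataset does not"
          " have; the download may be incomplete or data.paths may differ from the"
          " run config.")
```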
3. Detailed Error Message
> Traceback (most recent call last):
> File "test.py", line 96, in <module>
> main()
> File "test.py", line 83, in main
> batch = torch.tensor(get_batch_instances(batch_idx=step))
> File "test.py", line 60, in get_batch_instances
> token_ids = dataset[index]["input_ids"].tolist()
> File "site-packages/olmo/data/memmap_dataset.py", line 176, in __getitem__
> raise IndexError(f"{index} is out of bounds for dataset of size {len(self)}")
> IndexError: 925801835 is out of bounds for dataset of size 925201012
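For what it's worth, the gap itself looks informative: the order file expects at least 925,801,836 instances, but the locally built dataset has only 925,201,012, a shortfall of roughly 600k instances, which suggests one or more `.npy` files missing or truncated on disk rather than a corrupt index. A sketch of how one might locate the culprit, under the assumption (true for the official OLMo-7B config, but worth double-checking) that tokens are stored as `uint16` and each instance is `cfg.model.max_sequence_length` tokens, so one instance occupies `2 * 2048` bytes:

```python
import os

from olmo.config import TrainConfig

cfg = TrainConfig.load("OLMo_config/OLMo-7B.yaml")

# Assumed layout: 2 bytes per token (uint16), max_sequence_length tokens per
# instance. Both hold for the official OLMo-7B config.
bytes_per_instance = 2 * cfg.model.max_sequence_length

total = 0
for path in cfg.data.paths:
    size = os.path.getsize(path)
    if size % bytes_per_instance != 0:
        print(f"possibly truncated: {path} ({size} bytes)")
    total += size // bytes_per_instance

print(f"instances on disk: {total}")  # needs to be >= 925,801,836 for step 1
```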
Are the URLs where the OLMo-7B pre-training corpus is saved wrong? Or is the dataset at those URLs fine, and did something simply go wrong on my end when I downloaded it?
4. Additional Question
- On the Hugging Face page for the Dolma dataset, the URL where the corpus used to pre-train OLMo-7B is stored differs from the one specified in `./configs/official/OLMo-7B.yaml`. Do these two URLs serve the same role?
I wonder whether this issue has been resolved.
Is the path you provided, https://olmo-checkpoints.org/ai2-llm/olmo-medium/wvc30anm/train_data/global_indices.npy, correct?