
Issue training on multiple nodes

Open edwardsp opened this issue 10 months ago • 3 comments

❓ The question

I am trying to run training and I get this error during startup:

HfHubHTTPError: 429 Client Error: Too Many Requests for url: https://huggingface.co/api/datasets/glue/paths-info/bcdcba79d07bc864c1c254ccfcedcce55bcc9a8c
[2024-04-18 15:55:06] CRITICAL [olmo.util:158, rank=6] Uncaught HfHubHTTPError: 429 Client Error: Too Many Requests for url: https://huggingface.co/api/datasets/glue/paths-info/bcdcba79d07bc864c1c254ccfcedcce55bcc9a8c

I am running on 2 nodes, each with 8 GPUs, using the main branch and PyTorch 2.2.2+cu121.

This works with just 1 node using 8 GPUs.

edwardsp avatar Apr 18 '24 16:04 edwardsp

I have exactly the same problem: 1 node works, but 2 nodes fail. I think this is a problem on the Hugging Face side.

xijiu9 avatar Apr 18 '24 17:04 xijiu9

We run into issues like that too. We don't have a robust solution yet, but one trick is to cache the datasets locally (once per node, or once per shared file system) with the snippet below, and then prevent HF from calling the hub by setting the environment variable HF_DATASETS_OFFLINE=1.

from olmo.eval.downstream import *  # provides Tokenizer and label_to_task_map

# Instantiating each downstream eval task forces HF `datasets` to download
# and cache the underlying dataset locally.
tokenizer = Tokenizer.from_file("tokenizers/allenai_gpt-neox-olmo-dolma-v1_5.json")
for task in label_to_task_map.values():
    task_kwargs = {}
    if isinstance(task, tuple):
        # Some map entries are (task_class, kwargs) pairs.
        task, task_kwargs = task
    task(tokenizer=tokenizer, **task_kwargs)
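
Note that HF_DATASETS_OFFLINE=1 has to be set in the environment of every training process. A minimal launch sketch, assuming the snippet above has been saved to a file (the script name and config path are illustrative, and multi-node rendezvous flags are omitted):

# Step 1: populate the local HF cache, once per node or per shared file system.
python cache_eval_datasets.py

# Step 2: launch training with hub access disabled so HF only reads the cache.
export HF_DATASETS_OFFLINE=1
torchrun --nnodes=2 --nproc_per_node=8 scripts/train.py configs/your_config.yaml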

2015aroras avatar Apr 19 '24 18:04 2015aroras

I recently merged https://github.com/allenai/OLMo/pull/623, which improves the HF loading situation. Some of the optimizations apply just by merging in the fix. For the rest, you'll need to set hf_datasets_cache_dir in your config (or pass --hf_datasets_cache_dir=<dir> to scripts/train.py) to a folder that can be used for caching.
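
For example, a launch pointing the cache at a shared directory might look like this (the config path and cache directory are illustrative):

torchrun --nnodes=2 --nproc_per_node=8 scripts/train.py configs/your_config.yaml \
    --hf_datasets_cache_dir=/shared/hf_datasets_cache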

2015aroras avatar Jun 24 '24 16:06 2015aroras