OLMo
OLMoThreadError
❓ The question
Please advise where this error might come from:

[2024-04-18 19:06:17] INFO [olmo.train:816, rank=0] [step=75/739328] train/CrossEntropyLoss=7.417 train/Perplexity=1,664 throughput/total_tokens=314,572,800 throughput/device/tokens_per_second=9,407 throughput/device/batches_per_second=0.0022
[2024-04-18 19:10:41] CRITICAL [olmo.util:158, rank=0] Uncaught OLMoThreadError: generator thread data thread 3 failed
@juripapay, can you give more details on the model size, batch size, GPU type (AMD/NVIDIA), and whether flash attention was used? I'd like to know in which setting you are getting a throughput of 9k tokens/GPU/sec.
@juripapay - is there a traceback logged after the last line you pasted? I would expect it to log the traceback info, based on this.
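For context, here is roughly the pattern that produces this message. This is a minimal sketch, not the actual OLMo source: the trainer consumes batches from a generator running in a background thread, and if that thread dies for any reason, the main thread re-raises a generic OLMoThreadError, so the traceback with the real cause is whatever the worker thread logged just before this line.

```python
import threading
from queue import Queue


class OLMoThreadError(Exception):
    pass


def threaded_generator(generator, maxsize: int = 16, thread_name: str = "data thread"):
    """Run `generator` in a background thread, yielding its items in order."""
    q: Queue = Queue(maxsize=maxsize)
    sentinel = object()

    def worker():
        try:
            for item in generator:
                q.put(item)
        except Exception as exc:
            # The real cause of the failure ends up here ...
            q.put(exc)
        finally:
            q.put(sentinel)

    threading.Thread(target=worker, name=thread_name, daemon=True).start()

    while True:
        item = q.get()
        if item is sentinel:
            break
        if isinstance(item, Exception):
            # ... and the main thread only sees this wrapper, which is the
            # line that shows up in the rank-0 log.
            raise OLMoThreadError(f"generator thread {thread_name} failed") from item
        yield item
```

In other words, the OLMoThreadError itself only says "the data thread died"; the underlying cause should appear earlier in the rank-0 log or in the chained ("raised from") part of the traceback.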
Hi, I encountered the same problem and would need some assistance on how to resolve it.
I tried training the OLMo-1B model and didn't change much in the config YAML:
global_train_batch_size: 2048
device_train_microbatch_size: 8
My GPUs were A100s: 2 nodes with 4 GPUs each on an Azure NC96ads cluster, and I did not use flash attention.
Traceback (most recent call last):
File "/mnt/azureml/cr/j/e83d0122ec494c7dbf7572c30a51c53b/exe/wd/scripts/train.py", line 300, in
I also ran into the same issue. I trained OLMo-1B with the provided config files. The batch size is:
global_train_batch_size: 2048
device_train_microbatch_size: 8
I used 8 NVIDIA A100 GPUs within one node. My flash attention version is flash_attn-2.5.9.post1+cu122torch2.2cxx11abiFALSE-cp310-cp310-linux_x86_64
It seems that the bug appears randomly. I ran the training 3 times with the command SCRATCH_DIR=<my-specific-path> torchrun --nproc_per_node=8 scripts/train.py configs/official/OLMo-1B.yaml, and the error appeared at the 9th/3rd/5th step, respectively.
I am wondering if anyone could give some advice. Thanks.
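Not a fix, but since the thread that died is the data-loading generator, one cheap thing to rule out is a truncated or corrupt copy of the training data. Below is a minimal sketch, assuming your data paths resolve to local (or locally cached) memmapped .npy files of uint16 token IDs as in the official configs; skip it if you stream the data over HTTP. It only checks that each file is non-empty, has a size divisible by the token dtype, and is readable at both ends:

```python
# Hedged sketch: sanity-check locally cached training data files.
# Assumes flat arrays of uint16 token IDs; adjust `dtype` if your data differs.
import sys
from pathlib import Path

import numpy as np

dtype = np.uint16


def check_file(path: Path) -> bool:
    size = path.stat().st_size
    if size == 0 or size % dtype().itemsize != 0:
        print(f"SUSPECT (size {size} bytes): {path}")
        return False
    # Touch the first and last token to catch truncated/corrupt files early.
    arr = np.memmap(path, dtype=dtype, mode="r")
    _ = int(arr[0]) + int(arr[-1])
    return True


if __name__ == "__main__":
    data_dir = Path(sys.argv[1])
    bad = [p for p in sorted(data_dir.rglob("*.npy")) if not check_file(p)]
    print(f"Checked files under {data_dir}; {len(bad)} suspect.")
```

If everything checks out, the next place to look is the part of the rank log just above the CRITICAL line, which should contain the worker thread's own traceback.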