
when running python train_gpt2.py, errors out after 10 iterations -- is this normal?

Open JamesHuang2004 opened this issue 1 year ago • 7 comments

```
(base) billhuang@bh-m1-max llm.c % python train_gpt2.py
using device: mps
loading weights from pretrained gpt: gpt2
generation_config.json: 100%|██████████████████████████| 124/124 [00:00<00:00, 95.7kB/s]
loading cached tokens in data/tiny_shakespeare_val.bin
/Users/billhuang/TEST/llm.c/train_gpt2.py:333: UserWarning: The given NumPy array is not writable, and PyTorch does not support non-writable tensors. This means writing to this tensor will result in undefined behavior. You may want to copy the array to protect its data or make it writable before converting it to a tensor. This type of warning will be suppressed for the rest of this program. (Triggered internally at /Users/runner/work/pytorch/pytorch/pytorch/torch/csrc/utils/tensor_numpy.cpp:205.)
  tokens = torch.from_numpy(tokens)
wrote gpt2_124M.bin
wrote gpt2_124M_debug_state.bin
iteration 0, loss: 5.270007133483887
iteration 1, loss: 4.059707164764404
iteration 2, loss: 3.375124931335449
iteration 3, loss: 2.8007795810699463
iteration 4, loss: 2.3153889179229736
iteration 5, loss: 1.849020004272461
iteration 6, loss: 1.3946489095687866
iteration 7, loss: 0.9991437196731567
iteration 8, loss: 0.6240723729133606
iteration 9, loss: 0.376505047082901
Traceback (most recent call last):
  File "/Users/billhuang/TEST/llm.c/train_gpt2.py", line 380, in <module>
    y = model.generate(x, max_new_tokens, temperature=temperature, top_k=top_k)
  File "/Users/billhuang/miniforge3/lib/python3.9/site-packages/torch/autograd/grad_mode.py", line 27, in decorate_context
    return func(*args, **kwargs)
  File "/Users/billhuang/TEST/llm.c/train_gpt2.py", line 202, in generate
    v, _ = torch.topk(logits, min(top_k, logits.size(-1)))
RuntimeError: Currently topk on mps works only for k<=16
```

JamesHuang2004 · Apr 09 '24 15:04

I was able to do the remaining steps from README.md after this, so I assume this is by design?

JamesHuang2004 · Apr 09 '24 16:04

Consider upgrading torch.

chsasank · Apr 09 '24 16:04

This was reported previously: https://github.com/karpathy/llm.c/issues/8

goswamig · Apr 09 '24 21:04

Yes, upgrading to PyTorch 2.2 worked fine. @chsasank, do you know what caused this error in the old version of PyTorch?

goswamig · Apr 09 '24 21:04
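For reference, older PyTorch builds capped `torch.topk` on the MPS backend at `k <= 16`, which is exactly what the traceback above hits during sampling. A minimal sketch of a guard that would have sidestepped it (the helper name is hypothetical; `logits` is assumed to be a 2D tensor as in `generate`):

```python
import torch

def topk_mps_safe(logits, k):
    # Older PyTorch raised "Currently topk on mps works only for k<=16".
    # Hypothetical workaround: run topk on CPU for larger k, then move
    # the results back to the original device.
    k = min(k, logits.size(-1))
    if logits.device.type == "mps" and k > 16:
        values, indices = torch.topk(logits.cpu(), k)
        return values.to(logits.device), indices.to(logits.device)
    return torch.topk(logits, k)

# usage: values, indices = topk_mps_safe(logits, top_k)
```

The real fix, as noted in this thread, is simply upgrading PyTorch, which lifts the MPS limit; the sketch only shows why the crash happens at `top_k > 16`.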

Thanks, guys. I just upgraded torch, but when rerunning the command I get an abort and an OpenMP complaint. Is this normal?

```
Installing collected packages: mpmath, typing-extensions, sympy, networkx, torch, torchvision, torchaudio
  Attempting uninstall: typing-extensions
    Found existing installation: typing_extensions 4.4.0
    Uninstalling typing_extensions-4.4.0:
      Successfully uninstalled typing_extensions-4.4.0
  Attempting uninstall: torch
    Found existing installation: torch 1.13.1
    Uninstalling torch-1.13.1:
      Successfully uninstalled torch-1.13.1
  Attempting uninstall: torchvision
    Found existing installation: torchvision 0.14.1
    Uninstalling torchvision-0.14.1:
      Successfully uninstalled torchvision-0.14.1
  Attempting uninstall: torchaudio
    Found existing installation: torchaudio 0.13.1
    Uninstalling torchaudio-0.13.1:
      Successfully uninstalled torchaudio-0.13.1
Successfully installed mpmath-1.3.0 networkx-3.2.1 sympy-1.12 torch-2.2.2 torchaudio-2.2.2 torchvision-0.17.2 typing-extensions-4.11.0
(base) billhuang@bh-m1-max llm.c % pip show torch
Name: torch
Version: 2.2.2
Summary: Tensors and Dynamic neural networks in Python with strong GPU acceleration
Home-page: https://pytorch.org/
Author: PyTorch Team
Author-email: [email protected]
License: BSD-3
Location: /Users/billhuang/miniforge3/lib/python3.9/site-packages
Requires: filelock, fsspec, jinja2, networkx, sympy, typing-extensions
Required-by: torchaudio, torchvision
(base) billhuang@bh-m1-max llm.c % python train_gpt2.py
OMP: Error #15: Initializing libomp.dylib, but found libomp.dylib already initialized.
OMP: Hint This means that multiple copies of the OpenMP runtime have been linked into the program. That is dangerous, since it can degrade performance or cause incorrect results. The best thing to do is to ensure that only a single OpenMP runtime is linked into the process, e.g. by avoiding static linking of the OpenMP runtime in any library. As an unsafe, unsupported, undocumented workaround you can set the environment variable KMP_DUPLICATE_LIB_OK=TRUE to allow the program to continue to execute, but that may cause crashes or silently produce incorrect results. For more information, please see http://openmp.llvm.org/
zsh: abort      python train_gpt2.py
```

JamesHuang2004 · Apr 09 '24 23:04

You seem to be in the base env. Double-check that all required packages are installed and that paths are set up correctly. The error's Hint section above also suggests a workaround you can use.

goswamig · Apr 09 '24 23:04
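For completeness, the workaround named in the error's Hint section can be applied from Python, as a sketch. Per the message itself this is unsafe and unsupported, and it must happen before the first `import torch` in the process:

```python
import os

# Unsafe, unsupported workaround quoted in the OMP hint: allow
# duplicate OpenMP runtimes to coexist instead of aborting.
# Must be set before torch is first imported in this process.
os.environ["KMP_DUPLICATE_LIB_OK"] = "TRUE"

import torch  # imported only after the variable is set
```

The cleaner fix is to ensure only one copy of libomp is linked, which on macOS often means not mixing conda-installed and pip-installed builds of torch/numpy in the same (base) environment.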

Yes, I am using the base environment. This is by design, out of concern that a separate environment might introduce additional mess. For this specific issue, what could be the problem?

```
(base) billhuang@bh-m1-max llm.c % python train_gpt2.py
OMP: Error #15: Initializing libomp.dylib, but found libomp.dylib already initialized.
OMP: Hint This means that multiple copies of the OpenMP runtime have been linked into the program. That is dangerous, since it can degrade performance or cause incorrect results. The best thing to do is to ensure that only a single OpenMP runtime is linked into the process, e.g. by avoiding static linking of the OpenMP runtime in any library. As an unsafe, unsupported, undocumented workaround you can set the environment variable KMP_DUPLICATE_LIB_OK=TRUE to allow the program to continue to execute, but that may cause crashes or silently produce incorrect results. For more information, please see http://openmp.llvm.org/
zsh: abort      python train_gpt2.py
```

JamesHuang2004 · Apr 09 '24 23:04

Closing due to lack of further response.

JamesHuang2004 · Apr 10 '24 23:04