Unable to run example_completion.py on CodeLlama-7b
Hi, I have a single GPU on my system and I am using CodeLlama-7b to test my environment. I am running into the following error when I run the example:
$ torchrun --nproc_per_node 1 example_completion.py \
--ckpt_dir CodeLlama-7b \
--tokenizer_path CodeLlama-7b/tokenizer.model \
--max_seq_len 128 --max_batch_size 1
> initializing model parallel with size 1
> initializing ddp with size 1
> initializing pipeline with size 1
Traceback (most recent call last):
File "/home/aditya/rb16/Code/llama-ft/codellama/example_completion.py", line 53, in <module>
fire.Fire(main)
File "/home/aditya/anaconda3/lib/python3.11/site-packages/fire/core.py", line 141, in Fire
component_trace = _Fire(component, args, parsed_flag_args, context, name)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/home/aditya/anaconda3/lib/python3.11/site-packages/fire/core.py", line 475, in _Fire
component, remaining_args = _CallAndUpdateTrace(
^^^^^^^^^^^^^^^^^^^^
File "/home/aditya/anaconda3/lib/python3.11/site-packages/fire/core.py", line 691, in _CallAndUpdateTrace
component = fn(*varargs, **kwargs)
^^^^^^^^^^^^^^^^^^^^^^
File "/home/aditya/rb16/Code/llama-ft/codellama/example_completion.py", line 20, in main
generator = Llama.build(
^^^^^^^^^^^^
File "/home/aditya/rb16/Code/llama-ft/codellama/llama/generation.py", line 102, in build
checkpoint = torch.load(ckpt_path, map_location="cpu")
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/home/aditya/anaconda3/lib/python3.11/site-packages/torch/serialization.py", line 1026, in load
return _load(opened_zipfile,
^^^^^^^^^^^^^^^^^^^^^
File "/home/aditya/anaconda3/lib/python3.11/site-packages/torch/serialization.py", line 1438, in _load
result = unpickler.load()
^^^^^^^^^^^^^^^^
File "/home/aditya/anaconda3/lib/python3.11/site-packages/torch/serialization.py", line 1408, in persistent_load
typed_storage = load_tensor(dtype, nbytes, key, _maybe_decode_ascii(location))
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/home/aditya/anaconda3/lib/python3.11/site-packages/torch/serialization.py", line 1373, in load_tensor
storage = zip_file.get_storage_from_record(name, numel, torch.UntypedStorage)._typed_storage()._untyped_storage
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
OSError: [Errno 5] Input/output error
[2024-02-17 13:26:43,422] torch.distributed.elastic.multiprocessing.api: [ERROR] failed (exitcode: 1) local_rank: 0 (pid: 3852309) of binary: /home/aditya/anaconda3/bin/python
Traceback (most recent call last):
File "/home/aditya/anaconda3/bin/torchrun", line 33, in <module>
sys.exit(load_entry_point('torch==2.2.0', 'console_scripts', 'torchrun')())
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/home/aditya/anaconda3/lib/python3.11/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 347, in wrapper
return f(*args, **kwargs)
^^^^^^^^^^^^^^^^^^
File "/home/aditya/anaconda3/lib/python3.11/site-packages/torch/distributed/run.py", line 812, in main
run(args)
File "/home/aditya/anaconda3/lib/python3.11/site-packages/torch/distributed/run.py", line 803, in run
elastic_launch(
File "/home/aditya/anaconda3/lib/python3.11/site-packages/torch/distributed/launcher/api.py", line 135, in __call__
return launch_agent(self._config, self._entrypoint, list(args))
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/home/aditya/anaconda3/lib/python3.11/site-packages/torch/distributed/launcher/api.py", line 268, in launch_agent
raise ChildFailedError(
torch.distributed.elastic.multiprocessing.errors.ChildFailedError:
============================================================
example_completion.py FAILED
------------------------------------------------------------
Failures:
<NO_OTHER_FAILURES>
------------------------------------------------------------
Root Cause (first observed failure):
[0]:
time : 2024-02-17_13:26:43
host : stormbreaker
rank : 0 (local_rank: 0)
exitcode : 1 (pid: 3852309)
error_file: <N/A>
traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html
$ ls -ltr ./CodeLlama-7b
total 13169098
-rw-rw-r-- 1 aditya aditya 500058 Aug 21 14:32 tokenizer.model
-rw-rw-r-- 1 aditya aditya 163 Aug 21 14:32 params.json
-rw-rw-r-- 1 aditya aditya 13477187307 Aug 21 14:32 consolidated.00.pth
-rw-rw-r-- 1 aditya aditya 150 Aug 21 14:32 checklist.chk
$ echo $CUDA_VISIBLE_DEVICES
0
The conda env:
channels:
  - pytorch
  - nvidia
dependencies:
  - numpy
  - pandas
  - pytorch-cuda=12.1
  - pytorch
  - torchvision
  - torchaudio
variables:
  CUDA_PATH: /usr/local/cuda-12.1
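For what it's worth, the load can also be tried outside torchrun with a minimal snippet (a sketch, not from the repo, using the paths from the command above), to see whether the I/O error comes from the file itself rather than from the launcher:

import torch

# Minimal sketch: load the same checkpoint that generation.py loads,
# but without the distributed launcher.
ckpt_path = "CodeLlama-7b/consolidated.00.pth"
try:
    checkpoint = torch.load(ckpt_path, map_location="cpu")
    print(f"loaded {len(checkpoint)} entries from {ckpt_path}")
except OSError as e:
    # Errno 5 here would point at the file (or the disk it lives on),
    # not at the example script.
    print(f"failed to read {ckpt_path}: {e}")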
Hi @aditya4d1, to rule out corrupted files (which the error message seems to point to), can you run md5sum -c checklist.chk in the CodeLlama-7b directory?
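Roughly the same check can also be done from Python if that's easier (a sketch; it assumes the usual "<md5>  <filename>" lines in checklist.chk and is run inside the CodeLlama-7b directory):

import hashlib

# Sketch of an md5sum -c equivalent: hash each listed file in 1 MiB chunks
# and compare against the checksum recorded in checklist.chk.
with open("checklist.chk") as listing:
    for line in listing:
        parts = line.split()
        if len(parts) != 2:
            continue
        expected, name = parts
        digest = hashlib.md5()
        with open(name, "rb") as blob:
            for chunk in iter(lambda: blob.read(1 << 20), b""):
                digest.update(chunk)
        status = "OK" if digest.hexdigest() == expected else "FAILED"
        print(f"{name}: {status}")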
@jgehring
md5sum: consolidated.00.pth: Input/output error
consolidated.00.pth: FAILED open or read
params.json: OK
tokenizer.model: OK
md5sum: WARNING: 1 listed file could not be read
Should I re-download the weights?
Update: I re-downloaded the weights and ran into the checksum error again:
Checking checksums
consolidated.00.pth: FAILED
params.json: OK
tokenizer.model: OK
md5sum: WARNING: 1 computed checksum did NOT match
ping @jgehring