AttributeError: 'NoneType' object has no attribute 'get' when running torchrun
I encountered an error when running torchrun command on my system with the following traceback:
Traceback (most recent call last):
File "/mnt/f/projects/python/git/llama/venv/bin/torchrun", line 8, in <module>
sys.exit(main())
^^^^^^
File "/mnt/f/projects/python/git/llama/venv/lib/python3.11/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 346, in wrapper
return f(*args, **kwargs)
^^^^^^^^^^^^^^^^^^
File "/mnt/f/projects/python/git/llama/venv/lib/python3.11/site-packages/torch/distributed/run.py", line 762, in main
run(args)
File "/mnt/f/projects/python/git/llama/venv/lib/python3.11/site-packages/torch/distributed/run.py", line 753, in run
elastic_launch(
File "/mnt/f/projects/python/git/llama/venv/lib/python3.11/site-packages/torch/distributed/launcher/api.py", line 132, in __call__
return launch_agent(self._config, self._entrypoint, list(args))
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/mnt/f/projects/python/git/llama/venv/lib/python3.11/site-packages/torch/distributed/launcher/api.py", line 237, in launch_agent
result = agent.run()
^^^^^^^^^^^
File "/mnt/f/projects/python/git/llama/venv/lib/python3.11/site-packages/torch/distributed/elastic/metrics/api.py", line 129, in wrapper
result = f(*args, **kwargs)
^^^^^^^^^^^^^^^^^^
File "/mnt/f/projects/python/git/llama/venv/lib/python3.11/site-packages/torch/distributed/elastic/agent/server/api.py", line 709, in run
result = self._invoke_run(role)
^^^^^^^^^^^^^^^^^^^^^^
File "/mnt/f/projects/python/git/llama/venv/lib/python3.11/site-packages/torch/distributed/elastic/agent/server/api.py", line 844, in _invoke_run
self._initialize_workers(self._worker_group)
File "/mnt/f/projects/python/git/llama/venv/lib/python3.11/site-packages/torch/distributed/elastic/metrics/api.py", line 129, in wrapper
result = f(*args, **kwargs)
^^^^^^^^^^^^^^^^^^
File "/mnt/f/projects/python/git/llama/venv/lib/python3.11/site-packages/torch/distributed/elastic/agent/server/api.py", line 681, in _initialize_workers
worker_ids = self._start_workers(worker_group)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/mnt/f/projects/python/git/llama/venv/lib/python3.11/site-packages/torch/distributed/elastic/metrics/api.py", line 129, in wrapper
result = f(*args, **kwargs)
^^^^^^^^^^^^^^^^^^
File "/mnt/f/projects/python/git/llama/venv/lib/python3.11/site-packages/torch/distributed/elastic/agent/server/local_elastic_agent.py", line 271, in _start_workers
self._pcontext = start_processes(
^^^^^^^^^^^^^^^^
File "/mnt/f/projects/python/git/llama/venv/lib/python3.11/site-packages/torch/distributed/elastic/multiprocessing/__init__.py", line 207, in start_processes
redirs = to_map(redirects, nprocs)
^^^^^^^^^^^^^^^^^^^^^^^^^
File "/mnt/f/projects/python/git/llama/venv/lib/python3.11/site-packages/torch/distributed/elastic/multiprocessing/api.py", line 162, in to_map
map[i] = val_or_map.get(i, Std.NONE)
^^^^^^^^^^^^^^
AttributeError: 'NoneType' object has no attribute 'get'
I am running torchrun with the --nproc_per_node 1 option, passing the example.py script as an argument together with the --ckpt_dir and --tokenizer_path arguments. I have downloaded the 7B files, verified the checksums, and set $TARGET_FOLDER. I am not sure what caused this error or how to resolve it.
Here is the command I ran:
$ torchrun --nproc_per_node 1 example.py --ckpt_dir $TARGET_FOLDER/7B --tokenizer_path $TARGET_FOLDER/tokenizer.model
Can you please help me diagnose the issue and find a solution? Thank you.
An obvious possible reason: input parameter errors. Missing or incorrect key parameters can make the program fail, so it is worth checking for that first.
You cannot just copy the sample command as-is. $TARGET_FOLDER is a placeholder: replace it with the local path of the folder into which you downloaded the models. In my case that folder is 'some_path/llama/models/7B', so I would replace $TARGET_FOLDER with './models'.
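For illustration (assuming the standard download layout, with tokenizer.model sitting next to the 7B folder), the substituted command would look something like:
$ torchrun --nproc_per_node 1 example.py --ckpt_dir ./models/7B --tokenizer_path ./models/tokenizer.model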
@tsaijamey Thank you for your response. To clarify, I have already set the $TARGET_FOLDER variable to the folder where the 7B files are located. I added the following code to the main method to check that the paths are set correctly:
def main(ckpt_dir: str, tokenizer_path: str, temperature: float = 0.8, top_p: float = 0.95):
    print("ckpt_dir: ", ckpt_dir)
    print("tokenizer_path: ", tokenizer_path)
    if not os.path.isfile(tokenizer_path):
        print(f"{tokenizer_path} not exists ")
    if not os.path.isdir(ckpt_dir):
        print(f"{ckpt_dir} not exists ")
    print("all is fine")
    exit()
    local_rank, world_size = setup_model_parallel()
    if local_rank > 0:
        sys.stdout = open(os.devnull, 'w')
    ...
I executed the following command without torchrun, since torchrun gives me the error mentioned above:
python example.py --ckpt_dir $TARGET_FOLDER/7B --tokenizer_path $TARGET_FOLDER/tokenizer.model
The output:
ckpt_dir: /mnt/f/projects/python/git/llama/models/7B
tokenizer_path: /mnt/f/projects/python/git/llama/models/tokenizer.model
all is fine
I hope this helps clarify the issue. If there is anything else that needs to be checked, please let me know.
It's a PyTorch bug; try Python 3.10 until it is fixed.
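To make the failure concrete, here is a minimal sketch of the failing pattern (this is not the actual torch source; the Std enum below is a simplified stand-in). In torch.distributed.elastic.multiprocessing.api.to_map, a redirects value arriving as None instead of a Std member or a dict triggers exactly the error in the traceback:
from enum import IntFlag

class Std(IntFlag):  # stand-in for torch's Std enum, for illustration only
    NONE = 0
    OUT = 1
    ERR = 2

def to_map(val_or_map, local_world_size):
    # A single Std value is broadcast to every local rank; otherwise the
    # argument is treated as a dict mapping local rank -> Std.
    if isinstance(val_or_map, Std):
        return {i: val_or_map for i in range(local_world_size)}
    result = {}
    for i in range(local_world_size):
        result[i] = val_or_map.get(i, Std.NONE)  # fails when val_or_map is None
    return result

to_map(None, 1)  # AttributeError: 'NoneType' object has no attribute 'get'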
This may work as a hack for those trying Python 3.11:
--- /home/YOU/.local/lib/python3.11/site-packages/torch/distributed/elastic/multiprocessing/api.py~ 2022-12-07 17:11:01.763871538 -0500
+++ /home/YOU/.local/lib/python3.11/site-packages/torch/distributed/elastic/multiprocessing/api.py 2023-03-03 22:08:41.714570686 -0500
@@ -159,7 +159,7 @@
     else:
         map = {}
         for i in range(local_world_size):
-            map[i] = val_or_map.get(i, Std.NONE)
+            map[i] = val_or_map.get(i, Std.NONE) if val_or_map else Std.NONE
         return map
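As a quick sanity check of what the one-line guard does (again with a stand-in Std enum, not the real torch module): a None redirects value now falls back to Std.NONE, i.e. no output redirection, for every local rank instead of raising.
from enum import IntFlag

class Std(IntFlag):  # stand-in for torch's Std enum, as in the sketch above
    NONE = 0
    OUT = 1
    ERR = 2

def to_map_patched(val_or_map, local_world_size):
    if isinstance(val_or_map, Std):
        return {i: val_or_map for i in range(local_world_size)}
    result = {}
    for i in range(local_world_size):
        # the guarded lookup from the patch above
        result[i] = val_or_map.get(i, Std.NONE) if val_or_map else Std.NONE
    return result

print(to_map_patched(None, 1))  # maps rank 0 to Std.NONE instead of crashing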