AttributeError: 'NoneType' object has no attribute 'get' when running torchrun
I encountered an error when running torchrun command on my system with the following traceback:
Traceback (most recent call last):
File "/mnt/f/projects/python/git/llama/venv/bin/torchrun", line 8, in <module>
sys.exit(main())
^^^^^^
File "/mnt/f/projects/python/git/llama/venv/lib/python3.11/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 346, in wrapper
return f(*args, **kwargs)
^^^^^^^^^^^^^^^^^^
File "/mnt/f/projects/python/git/llama/venv/lib/python3.11/site-packages/torch/distributed/run.py", line 762, in main
run(args)
File "/mnt/f/projects/python/git/llama/venv/lib/python3.11/site-packages/torch/distributed/run.py", line 753, in run
elastic_launch(
File "/mnt/f/projects/python/git/llama/venv/lib/python3.11/site-packages/torch/distributed/launcher/api.py", line 132, in __call__
return launch_agent(self._config, self._entrypoint, list(args))
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/mnt/f/projects/python/git/llama/venv/lib/python3.11/site-packages/torch/distributed/launcher/api.py", line 237, in launch_agent
result = agent.run()
^^^^^^^^^^^
File "/mnt/f/projects/python/git/llama/venv/lib/python3.11/site-packages/torch/distributed/elastic/metrics/api.py", line 129, in wrapper
result = f(*args, **kwargs)
^^^^^^^^^^^^^^^^^^
File "/mnt/f/projects/python/git/llama/venv/lib/python3.11/site-packages/torch/distributed/elastic/agent/server/api.py", line 709, in run
result = self._invoke_run(role)
^^^^^^^^^^^^^^^^^^^^^^
File "/mnt/f/projects/python/git/llama/venv/lib/python3.11/site-packages/torch/distributed/elastic/agent/server/api.py", line 844, in _invoke_run
self._initialize_workers(self._worker_group)
File "/mnt/f/projects/python/git/llama/venv/lib/python3.11/site-packages/torch/distributed/elastic/metrics/api.py", line 129, in wrapper
result = f(*args, **kwargs)
^^^^^^^^^^^^^^^^^^
File "/mnt/f/projects/python/git/llama/venv/lib/python3.11/site-packages/torch/distributed/elastic/agent/server/api.py", line 681, in _initialize_workers
worker_ids = self._start_workers(worker_group)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/mnt/f/projects/python/git/llama/venv/lib/python3.11/site-packages/torch/distributed/elastic/metrics/api.py", line 129, in wrapper
result = f(*args, **kwargs)
^^^^^^^^^^^^^^^^^^
File "/mnt/f/projects/python/git/llama/venv/lib/python3.11/site-packages/torch/distributed/elastic/agent/server/local_elastic_agent.py", line 271, in _start_workers
self._pcontext = start_processes(
^^^^^^^^^^^^^^^^
File "/mnt/f/projects/python/git/llama/venv/lib/python3.11/site-packages/torch/distributed/elastic/multiprocessing/__init__.py", line 207, in start_processes
redirs = to_map(redirects, nprocs)
^^^^^^^^^^^^^^^^^^^^^^^^^
File "/mnt/f/projects/python/git/llama/venv/lib/python3.11/site-packages/torch/distributed/elastic/multiprocessing/api.py", line 162, in to_map
map[i] = val_or_map.get(i, Std.NONE)
^^^^^^^^^^^^^^
AttributeError: 'NoneType' object has no attribute 'get'
I am running torchrun with the --nproc_per_node 1 option, passing the example.py script as an argument together with the --ckpt_dir and --tokenizer_path arguments. I have downloaded the 7B files, verified the checksums, and set $TARGET_FOLDER. I am not sure what caused this error or how to resolve it.
Here is the command I ran:
$ torchrun --nproc_per_node 1 example.py --ckpt_dir $TARGET_FOLDER/7B --tokenizer_path $TARGET_FOLDER/tokenizer.model
Can you please help me diagnose the issue and find a solution? Thank you.
An obvious possible reason: input parameter errors. Missing or incorrect key parameters can make the program fail, so it is worth checking for that first.
You cannot just copy the sample command as-is. $TARGET_FOLDER is a placeholder: replace it with the local path of the folder into which you downloaded the models. In my case that folder is 'some_path/llama/models/7B', so I would replace $TARGET_FOLDER with './models'.
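For illustration (assuming the standard download layout, with tokenizer.model sitting next to the 7B folder), the substituted command would look something like:
$ torchrun --nproc_per_node 1 example.py --ckpt_dir ./models/7B --tokenizer_path ./models/tokenizer.model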
@tsaijamey Thank you for your response. To clarify, I have already set the $TARGET_FOLDER variable to the folder where the 7B files are located. I added the following code to the main method to check that the paths are set correctly:
def main(ckpt_dir: str, tokenizer_path: str, temperature: float = 0.8, top_p: float = 0.95):
    print("ckpt_dir: ", ckpt_dir)
    print("tokenizer_path: ", tokenizer_path)
    if not os.path.isfile(tokenizer_path):
        print(f"{tokenizer_path} not exists ")
    if not os.path.isdir(ckpt_dir):
        print(f"{ckpt_dir} not exists ")
    print("all is fine")
    exit()
    local_rank, world_size = setup_model_parallel()
    if local_rank > 0:
        sys.stdout = open(os.devnull, 'w')
    ...
I executed the following command without torchrun, since torchrun gives me the error mentioned above:
python example.py --ckpt_dir $TARGET_FOLDER/7B --tokenizer_path $TARGET_FOLDER/tokenizer.model
The output:
ckpt_dir: /mnt/f/projects/python/git/llama/models/7B
tokenizer_path: /mnt/f/projects/python/git/llama/models/tokenizer.model
all is fine
I hope this helps clarify the issue. If there is anything else that needs to be checked, please let me know.
It's a PyTorch bug; try Python 3.10 until it is fixed.
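To make the failure concrete, here is a minimal sketch of the failing pattern (this is not the actual torch source; the Std enum below is a simplified stand-in). In torch.distributed.elastic.multiprocessing.api.to_map, a redirects value arriving as None instead of a Std member or a dict triggers exactly the error in the traceback:
from enum import IntFlag

class Std(IntFlag):  # stand-in for torch's Std enum, for illustration only
    NONE = 0
    OUT = 1
    ERR = 2

def to_map(val_or_map, local_world_size):
    # A single Std value is broadcast to every local rank; otherwise the
    # argument is treated as a dict mapping local rank -> Std.
    if isinstance(val_or_map, Std):
        return {i: val_or_map for i in range(local_world_size)}
    result = {}
    for i in range(local_world_size):
        result[i] = val_or_map.get(i, Std.NONE)  # fails when val_or_map is None
    return result

to_map(None, 1)  # AttributeError: 'NoneType' object has no attribute 'get'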
This may work as a hack for those trying Python 3.11:
--- /home/YOU/.local/lib/python3.11/site-packages/torch/distributed/elastic/multiprocessing/api.py~ 2022-12-07 17:11:01.763871538 -0500
+++ /home/YOU/.local/lib/python3.11/site-packages/torch/distributed/elastic/multiprocessing/api.py 2023-03-03 22:08:41.714570686 -0500
@@ -159,7 +159,7 @@
     else:
         map = {}
         for i in range(local_world_size):
-            map[i] = val_or_map.get(i, Std.NONE)
+            map[i] = val_or_map.get(i, Std.NONE) if val_or_map else Std.NONE
         return map
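As a quick sanity check of what the one-line guard does (again with a stand-in Std enum, not the real torch module): a None redirects value now falls back to Std.NONE, i.e. no output redirection, for every local rank instead of raising.
from enum import IntFlag

class Std(IntFlag):  # stand-in for torch's Std enum, as in the sketch above
    NONE = 0
    OUT = 1
    ERR = 2

def to_map_patched(val_or_map, local_world_size):
    if isinstance(val_or_map, Std):
        return {i: val_or_map for i in range(local_world_size)}
    result = {}
    for i in range(local_world_size):
        # the guarded lookup from the patch above
        result[i] = val_or_map.get(i, Std.NONE) if val_or_map else Std.NONE
    return result

print(to_map_patched(None, 1))  # maps rank 0 to Std.NONE instead of crashing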