llama
How to load the multi-GPU version without torchrun
Hi Community,
I was able to run example.py for the 13B model and see a result with two T4 GPUs (16 GB each) using torchrun:
torchrun --nproc_per_node 2 example.py --ckpt_dir "/path/to/13B" --tokenizer_path "/path/to/tokenizer.model"
But how can I load it so it runs with just python example.py, without using torchrun? That way we could build an API around it and wouldn't have to rerun example.py for every new prompt.
@ruian0 Could python -m torch.distributed.run --nproc_per_node 2 example.py --ckpt_dir "/path/to/13B" --tokenizer_path "/path/to/tokenizer.model"
do the trick? I use an older torch version where torchrun is not available and it works for me 😅
What Logophoman said. I submitted a commit yesterday to add this to the README, as torchrun doesn't work in all environments.
Thanks @Logophoman and @Inserian for looking into it! But I think I was looking for something like the below. For example, if we want to serve an API using Flask, we can host the service with
$ gunicorn api:app
and inside api.py we can have something like
from flask import Flask, request

app = Flask(__name__)

class LLaMA:
    def __init__(self):
        # load the model once at startup
        self.llama = self.load_llama()

    def load_llama(self):
        # build and return the LLaMA generator here
        pass

llama_engine = LLaMA()

@app.route('/')
def llama_completion():
    prompt = request.args.get('prompt', '')
    return llama_engine.llama.generate(prompt)
I tested this API-serving approach and it worked for the 7B model, since there is no model parallelism involved. But for the 13B model the parallelism was taken care of by torchrun + fairscale.nn.model_parallel.initialize, and I was not able to find a way to load the 13B model in a similar way.
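As far as I understand, what torchrun does is launch one process per GPU with a few environment variables set, after which fairscale initializes model parallelism. A rough, untested sketch of reproducing that in-process (helper names like run_worker and the port number are just placeholders, not from the repo) might look like:

import os
import torch
import torch.multiprocessing as mp
import torch.distributed as dist
from fairscale.nn.model_parallel.initialize import initialize_model_parallel

WORLD_SIZE = 2  # the nproc_per_node that torchrun would have launched

def run_worker(rank: int):
    # the environment variables torchrun normally exports for each worker
    os.environ["RANK"] = str(rank)
    os.environ["LOCAL_RANK"] = str(rank)
    os.environ["WORLD_SIZE"] = str(WORLD_SIZE)
    os.environ["MASTER_ADDR"] = "127.0.0.1"
    os.environ["MASTER_PORT"] = "29500"

    dist.init_process_group("nccl", rank=rank, world_size=WORLD_SIZE)
    initialize_model_parallel(WORLD_SIZE)
    torch.cuda.set_device(rank)

    # each rank would load its shard of the 13B checkpoint here (as example.py does)
    # and then wait for prompts, e.g. pulled from a queue fed by the web server

if __name__ == "__main__":
    mp.spawn(run_worker, nprocs=WORLD_SIZE, join=True)

Each rank would still need some way to receive prompts from the web server (for example a shared queue, or rank 0 handling HTTP requests and broadcasting the prompt to the other rank).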
I have the same problem
Maybe try specifying the MP size before launching the Flask app: MP_SIZE=2 gunicorn api:app
I got it working following the instructions in this repo: https://github.com/zsc/llama_infer. It uses Hugging Face's transformers and accelerate to load the model. Since it no longer needs torchrun, you can stick it in a Flask or FastAPI script no problem.
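Roughly, the idea is something like the untested sketch below ("path/to/llama-13b-hf" is a placeholder for a checkpoint already converted to the Hugging Face format):

from fastapi import FastAPI
from transformers import AutoTokenizer, AutoModelForCausalLM

app = FastAPI()

# placeholder path: a LLaMA checkpoint already converted to the Hugging Face format
MODEL_DIR = "path/to/llama-13b-hf"

tokenizer = AutoTokenizer.from_pretrained(MODEL_DIR)
model = AutoModelForCausalLM.from_pretrained(
    MODEL_DIR,
    device_map="auto",   # accelerate spreads the layers across the available GPUs
    torch_dtype="auto",
)

@app.get("/generate")
def generate(prompt: str):
    inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
    output_ids = model.generate(**inputs, max_new_tokens=128)
    return {"completion": tokenizer.decode(output_ids[0], skip_special_tokens=True)}

Assuming the file is saved as api.py, you can launch it with uvicorn api:app and query /generate?prompt=... with no torchrun involved.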
I encountered the same issue. As you have probably seen, it has since been solved by https://github.com/facebookresearch/llama/pull/147.
I hit the same problem building API scripts with the Llama 2 models. With multiple GPUs it complains that the port is already occupied. Using the method in #147, the llama-2-7b-chat model works, but for 13B and 70B no results are returned, and the API script doesn't report any errors.