
How to load the multi-GPU version without torchrun

Open ruian0 opened this issue 1 year ago • 7 comments

Hi Community,

I was able to run example.py for the 13B model and see a result on two T4 GPUs (16 GB each) using torchrun:

 torchrun --nproc_per_node 2 example.py --ckpt_dir "/path/to/13B" --tokenizer_path "/path/to/tokenizer.model"

But how can I load it so it runs with a plain python example.py, without torchrun? That way we could build an API around it and wouldn't have to re-run example.py for every new prompt.

ruian0 avatar Mar 03 '23 07:03 ruian0

@ruian0 Could python -m torch.distributed.run --nproc_per_node 2 example.py --ckpt_dir "/path/to/13B" --tokenizer_path "/path/to/tokenizer.model" do the trick? I use an older torch version where torchrun is not available and it works for me 😅

Logophoman avatar Mar 03 '23 13:03 Logophoman

What Logophoman said. I submitted a commit yesterday to add this to the README, as torchrun doesn't work in all environments.

Inserian avatar Mar 03 '23 17:03 Inserian

Thanks @Logophoman and @Inserian for looking into it! But I think I was looking for something like the below. For example, if we want to serve an API with Flask, we can host the service with

$ gunicorn api:app

and inside api.py, we can have something like

from flask import Flask, request

app = Flask(__name__)


class LLaMA:
    def __init__(self):
        # Load the weights once at startup so every request reuses them
        self.llama = self.load_llama()

    def load_llama(self):
        # Placeholder: load the checkpoint + tokenizer and return a generator
        pass


llama_engine = LLaMA()


@app.route('/')
def llama_completion():
    prompt = request.args.get('prompt', '')
    return llama_engine.llama.generate(prompt)

I tested serving an API this way and it worked for the 7B model, since that one needs no model parallelism. For the 13B model, though, the parallelism is handled by torchrun + fairscale.nn.model_parallel.initialize, and I was not able to find a way to load the 13B model in a similar way.
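
For context, what torchrun contributes here is essentially the per-process rank/world-size environment and the process group that fairscale's initialize_model_parallel builds on. Below is a rough sketch of recreating that by hand with torch.multiprocessing, so the shards could in principle be loaded without torchrun; the worker function, paths, and sizes are placeholders, not code from the repo.

import os
import torch
import torch.distributed as dist
import torch.multiprocessing as mp
from fairscale.nn.model_parallel.initialize import initialize_model_parallel


def worker(rank: int, world_size: int):
    # Recreate the environment variables torchrun would normally export
    os.environ.setdefault("MASTER_ADDR", "127.0.0.1")
    os.environ.setdefault("MASTER_PORT", "29500")
    dist.init_process_group("nccl", rank=rank, world_size=world_size)
    initialize_model_parallel(world_size)
    torch.cuda.set_device(rank)
    # ... load this rank's shard of the 13B checkpoint and serve requests ...


if __name__ == "__main__":
    mp.spawn(worker, args=(2,), nprocs=2)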

ruian0 avatar Mar 03 '23 20:03 ruian0

I have the same problem

MohamedAliRashad avatar Mar 04 '23 19:03 MohamedAliRashad

Maybe try specifying the MP size before launching Flask: MP_SIZE=2 gunicorn api:app
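
A rough sketch of how api.py could pick that variable up and fill in the rendezvous environment that torchrun would otherwise provide; MP_SIZE and the helper name are illustrative, not part of the repo.

import os


def fake_torchrun_env(rank: int) -> None:
    # Illustrative helper: export the variables torchrun would normally set,
    # so the repo's env-based setup can run under gunicorn.
    world_size = int(os.environ.get("MP_SIZE", "1"))
    os.environ.setdefault("MASTER_ADDR", "127.0.0.1")
    os.environ.setdefault("MASTER_PORT", "29500")
    os.environ["RANK"] = str(rank)
    os.environ["LOCAL_RANK"] = str(rank)
    os.environ["WORLD_SIZE"] = str(world_size)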

Inserian avatar Mar 04 '23 20:03 Inserian

I got it working by following the instructions in this repo: https://github.com/zsc/llama_infer. It uses Hugging Face's transformers and accelerate to load the model. Since it no longer needs torchrun, you can stick it in a Flask or FastAPI script with no problem.
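
A minimal sketch of that route, assuming the weights have already been converted to the Hugging Face format and you are on a transformers release that ships the LLaMA classes; the path below is a placeholder.

import torch
from transformers import LlamaForCausalLM, LlamaTokenizer

model_path = "/path/to/converted-13B"  # placeholder: converted HF checkpoint
tokenizer = LlamaTokenizer.from_pretrained(model_path)
model = LlamaForCausalLM.from_pretrained(
    model_path,
    torch_dtype=torch.float16,
    device_map="auto",  # accelerate shards the layers across the available GPUs
)

inputs = tokenizer("Hello, my name is", return_tensors="pt").to(model.device)
output = model.generate(**inputs, max_new_tokens=32)
print(tokenizer.decode(output[0], skip_special_tokens=True))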

loganlebanoff avatar Mar 06 '23 14:03 loganlebanoff

I encountered the same issue. As you probably have seen, the issue has been solved by https://github.com/facebookresearch/llama/pull/147

Adam4397 avatar Mar 08 '23 22:03 Adam4397

I hit the same problem building API scripts around the Llama 2 models. With multiple GPUs it reports that the port is already occupied. Using the method in #147, the llama2-7b-chat model works, but 13B and 70B return no results, and the API script reports no errors.
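
One common workaround for that kind of port clash, assuming the error is the default rendezvous port 29500 already being taken, is to grab a free port and export it as MASTER_PORT before the process group is created; the helper name here is made up.

import os
import socket


def pick_free_master_port() -> None:
    # Ask the OS for an unused port and hand it to torch.distributed
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as s:
        s.bind(("127.0.0.1", 0))
        os.environ["MASTER_PORT"] = str(s.getsockname()[1])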

Maxhyl avatar Oct 10 '23 06:10 Maxhyl