
How to load the multi-GPU version without torchrun

Open ruian0 opened this issue 1 year ago • 7 comments

Hi Community,

I was able to run example.py for the 13B model and see a result on two T4 GPUs (16 GB each) using torchrun:

 torchrun --nproc_per_node 2 example.py --ckpt_dir "/path/to/13B" --tokenizer_path "/path/to/tokenizer.model"

But how can I load it so it runs with a plain python example.py, without torchrun? That way we could build an API around it and wouldn't have to re-run example.py for every new prompt.

ruian0 avatar Mar 03 '23 07:03 ruian0

@ruian0 Could python -m torch.distributed.run --nproc_per_node 2 example.py --ckpt_dir "/path/to/13B" --tokenizer_path "/path/to/tokenizer.model" do the trick? I use an older torch version where torchrun is not available and it works for me 😅

Logophoman avatar Mar 03 '23 13:03 Logophoman

What Logophoman said. I submitted a commit yesterday to add this to the README, as torchrun doesn't work in all environments.

Inserian avatar Mar 03 '23 17:03 Inserian

Thanks @Logophoman and @Inserian for looking into it! But I think I was looking for something like the below. For example, if we want to serve an API with Flask, we can host the service with

$ gunicorn api:app

and inside api.py, we can have something like

from flask import Flask, request

app = Flask(__name__)


class LLaMA:
    def __init__(self):
        # Load the weights once at startup so every request reuses them
        self.llama = self.load_llama()

    def load_llama(self):
        # Placeholder: load the checkpoint + tokenizer and return a generator
        pass


llama_engine = LLaMA()


@app.route('/')
def llama_completion():
    prompt = request.args.get('prompt', '')
    return llama_engine.llama.generate(prompt)

I tested serving an API this way and it worked for the 7B model, since that one needs no model parallelism. For the 13B model, though, the parallelism is handled by torchrun + fairscale.nn.model_parallel.initialize, and I was not able to find a way to load the 13B model in a similar way.
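
For context, what torchrun contributes here is essentially the per-process rank/world-size environment and the process group that fairscale's initialize_model_parallel builds on. Below is a rough sketch of recreating that by hand with torch.multiprocessing, so the shards could in principle be loaded without torchrun; the worker function, paths, and sizes are placeholders, not code from the repo.

import os
import torch
import torch.distributed as dist
import torch.multiprocessing as mp
from fairscale.nn.model_parallel.initialize import initialize_model_parallel


def worker(rank: int, world_size: int):
    # Recreate the environment variables torchrun would normally export
    os.environ.setdefault("MASTER_ADDR", "127.0.0.1")
    os.environ.setdefault("MASTER_PORT", "29500")
    dist.init_process_group("nccl", rank=rank, world_size=world_size)
    initialize_model_parallel(world_size)
    torch.cuda.set_device(rank)
    # ... load this rank's shard of the 13B checkpoint and serve requests ...


if __name__ == "__main__":
    mp.spawn(worker, args=(2,), nprocs=2)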

ruian0 avatar Mar 03 '23 20:03 ruian0

I have the same problem

MohamedAliRashad avatar Mar 04 '23 19:03 MohamedAliRashad

Maybe try specifying the MP size before launching Flask: MP_SIZE=2 gunicorn api:app
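
A rough sketch of how api.py could pick that variable up and fill in the rendezvous environment that torchrun would otherwise provide; MP_SIZE and the helper name are illustrative, not part of the repo.

import os


def fake_torchrun_env(rank: int) -> None:
    # Illustrative helper: export the variables torchrun would normally set,
    # so the repo's env-based setup can run under gunicorn.
    world_size = int(os.environ.get("MP_SIZE", "1"))
    os.environ.setdefault("MASTER_ADDR", "127.0.0.1")
    os.environ.setdefault("MASTER_PORT", "29500")
    os.environ["RANK"] = str(rank)
    os.environ["LOCAL_RANK"] = str(rank)
    os.environ["WORLD_SIZE"] = str(world_size)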

Inserian avatar Mar 04 '23 20:03 Inserian

I got it working by following the instructions in this repo: https://github.com/zsc/llama_infer. It uses Hugging Face's transformers and accelerate to load the model. Since it no longer needs torchrun, you can stick it in a Flask or FastAPI script with no problem.
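
A minimal sketch of that route, assuming the weights have already been converted to the Hugging Face format and you are on a transformers release that ships the LLaMA classes; the path below is a placeholder.

import torch
from transformers import LlamaForCausalLM, LlamaTokenizer

model_path = "/path/to/converted-13B"  # placeholder: converted HF checkpoint
tokenizer = LlamaTokenizer.from_pretrained(model_path)
model = LlamaForCausalLM.from_pretrained(
    model_path,
    torch_dtype=torch.float16,
    device_map="auto",  # accelerate shards the layers across the available GPUs
)

inputs = tokenizer("Hello, my name is", return_tensors="pt").to(model.device)
output = model.generate(**inputs, max_new_tokens=32)
print(tokenizer.decode(output[0], skip_special_tokens=True))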

loganlebanoff avatar Mar 06 '23 14:03 loganlebanoff

I encountered the same issue. As you probably have seen, the issue has been solved by https://github.com/facebookresearch/llama/pull/147

Adam4397 avatar Mar 08 '23 22:03 Adam4397

I hit the same problem building API scripts around the Llama 2 models. With multiple GPUs it reports that the port is already occupied. Using the method in #147, the llama2-7b-chat model works, but 13B and 70B return no results, and the API script reports no errors.
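
One common workaround for that kind of port clash, assuming the error is the default rendezvous port 29500 already being taken, is to grab a free port and export it as MASTER_PORT before the process group is created; the helper name here is made up.

import os
import socket


def pick_free_master_port() -> None:
    # Ask the OS for an unused port and hand it to torch.distributed
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as s:
        s.bind(("127.0.0.1", 0))
        os.environ["MASTER_PORT"] = str(s.getsockname()[1])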

Maxhyl avatar Oct 10 '23 06:10 Maxhyl