
Add generation server scripts using HF accelerate and DS-inference

Open mayank31398 opened this issue 1 year ago • 12 comments

Currently, the scripts are working correctly. However, there is a memory leak. In https://github.com/huggingface/accelerate/issues/614, @sgugger says that it's not in accelerate, which is probably true.

My hunch is that this bug is related: https://github.com/apache/incubator-mxnet/issues/19159

mayank31398 avatar Aug 11 '22 08:08 mayank31398

This is still WIP @stas00

mayank31398 avatar Aug 11 '22 08:08 mayank31398

@stas00, I believe the code is in good shape. You can start reviewing. Also, I'll start updating the README with instructions tomorrow.

mayank31398 avatar Aug 16 '22 22:08 mayank31398

Oh wow, you developed a whole package around this. Impressive work, @mayank31398

Let's just discuss the user-facing APIs before implementing further, to minimize wasted effort.

Originally, these were just scripts to benchmark and demo how the various solutions can be used, but now I'm not so sure how demo-friendly these are with so many separate files.

Some design questions:

  • Why does the CLI import from the server code?
  • Would it really make things simpler to have one CLI script for all the different systems? They are quite different, after all, with very different arguments - e.g. ds-zero will need an nvme path as well as flags for cpu/nvme offload. I just had those hardcoded - but there are 3 ways to invoke this one: gpu / cpu-offload / nvme-offload.
  • The while loop in cli.py isn't debug-friendly - can we activate it only with an explicit flag, and by default just use pre-made inputs - or perhaps 2 separate scripts? I'm not even sure who would ever want to manually type text - perhaps have a flag to feed it a file with prompts instead (see the sketch after this list)? Not sure.
  • The multiple-input BS>1 functionality got lost in the transition.
  • The naming of the scripts lost its mnemonics - cli.py is very ambiguous since it tells me nothing about what it does. Please see my suggestions at https://github.com/bigscience-workshop/Megatron-DeepSpeed/pull/328#pullrequestreview-1070168477, but I suppose that the way it's going, it's no longer bloom-specific either and can probably work with any other model.
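To make that concrete, here is a minimal sketch of the kind of flag handling I have in mind - the flag names, default prompts and overall structure are just an illustration, not the actual cli.py code:

import argparse

parser = argparse.ArgumentParser()
parser.add_argument("--input_file", type=str, default=None, help="file with one prompt per line")
parser.add_argument("--interactive", action="store_true", help="read prompts from stdin in a while loop")
args = parser.parse_args()

if args.input_file is not None:
    # batch mode: read prompts from a file, one per line
    with open(args.input_file) as f:
        prompts = [line.strip() for line in f if line.strip()]
elif args.interactive:
    # only drop into the manual-typing while loop when explicitly requested
    prompts = None
else:
    # default: pre-made inputs so the script runs out of the box and is debug-friendly
    prompts = ["DeepSpeed is a machine learning framework", "The sky is blue because"]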

I'm also no longer sure what to do with this package. Originally the plan was to stash the original demo scripts under transformers' research files, but now it's growing towards an independent package. Not a bad thing, I'm just thinking aloud.

stas00 avatar Aug 17 '22 17:08 stas00

  1. I found it easier to deploy using DeepSpeed-MII and leverage that for the CLI. But I wasn't really sure of the overhead it causes, so I am still using the barebones DS-inference for benchmarking. Using DeepSpeed-MII lets me get away without doing a broadcast operation on the input. Also, for code modification (if required down the line), this might be simpler.
  2. I am not sure if we should add DS-ZeRO to the CLI. The latency for me (batch size = 1 and output tokens = 100) is 23 sec for accelerate, 7 sec for DS-inference and 230 sec for DS-ZeRO. So, it doesn't make sense to add that. It's nice to have it in benchmarking, though.
  3. I think adding an --input_file flag to the CLI script might be useful. If not provided, we can launch in interactive mode. I will try to make this debug-friendly. Currently, it crashes.
  4. batch_size > 1 still works, in both server mode and the benchmark script. Not in the CLI (since I am expecting the user to input via the command line for now).
  5. I was having trouble with the naming convention (because of the hyphens in the names, I couldn't import the packages). We can rename cli to interactive or something.

Let me know your thoughts on this. Thanks for the review :)

mayank31398 avatar Aug 17 '22 19:08 mayank31398

Also, I am not sure if this works for other models (a few strings are hardcoded). I think for now we should stick to BLOOM. I can open up another PR to support other models. Adding so much functionality in a single PR might not be the best thing to do.

mayank31398 avatar Aug 17 '22 19:08 mayank31398

  1. I found it easier to deploy using DeepSpeed-MII and leverage that for the CLI. But I wasn't really sure of the overhead it causes, so I am still using the barebones DS-inference for benchmarking. Using DeepSpeed-MII lets me get away without doing a broadcast operation on the input. Also, for code modification (if required down the line), this might be simpler.

so you are not benchmarking the same thing then.

you don't need to broadcast anything for non-server usage. It sounds like you built cli.py as a client to the server, but that's too complicated for someone who just wants to process a few inputs.

The intention for the cli was for it to work the way it did in the original code - tokenize + generate + detokenize - done.
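i.e. conceptually just this (a rough sketch with transformers; the model name and generation settings are only illustrative):

from transformers import AutoTokenizer, AutoModelForCausalLM

tokenizer = AutoTokenizer.from_pretrained("bigscience/bloom")
model = AutoModelForCausalLM.from_pretrained("bigscience/bloom", device_map="auto", torch_dtype="auto")

# tokenize
inputs = tokenizer(["DeepSpeed is a machine learning framework"], return_tensors="pt").to(model.device)
# generate
outputs = model.generate(**inputs, max_new_tokens=100)
# detokenize
print(tokenizer.batch_decode(outputs, skip_special_tokens=True))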

Perhaps we have miscommunicated here.

2. I am not sure if we should add DS-ZeRO to the CLI. The latency for me (batch size = 1 and output tokens = 100) is 23 sec for accelerate, 7 sec for DS-inference and 230 sec for DS-ZeRO. So, it doesn't make sense to add that. It's nice to have it in benchmarking, though.

I answered earlier in the code comment.

3. I think adding an --input_file flag to the CLI script might be useful. If not provided, we can launch in interactive mode. I will try to make this debug-friendly. Currently, it crashes.

I meant the script should work out of the box and require no user input. I think we have a conflict of use cases here: I wanted an easy-to-understand-and-replicate demo; you want a standalone system.

Perhaps the right solution is to just keep the original scripts as they are for the demo/benchmarking purpose and have your package for production use. What do you think? There will be a bit of duplication, but, as you're saying, you don't care for solutions that are not fast enough, which is fair for your needs.

What crashes?

4. batch_size > 1 still works, in both server mode and the benchmark script. Not in the CLI (since I am expecting the user to input via the command line for now).

5. I was having trouble with the naming convention (because of the hyphens in the names, I couldn't import the packages). We can rename cli to interactive or something.

for end user scripts there should be no need to import them.

Also, I am not sure if this works for other models (a few strings are hardcoded). I think for now we should stick to BLOOM. I can open up another PR to support other models. Adding so much functionality in a single PR might not be the best thing to do.

agreed, let's stick to bloom for now. And that's why I suggested that the naming of the scripts should indicate that they are bloom-specific - your current naming doesn't.

stas00 avatar Aug 17 '22 19:08 stas00

you don't need to broadcast anything for non-server usage. It sounds like you built cli.py as a client to the server, but that's too complicated for someone who just wants to process a few inputs.

The cli works this way - the user only needs to provide the input text. Also, I need to call the generate method on all 8 GPU processes (in non-server usage), and since the input text will only be requested on 1 process, you need to broadcast it, as far as I understand. Yes, the cli for ds-inference is built as a client to the server. I can change it, though, if we want. In the original ds-inference script, @stas00, you didn't need to broadcast since each process was creating the list.
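For reference, the broadcast I mean is along these lines (a minimal sketch with torch.distributed; it assumes the process group is already initialized, which deepspeed does for us):

import torch.distributed as dist

def get_input_text():
    # only rank 0 talks to the user; the other ranks receive the prompt via broadcast
    prompt = [input("Input text: ")] if dist.get_rank() == 0 else [None]
    dist.broadcast_object_list(prompt, src=0)  # every rank blocks here until rank 0 sends
    return prompt[0]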

I will incorporate your suggestions.

mayank31398 avatar Aug 17 '22 20:08 mayank31398

So, just to summarize for cli.py: you want the option for the user to provide an input text file?

mayank31398 avatar Aug 17 '22 20:08 mayank31398

Also, @stas00, I am fine with both. If we want, we can keep this branch as a separate branch for users who want to leverage a standalone system, and maybe just add an instruction to the README in the main branch telling them to use this branch for deployment purposes. What do you think?

mayank31398 avatar Aug 17 '22 21:08 mayank31398

A branch, no, as branches are hard to keep in sync - and especially since we will move this out of Meg-DS anyway once you're done. I propose we create 2 sub-folders:

bloom-inference-scripts
bloom-inference-server
  1. one for the straight scripts - i.e. move the current ones there instead of deleting them.
  2. the 2nd is for your sophisticated setup - and there you don't need to support ZeRO at all. And you can leave everything as it is now in this PR.

I think that would allow meeting different use cases w/o getting in the way of each other.

I know I was the one who proposed to refactor to support your server work, but now I can see that keeping both versions is probably the simplest design choice.

stas00 avatar Aug 17 '22 21:08 stas00

So, just to summarize for cli.py: you want the option for the user to provide an input text file?

If we keep the original scripts as they are, I think we can keep your version as it is for now w/o any cmd args changes. If someone asks for it, it can be added later.

stas00 avatar Aug 17 '22 22:08 stas00

Thanks for the comments, Stas. Will incorporate the changes.

I have added back the older scripts in the scripts/inference folder (original path). Will update README.md.

mayank31398 avatar Aug 18 '22 07:08 mayank31398

I have added back the older scripts in the scripts/inference folder (original path).

and let's move the 2 solutions to their respective sub-dirs:

bloom-inference-scripts
bloom-inference-server

we don't want them to be together, as it'd be difficult to understand quickly.

thank you, @mayank31398

stas00 avatar Aug 19 '22 16:08 stas00


I assume these 2 folders should be located inside scripts/? Or in the root folder?

mayank31398 avatar Aug 19 '22 17:08 mayank31398

I assume these 2 folders should be located inside scripts/

yes, please.

stas00 avatar Aug 19 '22 17:08 stas00

Done. Added README too @stas00

mayank31398 avatar Aug 20 '22 11:08 mayank31398

I want to remove the temporary JSON file created while caching. However, the new commit on the DeepSpeed master branch is not working for me: https://github.com/microsoft/DeepSpeed/pull/2132#issuecomment-1221592051 Waiting for a fix, @RezaYazdaniAminabadi.

mayank31398 avatar Aug 21 '22 19:08 mayank31398

Looks great, @mayank31398 - thank you!

I need to think about how I can test the functionality. It would probably work from a local client on the same host (as the node I have access to has no outside access).

stas00 avatar Aug 22 '22 23:08 stas00

I see, @stas00. I have a script that does that for you. Are you saying the node you have can only be accessed from a login node or something? If that is how your setup is, I have a similar setup: our compute nodes have no outside access, only the login node. If I am misunderstanding the problem, ignore this :trollface: I use the following script to test from my local machine:

from sshtunnel import SSHTunnelForwarder
import requests
import paramiko


class BloomServer:

    def __init__(self, bloom_ip="<IP HERE>", bloom_port=<PORT HERE>, relative_path="/generate/",
                 private_key_path="<PATH TO ID_RSA>"):
        # open an SSH tunnel through the login node to the machine running the generation server
        pkey = paramiko.RSAKey.from_private_key_file(private_key_path)
        self.remote_host = "<SSH IP LOGIN NODE>"
        self.remote_user = "<USERNAME>"
        self.remote_port = <PORT FOR LOGIN NODE>
        self.bloom_ip = bloom_ip
        self.bloom_port = bloom_port
        self.server = SSHTunnelForwarder(
            ssh_address_or_host=self.remote_host,
            ssh_username=self.remote_user,
            ssh_private_key=pkey,
            remote_bind_address=(self.bloom_ip, self.bloom_port),
        )
        self.server.skip_tunnel_checkup = False
        self.server.start()
        self.server.check_tunnels()
        print(self.server.tunnel_is_up, flush=True)

        # requests sent to this local URL get forwarded through the tunnel to the server
        server_local_url = "127.0.0.1:" + str(self.server.local_bind_address[1])
        self.local_url = "http://" + server_local_url + relative_path

    def get_response(self, prompt):
        # generation parameters sent along with the prompt
        request_body = {
            "text": prompt,
            "min_length": 1,
            "top_p": 1,
            "top_k": 5,
            "temperature": 1,
            "max_new_tokens": 10,
        }
        try:
            response = requests.post(self.local_url, json=request_body, verify=False)
            return response.json(), "OK"
        except Exception:
            return {}, "failed"


if __name__ == "__main__":
    server = BloomServer()
    prompt = "I need some help with the"
    output = server.get_response(prompt)
    print("Prompt:", prompt)
    print("Output:", output)

mayank31398 avatar Aug 23 '22 00:08 mayank31398

Also, switching from Flask to Starlette + FastAPI fixed the memory leak.
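For context, the server side is conceptually just a small FastAPI app along these lines (a sketch only - the request fields mirror the client script above, and generate_text is a placeholder for the actual accelerate / DS-inference call, not the real server code):

from fastapi import FastAPI
from pydantic import BaseModel

app = FastAPI()

class GenerateRequest(BaseModel):
    text: str
    min_length: int = 1
    top_k: int = 5
    top_p: float = 1.0
    temperature: float = 1.0
    max_new_tokens: int = 10

@app.post("/generate/")
def generate(request: GenerateRequest):
    # generate_text is a placeholder for the model call behind the scenes
    output_text = generate_text(request.text, max_new_tokens=request.max_new_tokens)
    return {"text": output_text}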

mayank31398 avatar Aug 23 '22 01:08 mayank31398

So is there still a memory leak or not? You shared that there still was, but then removed the comment.

If there is, we should try to isolate it by first identifying whether it's in accelerate or in the transformers modeling code, simply by repeating the generate call.
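e.g. something as simple as this rough sketch (model/inputs setup omitted; the numbers are just for eyeballing the trend):

import gc
import torch

def watch_memory_while_generating(model, inputs, n_iterations=20, **generate_kwargs):
    # call generate repeatedly with the same inputs; steadily growing allocated memory
    # points at the modeling/generation code path rather than the serving layer
    for i in range(n_iterations):
        _ = model.generate(**inputs, **generate_kwargs)
        gc.collect()
        torch.cuda.empty_cache()
        print(f"iteration {i}: {torch.cuda.memory_allocated() / 2**30:.2f} GiB allocated")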

stas00 avatar Aug 23 '22 16:08 stas00

So, I don't really think it's a memory leak, @stas00: https://github.com/huggingface/accelerate/issues/614 I want to look into this more.

This is what I plan to do: https://github.com/huggingface/accelerate/issues/614#issuecomment-1224285433 It should give us an idea of whether things are good to go or not.

mayank31398 avatar Aug 23 '22 17:08 mayank31398

For now, I will investigate this further. Let's not merge this PR yet :)

mayank31398 avatar Aug 23 '22 17:08 mayank31398

It looks like the transformers internal APIs have changed and the old hack I used for get_checkpoint_files no longer works.

Here is the new drop-in replacement, and the good thing about it is that it's no longer a hack and should continue working:

from pathlib import Path
from transformers.utils import is_offline_mode
from huggingface_hub import snapshot_download

def get_checkpoint_files(model_name_or_path, revision=None):
    # checks if online or not
    if is_offline_mode():
        print("Offline mode: forcing local_files_only=True")
        local_files_only = True
    else:
        local_files_only = False

    # downloads the repo snapshot from the hub (or resolves it from the local cache)
    cached_repo_dir = snapshot_download(model_name_or_path, allow_patterns=["*"], local_files_only=local_files_only, revision=revision)

    # creates a list of paths to all cached files matching the glob *.pt
    file_list = [str(entry) for entry in Path(cached_repo_dir).rglob('*.pt') if entry.is_file()]
    return file_list

you should be able to use it if you update your transformers clone to the latest.
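Hypothetical usage (assuming the model is already in the local cache or the machine is online):

# collect the cached checkpoint shard paths for the model
checkpoint_files = get_checkpoint_files("bigscience/bloom")
print(len(checkpoint_files), "checkpoint files found")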

credits: @philschmid

p.s. we should probably add a minimum version requirement of huggingface_hub>=0.9.0 when this is ready to merge. It's already released, so you can just add it now as well.

from transformers.utils.versions import require_version
require_version("huggingface_hub>=0.9.0")

p.s. note to myself: it doesn't seem to currently respect the TRANSFORMERS_CACHE env var, as that is internal to transformers - need to figure it out.

stas00 avatar Aug 23 '22 23:08 stas00

when I run:

python scripts/bloom-inference-server/cli.py --model_name bigscience/bloom --dtype bf16 --deployment_framework hf_accelerate --generate_kwargs '{"min_length": 100, "max_new_tokens": 100, "do_sample": false}'

the first input went ok, but when it returned it messed things up:

Loading model...
Model loaded
Input text: whoah!
change generate_kwargs? [y/n] y
Generate kwargs: {"min_length": 100, "max_new_tokens": 100, "do_sample": false}

Output text: whoah!"

"What's the matter?" asked the other.

"Matter?" repeated the first. "Why, I thought I heard a shot."

"So did I," said the other. "But it was only a gun."

"Only a gun!" exclaimed the first. "Then it must have been a gun."

"It was," said the other. "But it was only a gun."

"Only a gun!" repeated
Generated tokens: 100
Input text: change generate_kwargs? [y/n]

as you can see in the last line, it mixed 2 inputs together

I think it's because I was impatient and entered 'Enter' at some point.

but it shouldn't allow an empty prompt, should it?

Perhaps it should consume all pending input before printing a new prompt, so that only the response to the actual prompt is used?

Also, it'd be nice if cli.py exited cleanly on Ctrl-C instead of continuing to run and requiring multiple breaks:

Input text: change generate_kwargs? [y/n] ^CTraceback (most recent call last):
  File "scripts/bloom-inference-server/cli.py", line 80, in <module>
    main()
  File "scripts/bloom-inference-server/cli.py", line 61, in main
    if (input("change generate_kwargs? [y/n] ") == "y"):
KeyboardInterrupt

^C
^C
^C
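Something along these lines would do it (a minimal sketch, not the actual cli.py code; generate_text is a placeholder for the model call):

def interactive_loop(generate_text):
    try:
        while True:
            input_text = input("Input text: ").strip()
            if not input_text:
                # ignore empty/accidental Enter presses instead of generating from them
                continue
            print("Output text:", generate_text(input_text))
    except (KeyboardInterrupt, EOFError):
        # a single Ctrl-C / Ctrl-D exits cleanly instead of requiring multiple breaks
        print("\nexiting")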

stas00 avatar Aug 24 '22 00:08 stas00

The DS-inference server crashes for me w/o the cached tp checkpoint

deepspeed --num_gpus 8 scripts/bloom-inference-server/benchmark.py --model_name bigscience/bloom --dtype fp16 --deployment_framework ds_inference  --benchmark_cycles 5
[...]
Traceback (most recent call last):
  File "scripts/bloom-inference-server/benchmark.py", line 188, in <module>
    main()
  File "scripts/bloom-inference-server/benchmark.py", line 179, in main
    benchmark_end_to_end(args, DSInferenceModel)
  File "scripts/bloom-inference-server/benchmark.py", line 75, in benchmark_end_to_end
    model, initialization_time = run_and_log_time(
  File "scripts/bloom-inference-server/benchmark.py", line 37, in run_and_log_time
    results = execs()
  File "/gpfsssd/worksf/projects/rech/six/commun/code/inference/Megatron-DeepSpeed/scripts/bloom-inference-server/utils/utils.py", line 30, in __call__
    return self.func(**self.kwargs)
  File "/gpfsssd/worksf/projects/rech/six/commun/code/inference/Megatron-DeepSpeed/scripts/bloom-inference-server/ds_inference/model.py", line 57, in __init__
    self.model = deepspeed.init_inference(
  File "/gpfsssd/worksf/projects/rech/six/commun/code/inference/DeepSpeed/deepspeed/__init__.py", line 293, in init_inference
    engine = InferenceEngine(model,
  File "/gpfsssd/worksf/projects/rech/six/commun/code/inference/DeepSpeed/deepspeed/inference/engine.py", line 154, in __init__
    self.module.to(device)
  File "/gpfswork/rech/six/commun/conda/inference/lib/python3.8/site-packages/torch/nn/modules/module.py", line 927, in to
    return self._apply(convert)
  File "/gpfswork/rech/six/commun/conda/inference/lib/python3.8/site-packages/torch/nn/modules/module.py", line 579, in _apply
    module._apply(fn)
  File "/gpfswork/rech/six/commun/conda/inference/lib/python3.8/site-packages/torch/nn/modules/module.py", line 579, in _apply
    module._apply(fn)
  File "/gpfswork/rech/six/commun/conda/inference/lib/python3.8/site-packages/torch/nn/modules/module.py", line 602, in _apply
    param_applied = fn(param)
  File "/gpfswork/rech/six/commun/conda/inference/lib/python3.8/site-packages/torch/nn/modules/module.py", line 925, in convert
    return t.to(device, dtype if t.is_floating_point() or t.is_complex() else None, non_blocking)
NotImplementedError: Cannot copy out of meta tensor; no data!

stas00 avatar Aug 24 '22 01:08 stas00

The DS-inference server crashes for me w/o the cached tp checkpoint


Already on it: https://github.com/bigscience-workshop/Megatron-DeepSpeed/pull/328#issuecomment-1221609552 This is a bug in deepspeed: https://github.com/microsoft/DeepSpeed/pull/2132#issuecomment-1221592051

@RezaYazdaniAminabadi @jeffra

mayank31398 avatar Aug 24 '22 03:08 mayank31398

I feel like an empty prompt is fine - this is basically unconditional generation, right? I will try to fix the keyboard interrupt.

mayank31398 avatar Aug 24 '22 07:08 mayank31398

I feel like an empty prompt is fine - this is basically unconditional generation, right?

Good question - I have never tried it - I guess it should work.

stas00 avatar Aug 24 '22 20:08 stas00

Also, @stas00, I have been meaning to ask: is accelerate using DeepSpeed ZeRO as its backend? Because if that is the case, then the generation time per batch for the two shouldn't differ so greatly, right?

mayank31398 avatar Aug 24 '22 22:08 mayank31398