
[Usage]: Higher Context Length.

Abulhanan opened this issue 1 year ago • 2 comments

I am using a GGUF model on Aphrodite Engine. The issue is that I want a context length of 8192, but I can only get it to load at about 4096. The problem is that I'm short on VRAM... I have only 30 GB of VRAM. Is there any way I can offload part of it into CPU RAM?

Script I'm using:

Your current environment

Model = "Abdulhanan2006/LogicLumina-IQ2_M" Revision = "main" Quantization = "gguf" GPU_Memory_Utilization = 1 Context_Length = 4096 launch_kobold_api = False OpenAI_API_Key = "" FP8_KV_Cache = False

#Aphrodite Engine
%pip show aphrodite-engine &> /dev/null && echo "Existing Aphrodite Engine installation found. Updating..." && pip uninstall aphrodite-engine -q -y
!echo "Installing/Updating the Aphrodite Engine, this may take a while..."
%pip install aphrodite-engine --extra-index-url https://downloads.pygmalion.chat/whl > /dev/null 2>&1
!echo "Installation successful! Starting the engine now."

#%pip show aphrodite-engine &> /dev/null && echo "Existing Aphrodite Engine installation found. Updating..." && pip uninstall aphrodite-engine -q -y
#!echo "Installing/Updating the Aphrodite Engine, this may take a while..."
#%pip install aphrodite-engine==0.5.1 > /dev/null 2>&1
#!echo "Installation successful! Starting the engine now."

#RAY
!pip install -U "ray[all]"
!pip install grpcio==1.62.1

#Ngrok
!pip3 install pyngrok
!echo "Creating a Ngrok URL..."
from pyngrok import ngrok
!ngrok authtoken 2gAo4R6fC0ND3YnulbgncrYrvLx_6E28TkGZZJTeT58MJ6GQY

##Aphrodite
model = Model
gpu_memory_utilization = GPU_Memory_Utilization
context_length = Context_Length
api_key = OpenAI_API_Key
quant = Quantization
kobold = launch_kobold_api
revision = Revision
fp8_kv = FP8_KV_Cache

command = [
    "python", "-m", "aphrodite.endpoints.openai.api_server",
    "--dtype", "float16",
    "--model", model,
    "--host", "127.0.0.1",
    "--max-log-len", "0",
    "--gpu-memory-utilization", str(gpu_memory_utilization),
    "--max-model-len", str(context_length),
    "--tensor-parallel-size", "2",
    "--enable-chunked-prefill",
]

if kobold:
    command.append("--launch-kobold-api")

if quant != "None":
    command.extend(["-q", quant])

if fp8_kv:
    # Bug fix: flag and value must be separate argv entries; appending the single
    # string "--kv-cache-dtype fp8" produces one malformed token the parser rejects.
    command.extend(["--kv-cache-dtype", "fp8"])

if api_key != "":
    command.extend(["--api-keys", api_key])
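With the config cell's values (quant = "gguf"; kobold, fp8_kv, and api_key all unset), the joined command in the next cell works out to roughly:

python -m aphrodite.endpoints.openai.api_server \
    --dtype float16 \
    --model Abdulhanan2006/LogicLumina-IQ2_M \
    --host 127.0.0.1 \
    --max-log-len 0 \
    --gpu-memory-utilization 1 \
    --max-model-len 4096 \
    --tensor-parallel-size 2 \
    --enable-chunked-prefill \
    -q gguf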

!ngrok http --domain=vertically-amazed-spider.ngrok-free.app 2242 & {" ".join(command)}
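Side note for running this outside a notebook: pyngrok (installed above) can open the tunnel from Python instead of backgrounding the ngrok CLI. A minimal sketch, assuming the server listens on Aphrodite's default API port 2242 (which the shell line above also relies on); the auth token is left as a placeholder:

from pyngrok import ngrok
import subprocess

ngrok.set_auth_token("<your-ngrok-authtoken>")  # placeholder; the notebook hardcodes its token above
tunnel = ngrok.connect(2242, "http")            # forward the engine's default API port
print("Public URL:", tunnel.public_url)
subprocess.run(command)                         # blocks while the API server runs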

Abulhanan · May 28 '24 13:05

You can try FP8 KV cache or use chunked prefill (mutually exclusive for now).

--kv-cache-dtype fp8 | --enable-chunked-prefill
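Applied to the script above, each option is a small change to the command cell. Note that --enable-chunked-prefill is already in the base command there, so per the "mutually exclusive" caveat only one of the two should be active (a sketch):

# Option A: FP8 KV cache -- stores the KV cache in 8-bit floats, roughly halving
# its VRAM footprint; remove --enable-chunked-prefill from the base command first.
command.extend(["--kv-cache-dtype", "fp8"])

# Option B: chunked prefill -- already enabled via --enable-chunked-prefill
# in the base command above; no change needed.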

AlpinDale · May 28 '24 15:05

Okay, but will they slow anything down, or cause any accuracy loss?

Abulhanan · May 28 '24 15:05