[Usage]: Higher Context Length.
I'm running a GGUF model on Aphrodite Engine. I want a context length of 8192, but I can only get it to load with about 4096 because I'm short on VRAM (I only have 30 GB). Is there any way to offload part of it into CPU RAM?
Script I'm using
Your current environment
Model = "Abdulhanan2006/LogicLumina-IQ2_M"
Revision = "main"
Quantization = "gguf"
GPU_Memory_Utilization = 1
Context_Length = 4096
launch_kobold_api = False
OpenAI_API_Key = ""
FP8_KV_Cache = False
#Aphrodite Engine
%pip show aphrodite-engine &> /dev/null && echo "Existing Aphrodite Engine installation found. Updating..." && pip uninstall aphrodite-engine -q -y
!echo "Installing/Updating the Aphrodite Engine, this may take a while..."
%pip install aphrodite-engine --extra-index-url https://downloads.pygmalion.chat/whl > /dev/null 2>&1
!echo "Installation successful! Starting the engine now."
#%pip show aphrodite-engine &> /dev/null && echo "Existing Aphrodite Engine installation found. Updating..." && pip uninstall aphrodite-engine -q -y
#!echo "Installing/Updating the Aphrodite Engine, this may take a while..."
#%pip install aphrodite-engine==0.5.1 > /dev/null 2>&1
#!echo "Installation successful! Starting the engine now."
#RAY
!pip install -U "ray[all]"
!pip install grpcio==1.62.1
#Ngrok
!pip3 install pyngrok
!echo "Creating a Ngrok URL..."
from pyngrok import ngrok
!ngrok authtoken 2gAo4R6fC0ND3YnulbgncrYrvLx_6E28TkGZZJTeT58MJ6GQY
##Aphrodite
model = Model
gpu_memory_utilization = GPU_Memory_Utilization
context_length = Context_Length
api_key = OpenAI_API_Key
quant = Quantization
kobold = launch_kobold_api
revision = Revision
fp8_kv = FP8_KV_Cache
command = [
    "python", "-m", "aphrodite.endpoints.openai.api_server",
    "--dtype", "float16",
    "--model", model,
    "--host", "127.0.0.1",
    "--max-log-len", "0",
    "--gpu-memory-utilization", str(gpu_memory_utilization),
    "--max-model-len", str(context_length),
    "--tensor-parallel-size", "2",
    "--enable-chunked-prefill",
]
if kobold: command.append("--launch-kobold-api")
if quant != "None": command.extend(["-q", quant])
if fp8_kv: command.extend(["--kv-cache-dtype", "fp8"])  # flag and value as separate argv entries, matching the other options
if api_key != "": command.extend(["--api-keys", api_key])
!ngrok http --domain=vertically-amazed-spider.ngrok-free.app 2242 & {" ".join(command)}
You can try the FP8 KV cache or chunked prefill (the two are mutually exclusive for now).
--kv-cache-dtype fp8 | --enable-chunked-prefill
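To see why the FP8 KV cache helps here: the KV cache stores one K and one V vector per token, per layer, so halving the bytes per element roughly doubles the context that fits in the same memory budget. A back-of-the-envelope sketch (the layer/head/dim numbers below are illustrative placeholders, not LogicLumina's actual dimensions):

```python
def kv_cache_bytes(ctx_len, n_layers, n_kv_heads, head_dim, bytes_per_elem):
    # K and V each take ctx_len * n_kv_heads * head_dim elements per layer,
    # hence the factor of 2.
    return 2 * n_layers * n_kv_heads * head_dim * ctx_len * bytes_per_elem

# Hypothetical GQA model: 80 layers, 8 KV heads, head_dim 128, ctx 8192
fp16 = kv_cache_bytes(8192, 80, 8, 128, 2)  # fp16/bf16: 2 bytes/elem
fp8 = kv_cache_bytes(8192, 80, 8, 128, 1)   # fp8: 1 byte/elem, half the size
print(f"fp16 KV cache: {fp16 / 2**30:.2f} GiB")
print(f"fp8  KV cache: {fp8 / 2**30:.2f} GiB")
```

Whatever the real dimensions are, the fp8 cache is exactly half the fp16 one, which is often the difference between fitting 4096 and 8192 context on a tight VRAM budget.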
Okay, but will they slow anything down, or cause any accuracy loss?