
replies / slow

Open WEBELSYS opened this issue 1 year ago • 16 comments

How can I configure it to show replies in real time instead of waiting for the generation to finish?

Thanks

WEBELSYS avatar Aug 18 '23 17:08 WEBELSYS

Currently the replies are already streamed one word at a time. I wonder if the first word is taking a long time to appear for you? In that case, consider running the 7B model (if you aren't already) for better performance.

mayankchhabra avatar Aug 18 '23 17:08 mayankchhabra

@mayankchhabra I think this is a bug. I am having the same issue

AndreiSva avatar Aug 18 '23 17:08 AndreiSva

For me, it really does wait for the generation to finish before showing any text.

WEBELSYS avatar Aug 18 '23 17:08 WEBELSYS

@AndreiSva and @WEBELSYS can you please share which model you're trying, and the specs of your hardware (OS, CPU, RAM)?

mayankchhabra avatar Aug 18 '23 17:08 mayankchhabra

For the first few days it streamed word by word, but now it waits for the end. 7B and 13B models, Ryzen 5800X3D, 32 GB DDR4.

WEBELSYS avatar Aug 18 '23 17:08 WEBELSYS

I am running Linux on a Ryzen 7 3700X with 32 GB of RAM.

AndreiSva avatar Aug 18 '23 17:08 AndreiSva

Everything's also running extremely slowly. Here's a screen recording of what a simple generation looks like: Screencast from 2023-08-18 10-31-58.webm

It took almost 3 minutes.

AndreiSva avatar Aug 18 '23 17:08 AndreiSva

About the same or worse, with one word at a time, running the 70B model on an EPYC 7502P, 128 GB RAM, Ubuntu 22.04.

rsebi avatar Aug 18 '23 18:08 rsebi

Same for me. I'm using an old laptop (i7-4800MQ, 8 GB of RAM, SSD) and it's very, very slow with the 7B model. I know the laptop is not powerful, but it should be able to give a simple reply... or not? Thank you

theRAGEhero avatar Aug 19 '23 11:08 theRAGEhero

Speed is fine on my platform, but tokens aren't streamed. Using a Ryzen Threadripper 2950X and 32 GB of RAM on Fedora.

JamieMair avatar Aug 19 '23 12:08 JamieMair

I waited 10 minutes, but nothing happened. I never saw any output, but my CPU spiked to 100%.

hundehausen avatar Aug 20 '23 07:08 hundehausen

If the connection is direct, the response streams one word at a time, but behind an nginx reverse proxy it waits until the whole response is generated.

By the way, after enabling CUDA acceleration, the generation speed improved significantly.

Aincvy avatar Aug 20 '23 11:08 Aincvy

That's a great observation @Aincvy. For anyone facing this issue, can you please confirm whether you're running LlamaGPT behind a reverse proxy, like nginx? If so, it would be great if you could paste your proxy config; you may need to make some adjustments for streaming to work.
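
By default, nginx buffers proxied responses, which holds the streamed tokens back until the generation completes. Something along these lines in the `location` block should help (an untested sketch, assuming LlamaGPT is listening on localhost:3000; adjust for your setup):

```nginx
location / {
    proxy_pass http://localhost:3000;
    proxy_set_header Host $host;

    # Forward streamed tokens to the client as they arrive,
    # instead of buffering the whole response
    proxy_buffering off;
    proxy_cache off;

    # Keep the upstream connection open for streaming
    proxy_http_version 1.1;
    proxy_set_header Connection '';

    # Give long generations time to finish
    proxy_read_timeout 1h;
}
```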

mayankchhabra avatar Aug 21 '23 10:08 mayankchhabra

That was indeed it! Connecting directly to IP:3000 is much faster and streams properly.

WEBELSYS avatar Aug 21 '23 14:08 WEBELSYS

@WEBELSYS @mayankchhabra can you share the config you used? I am using a basic proxy pass and it is showing the issues stated above:

```nginx
server {
    listen 80;
    server_name chat.randomprivateurl.com;

    location / {
        proxy_pass http://localhost:3000;
        proxy_set_header Host $host;
        proxy_set_header X-Real-IP $remote_addr;
        proxy_set_header X-Forwarded-Proto $scheme;
    }

    listen 443 ssl; # managed by Certbot
    ssl_certificate /etc/letsencrypt/live/chat.randomprivateurl.com/fullchain.pem;
    ssl_certificate_key /etc/letsencrypt/live/chat.randomprivateurl.com/privkey.pem;
    ssl_dhparam /etc/letsencrypt/ssl-dhparams.pem; # managed by Certbot
}
```

Agility0493 avatar Aug 21 '23 15:08 Agility0493

I'm using nginx too; even with the "server-sent events" tweak in my nginx config, it still doesn't work: https://stackoverflow.com/questions/13672743/eventsource-server-sent-events-through-nginx
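
For reference, the tweak from that thread is roughly the following (paraphrasing, not my exact config):

```nginx
location / {
    proxy_pass http://localhost:3000;

    # Directives suggested in the linked thread for server-sent events
    proxy_http_version 1.1;
    proxy_set_header Connection '';
    chunked_transfer_encoding off;
    proxy_buffering off;
    proxy_cache off;
}
```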

ngxson avatar Aug 31 '23 07:08 ngxson