grimulkan
I didn't change the command line, but you mean something might have overridden it? I'll reload the model in the UI and check. The startup message showed the same model...
Common to both runs:

```
Command Line: python server.py --chat --cpu-memory 200GiB --auto-devices --listen-port 6565 --wbits 4 --groupsize 128 --model vicuna-13b-4bit-128g --model_type LLaMA
Prompt: Write a paragraph about the state...
```
Just tried with this command line (without `--auto-devices` or the CPU memory flag): `python server.py --chat --listen-port 6565 --wbits 4 --groupsize 128 --model vicuna-13b-4bit-128g --model_type LLaMA` Same results as before, unfortunately...
No, actually you were correct. Somehow it managed to put the model on multiple GPUs. Forcing CUDA_VISIBLE_DEVICES=0 got it to work. On newer commit: `Output generated in 11.20 seconds (11.61 tokens/s,...
Confirmed: it works fine on the latest commit as long as I set CUDA_VISIBLE_DEVICES=0. Even with my original command line. Guess I'll just manually enable CUDA devices & control the...
You can do it in the batch file in Windows that launches the web ui. After you call activate.bat, set CUDA_VISIBLE_DEVICES=0 (export instead of set in Linux/WSL)
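If editing the launcher script is awkward, the same pinning can also be done from a small Python wrapper. A minimal sketch (the wrapper filename and argument list are just illustrative, taken from the command line above):

```python
import os
import sys

# Hypothetical wrapper (launch.py): pin the web UI to GPU 0 before it starts.
# Equivalent to `set CUDA_VISIBLE_DEVICES=0` in the Windows batch file, or
# `export CUDA_VISIBLE_DEVICES=0` in Linux/WSL.
env = dict(os.environ, CUDA_VISIBLE_DEVICES="0")

cmd = [sys.executable, "server.py", "--chat", "--listen-port", "6565",
       "--wbits", "4", "--groupsize", "128",
       "--model", "vicuna-13b-4bit-128g", "--model_type", "LLaMA"]

# import subprocess
# subprocess.run(cmd, env=env)  # uncomment to actually launch the UI
```

The key point either way is that the variable must be set before CUDA enumerates devices, i.e. before the server process (and torch) starts.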
Adding to that (related question): it looks like the webUI actually inputs the following format in instruct mode (slightly different from my case examples in that the extra prompt is part...
I am not sure we need them to be dynamic. YaRN works both ways? The static version I described above still computes the positional table once at the start, just...
By ‘dynamic’, the paper means something that changes the RoPE scaling depending on the actual context size (it only compresses positions once the context exceeds the original pre-trained size). This is optional. They have...
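To make the static/dynamic distinction concrete, a minimal sketch of the dynamic variant (function and parameter names are mine, not the paper's): the scale factor is recomputed from the running sequence length, and stays at 1.0 until the context exceeds the pre-trained window, so short contexts are left untouched.

```python
def dynamic_scale(seq_len: int, pretrained_len: int = 2048) -> float:
    """Dynamic RoPE scaling: only compress positions once the running
    context is longer than the original pre-training window."""
    return max(1.0, seq_len / pretrained_len)

# Within the original window nothing changes:
assert dynamic_scale(1024) == 1.0
# Beyond it, positions are compressed by seq_len / pretrained_len:
assert dynamic_scale(4096) == 2.0
```

The static version instead bakes a fixed scale into the positional table once at load time, which is why it also compresses short contexts.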
I see. Does that also mess with methods that change the position embeddings by hidden dimension like YaRN?
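For reference, the "by hidden dimension" part can be sketched like this, in the spirit of YaRN (this is a hedged sketch, not the reference implementation; `alpha`/`beta` follow the paper's LLaMA defaults, the rest of the names are mine): high-frequency RoPE dimensions are kept as-is, low-frequency ones are interpolated (divided by the scale), with a linear ramp in between.

```python
import math

def yarn_freqs(dim=128, base=10000.0, scale=4.0, orig_ctx=2048,
               alpha=1.0, beta=32.0):
    """Per-dimension RoPE frequency adjustment in the spirit of YaRN."""
    freqs = []
    for i in range(0, dim, 2):
        theta = base ** (-i / dim)           # standard RoPE frequency
        wavelength = 2 * math.pi / theta
        r = orig_ctx / wavelength            # rotations over the original context
        # ramp: 1 = keep original frequency, 0 = fully interpolate
        gamma = min(1.0, max(0.0, (r - alpha) / (beta - alpha)))
        freqs.append(theta * gamma + (theta / scale) * (1.0 - gamma))
    return freqs

f = yarn_freqs()
# Highest-frequency dim is untouched; lowest-frequency dim is fully scaled:
assert f[0] == 1.0
assert abs(f[-1] - (10000.0 ** (-126 / 128)) / 4.0) < 1e-12
```

Because the per-dimension treatment only changes the frequency table, it composes with a dynamic trigger the same way linear interpolation does: the scale fed into it can itself be recomputed from the current context length.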