grimulkan
I didn't change the command line, but you mean something might have overridden it? I'll reload the model in the UI and check. The startup message showed the same model...
Common to both runs:

```
Command Line: python server.py --chat --cpu-memory 200GiB --auto-devices --listen-port 6565 --wbits 4 --groupsize 128 --model vicuna-13b-4bit-128g --model_type LLaMA
Prompt: Write a paragraph about the state...
```
Just tried with this command line (without `--auto-devices` or the CPU memory flag): `python server.py --chat --listen-port 6565 --wbits 4 --groupsize 128 --model vicuna-13b-4bit-128g --model_type LLaMA` Same results as before, unfortunately...
No, actually you were correct. Somehow it managed to put the model on multiple GPUs. Forcing CUDA_VISIBLE_DEVICES=0 got it to work. On newer commit: `Output generated in 11.20 seconds (11.61 tokens/s,...
Confirmed: it works fine on the latest commit as long as I set CUDA_VISIBLE_DEVICES=0. Even with my original command line. Guess I'll just manually enable CUDA devices & control the...
You can do it in the batch file in Windows that launches the web ui. After you call activate.bat, set CUDA_VISIBLE_DEVICES=0 (export instead of set in Linux/WSL)
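If editing the launcher script is awkward, the same pinning can also be done from a small Python wrapper. A minimal sketch (the wrapper filename and argument list are just illustrative, taken from the command line above):

```python
import os
import sys

# Hypothetical wrapper (launch.py): pin the web UI to GPU 0 before it starts.
# Equivalent to `set CUDA_VISIBLE_DEVICES=0` in the Windows batch file, or
# `export CUDA_VISIBLE_DEVICES=0` in Linux/WSL.
env = dict(os.environ, CUDA_VISIBLE_DEVICES="0")

cmd = [sys.executable, "server.py", "--chat", "--listen-port", "6565",
       "--wbits", "4", "--groupsize", "128",
       "--model", "vicuna-13b-4bit-128g", "--model_type", "LLaMA"]

# import subprocess
# subprocess.run(cmd, env=env)  # uncomment to actually launch the UI
```

The key point either way is that the variable must be set before CUDA enumerates devices, i.e. before the server process (and torch) starts.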
Adding to that (related question): it looks like the webUI actually inputs the following format in instruct mode (slightly different from my case examples in that the extra prompt is part...
I am not sure we need them to be dynamic. YaRN works both ways? The static version I described above still computes the positional table once at the start, just...
By ‘dynamic’, the paper means something that changes the RoPE scaling depending on the actual context size (it only compresses positions once the context exceeds the original pre-trained size). This is optional. They have...
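To make the static/dynamic distinction concrete, a minimal sketch of the dynamic variant (function and parameter names are mine, not the paper's): the scale factor is recomputed from the running sequence length, and stays at 1.0 until the context exceeds the pre-trained window, so short contexts are left untouched.

```python
def dynamic_scale(seq_len: int, pretrained_len: int = 2048) -> float:
    """Dynamic RoPE scaling: only compress positions once the running
    context is longer than the original pre-training window."""
    return max(1.0, seq_len / pretrained_len)

# Within the original window nothing changes:
assert dynamic_scale(1024) == 1.0
# Beyond it, positions are compressed by seq_len / pretrained_len:
assert dynamic_scale(4096) == 2.0
```

The static version instead bakes a fixed scale into the positional table once at load time, which is why it also compresses short contexts.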
I see. Does that also mess with methods that change the position embeddings by hidden dimension like YaRN?
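For reference, the "by hidden dimension" part can be sketched like this, in the spirit of YaRN (this is a hedged sketch, not the reference implementation; `alpha`/`beta` follow the paper's LLaMA defaults, the rest of the names are mine): high-frequency RoPE dimensions are kept as-is, low-frequency ones are interpolated (divided by the scale), with a linear ramp in between.

```python
import math

def yarn_freqs(dim=128, base=10000.0, scale=4.0, orig_ctx=2048,
               alpha=1.0, beta=32.0):
    """Per-dimension RoPE frequency adjustment in the spirit of YaRN."""
    freqs = []
    for i in range(0, dim, 2):
        theta = base ** (-i / dim)           # standard RoPE frequency
        wavelength = 2 * math.pi / theta
        r = orig_ctx / wavelength            # rotations over the original context
        # ramp: 1 = keep original frequency, 0 = fully interpolate
        gamma = min(1.0, max(0.0, (r - alpha) / (beta - alpha)))
        freqs.append(theta * gamma + (theta / scale) * (1.0 - gamma))
    return freqs

f = yarn_freqs()
# Highest-frequency dim is untouched; lowest-frequency dim is fully scaled:
assert f[0] == 1.0
assert abs(f[-1] - (10000.0 ** (-126 / 128)) / 4.0) < 1e-12
```

Because the per-dimension treatment only changes the frequency table, it composes with a dynamic trigger the same way linear interpolation does: the scale fed into it can itself be recomputed from the current context length.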