cloud11665
Also doing formatting, as it's easier to compare PTX to CUDA backend code. Color key: orange - instruction or immediate value; purple - state space; blue - operand or identifier; green -...
I will post llama timing benchmarks soon.
Looks like https://bangplayer.live is down
Repro steps (on a 2x4090 machine): `CUDA_VISIBLE_DEVICES=1 NV=1 DEBUG=1 python3 -m examples.hlb_cifar10` -> gpu0 gets loaded; `CUDA_VISIBLE_DEVICES=1 CUDA=1 DEBUG=1 python3 -m examples.hlb_cifar10` -> gpu1 gets loaded
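For reference, the behavior the repro expects can be sketched in plain Python: with `CUDA_VISIBLE_DEVICES=1`, logical device 0 should resolve to physical GPU 1. The helper name `physical_device` is hypothetical and not part of any backend; it only handles comma-separated integer indices (not GPU UUIDs) and is meant as an illustration of the mapping, not of how either backend implements it.

```python
def physical_device(visible_devices: str, logical_index: int) -> int:
    """Map a logical device index to a physical GPU index.

    Hypothetical helper: assumes CUDA_VISIBLE_DEVICES holds only
    comma-separated integer indices (UUID entries are not handled).
    """
    ids = [int(tok) for tok in visible_devices.split(",") if tok.strip()]
    return ids[logical_index]


# With CUDA_VISIBLE_DEVICES=1, the only visible device (logical 0)
# is physical GPU 1, which is what the CUDA=1 run in the repro shows.
print(physical_device("1", 0))
```

A backend that loads gpu0 in this situation is presumably indexing physical devices directly instead of going through this mapping.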
### What happened?
The limit is respected when requesting a chat completion, but for non-chat ones, the model keeps generating tokens forever (until ctx-len is reached). With non-streaming there is...
Output of `corefreq-cli -k -n -B -n -M`:
```
Linux:
|- Release [6.8.0-57-generic]
|- Version [#59-Ubuntu SMP PREEMPT_DYNAMIC Sat Mar 15 17:40:59 UTC 2025]
|- Machine [x86_64]
Memory:
|- Total...
```