dvoidus

11 comments by dvoidus

I think this is due to an API change; the seed param is missing. Replace the response call with:
```
response = requests.post(f"http://{ai_server_ip}:{ai_server_port}/run/textgen", json={
    "data": [
        prompt,
        params['max_new_tokens'],
        params['do_sample'],
        params['temperature'],
        params['top_p'],
        params['typical_p'],
        params['repetition_penalty'],
        ...
```
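For anyone hitting the same error, here is a self-contained sketch of the idea. The endpoint and param names come from the snippet above; the parameter order, the default values, and where the seed goes are assumptions and must match whatever the server's current API actually expects:

```python
import requests

ai_server_ip = "127.0.0.1"  # assumption: adjust to your server
ai_server_port = 7860       # assumption: default Gradio port

params = {
    'max_new_tokens': 200,
    'do_sample': True,
    'temperature': 0.7,
    'top_p': 0.9,
    'typical_p': 1.0,
    'repetition_penalty': 1.1,
    'seed': -1,  # the newly required seed param; -1 usually means random
}

prompt = "Hello"

# The positional "data" list must match the server's current API
# signature exactly; the order below is illustrative, not authoritative.
response = requests.post(
    f"http://{ai_server_ip}:{ai_server_port}/run/textgen",
    json={
        "data": [
            prompt,
            params['max_new_tokens'],
            params['do_sample'],
            params['temperature'],
            params['top_p'],
            params['typical_p'],
            params['repetition_penalty'],
            params['seed'],
        ]
    },
)
print(response.json())
```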

Commit 1c2513 is fine and keeps generation at 34 t/s; at f97561 I already see a drop to 30 t/s.

```
config.matmul_recons_thd = 8
config.fused_mlp_thd = 0
config.sdp_thd = 8
```
This still runs at 25 t/s on the latest commit. Experimenting with `block_size_z` doesn't really make any difference (tried increasing it...
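For context, a minimal sketch of where these thresholds live, assuming exllama's `ExLlamaConfig`/`ExLlama` classes from model.py; the model paths are placeholders:

```python
from model import ExLlama, ExLlamaConfig  # assumption: repo's model.py on the path

config = ExLlamaConfig("/models/llama/config.json")       # placeholder path
config.model_path = "/models/llama/model.safetensors"     # placeholder path

# The tuning thresholds compared above; names as they appear in model.py,
# exact semantics are defined there.
config.matmul_recons_thd = 8
config.fused_mlp_thd = 0
config.sdp_thd = 8

model = ExLlama(config)
```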

I also checked on my 4090; I'm getting a stable 38 t/s on the latest commit.

btw, isn't this a typo in model.py?
```
# Tuning
self.matmul_recons_thd = 8
self.fused_mlp_thd = 2
self.stp_thd = 8
```
It should be self.**sdp_thd**.

I did a profile run (on H100) in case it could give you some hints:
```
----------------------------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------
                              Name      Self CPU...
```
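For reference, a table like the one above can be produced with `torch.profiler`; a minimal sketch, where `generation_step` is a hypothetical stand-in for the actual model forward pass being profiled:

```python
import torch
from torch.profiler import profile, ProfilerActivity

def generation_step():
    # Placeholder workload; replace with the real generation call.
    a = torch.randn(4096, 4096, device="cuda", dtype=torch.float16)
    b = torch.randn(4096, 4096, device="cuda", dtype=torch.float16)
    return a @ b

with profile(activities=[ProfilerActivity.CPU, ProfilerActivity.CUDA]) as prof:
    for _ in range(10):
        generation_step()
    torch.cuda.synchronize()

# Prints the same kind of table as above (Self CPU, CPU total, Self CUDA, ...)
print(prof.key_averages().table(sort_by="cuda_time_total", row_limit=20))
```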

> What cuda and pytorch version is this? Lots of the ops look very slow on a per call basis.

2.0.1+cu118

```
-----------------------------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------
                               Name    Self CPU %      Self CPU   CPU total %     CPU total  CPU time avg     Self CUDA     Self CUDA...
```

@turboderp I did some extra profiling using the `nvidia-smi dmon --gpm-metrics` flag, and you can clearly see the difference in utilisation between the latest commit (401fa8) and commit 1c2513 (which has a...
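A sketch of capturing utilisation samples the same way from Python, for anyone who wants to reproduce the comparison. It uses the basic `nvidia-smi dmon -s u` utilisation group rather than `--gpm-metrics` (GPM needs a recent driver and a Hopper-class GPU), so the columns will differ from my run:

```python
import subprocess

# `-s u` selects the utilisation metric group, `-c 30` stops after 30
# one-second samples. Run this while generation is in progress.
out = subprocess.run(
    ["nvidia-smi", "dmon", "-s", "u", "-c", "30"],
    capture_output=True, text=True, check=True,
).stdout
print(out)
```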

I tried, but I still see the same speeds (27 t/s in the worst case). The only reproducible difference is that context inference speed increased from 3500 t/s to 4500 t/s. For the...