exo
Enable strict mode in configure_mlx.sh
Adding strict mode.
Also, @AlexCheema, I didn't want to open an issue just to comment on this, so I'm putting it here. On my POV benchmark (M4 Max, 128 GB RAM), running the transformer pipeline on meta-llama-3-8b-Instruct with 15 input tokens and stopping inference at 101 tokens, I'm seeing the following (ttft: time to first token, ts: tokens per second):
baseline:
- run 1: ttft 0.32s, ts 20.30 t/s
- run 2: ttft 0.31s, ts 20.43 t/s
- run 3: ttft 0.28s, ts 20.56 t/s
- run 4: ttft 0.28s, ts 20.54 t/s
- run 5: ttft 0.28s, ts 20.58 t/s
with this script:
- run 1: ttft 0.19s, ts 20.86 t/s
- run 2: ttft 0.18s, ts 20.80 t/s
- run 3: ttft 0.18s, ts 20.92 t/s
- run 4: ttft 0.18s, ts 20.88 t/s
- run 5: ttft 0.17s, ts 20.87 t/s
That's pretty cool. Thanks for this script :)
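For anyone who wants the two sets of runs condensed into a single comparison, here is a small sketch that averages the numbers listed above and reports the relative change. The values are copied from the runs in this comment; the variable and function names (`baseline`, `with_script`, `summarize`, `pct_change`) are only illustrative and not part of exo.

```python
from statistics import mean

# Measurements copied from the runs listed above (5 runs each).
baseline = {
    "ttft": [0.32, 0.31, 0.28, 0.28, 0.28],       # seconds
    "ts": [20.30, 20.43, 20.56, 20.54, 20.58],    # tokens per second
}
with_script = {
    "ttft": [0.19, 0.18, 0.18, 0.18, 0.17],
    "ts": [20.86, 20.80, 20.92, 20.88, 20.87],
}


def summarize(runs):
    """Return the mean of each metric across runs."""
    return {metric: mean(values) for metric, values in runs.items()}


def pct_change(before, after):
    """Relative change from `before` to `after`, in percent."""
    return (after - before) / before * 100


base = summarize(baseline)
new = summarize(with_script)

for metric in ("ttft", "ts"):
    change = pct_change(base[metric], new[metric])
    print(f"{metric}: {base[metric]:.3f} -> {new[metric]:.3f} ({change:+.1f}%)")
```

On these five runs the mean ttft drops from about 0.294s to 0.180s (roughly 39% lower), while mean throughput goes from about 20.48 t/s to 20.87 t/s (roughly 2% higher). With only five runs per configuration, treat these as indicative rather than statistically rigorous.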