Brian Keene

Results 4 comments of Brian Keene
trafficstars

Marking as draft, currently working through some numerical issues via separate workflow, and will add CPU side bindings + dispatch, test & docs - sharing a current status and will...

Hi folks, Attaching some graphs for measured latency on M3 Max and some estimated memory savings per attention block (empirically observed at several data points, graph here obtained via formulas)

Some room for improvement on larger sequences re: latency, with a divergence after ~2300 sequence length, though the memory savings exceeds 1GB ~2k, and is approaching 5GB at 4250 sequence...

Updated with the requested changes, thank you for the prompt review!