Brian Keene comments


Results: 4 comments


Brian Keene

trafficstars

Metal shaders for efficient self attention on large sequences

Marking as draft; currently working through some numerical issues via a separate workflow, and will add CPU-side bindings + dispatch, tests & docs. Sharing the current status and will...
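For context on how a kernel can compute self attention without materializing the full sequence-length-squared score matrix, here is a minimal pure-Python sketch of the online-softmax (FlashAttention-style) tiling technique. It is an assumption that the Metal shaders follow this exact scheme; the function names and the `block` parameter are hypothetical, and this is an illustration of the idea rather than a port of the shaders.

```python
import math
import random


def _dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def naive_attention(q, k, v):
    """Reference softmax(q k^T / sqrt(d)) v that builds the full N x N score matrix."""
    scale = 1.0 / math.sqrt(len(q[0]))
    scores = [[scale * _dot(qi, kj) for kj in k] for qi in q]  # N x N, O(N^2) memory
    out = []
    for row in scores:
        m = max(row)
        w = [math.exp(s - m) for s in row]
        z = sum(w)
        out.append([sum(w[j] * v[j][c] for j in range(len(v))) / z
                    for c in range(len(v[0]))])
    return out


def tiled_attention(q, k, v, block=2):
    """Same result, but keys/values are visited in tiles of `block` rows and only
    a running max, running denominator and output accumulator are kept per query."""
    scale = 1.0 / math.sqrt(len(q[0]))
    dv = len(v[0])
    out = []
    for qi in q:
        m = -math.inf     # running max of the scores seen so far
        l = 0.0           # running sum of exp(score - m)
        acc = [0.0] * dv  # running (unnormalized) weighted sum of value rows
        for start in range(0, len(k), block):
            kb, vb = k[start:start + block], v[start:start + block]
            s = [scale * _dot(qi, kj) for kj in kb]
            m_new = max(m, max(s))
            corr = math.exp(m - m_new)  # rescales earlier tiles; 0.0 on the first tile
            p = [math.exp(x - m_new) for x in s]
            l = l * corr + sum(p)
            acc = [a * corr + sum(pj * vj[c] for pj, vj in zip(p, vb))
                   for c, a in enumerate(acc)]
            m = m_new
        out.append([a / l for a in acc])
    return out


if __name__ == "__main__":
    random.seed(0)
    n, d = 7, 4
    q, k, v = ([[random.uniform(-1, 1) for _ in range(d)] for _ in range(n)]
               for _ in range(3))
    ref, tiled = naive_attention(q, k, v), tiled_attention(q, k, v, block=3)
    # Largest absolute difference between the two results (should be ~1e-16).
    print(max(abs(a - b) for ra, rb in zip(ref, tiled) for a, b in zip(ra, rb)))
```

Only `m`, `l` and `acc` persist across tiles, so the extra memory per query scales with `block` rather than with the sequence length. A real GPU kernel would presumably also tile the queries and keep tiles in on-chip memory, which this sketch does not model.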

Metal shaders for efficient self attention on large sequences

Hi folks, attaching some graphs of measured latency on an M3 Max and of estimated memory savings per attention block (the savings were observed empirically at several data points; the graph here was obtained via formulas).

Metal shaders for efficient self attention on large sequences

There is some room for improvement in latency on larger sequences, with a divergence after a sequence length of ~2300, though the memory savings exceed 1 GB at ~2k and are approaching 5 GB at 4250 sequence...
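A back-of-the-envelope sketch of why the savings grow so quickly with sequence length: assuming most of the savings come from never materializing the full attention score matrix, that matrix alone costs batch * heads * N^2 * bytes-per-element, so it grows quadratically in the sequence length N. The helper names and defaults (fp32, one head, batch size 1) are hypothetical; the actual formulas behind the graph are not given here, and this sketch does not try to reproduce its numbers.

```python
def score_matrix_bytes(seq_len: int, num_heads: int = 1, batch: int = 1,
                       bytes_per_elem: int = 4) -> int:
    """Bytes needed to hold the full (seq_len x seq_len) attention score matrix
    for every head and batch entry. Hypothetical helper; defaults assume fp32,
    a single head and batch size 1."""
    return batch * num_heads * seq_len * seq_len * bytes_per_elem


def to_gib(num_bytes: int) -> float:
    """Convert a byte count to GiB (2**30 bytes)."""
    return num_bytes / 2**30


if __name__ == "__main__":
    for n in (1024, 2048, 4096):
        # Doubling the sequence length quadruples the score-matrix footprint.
        print(n, f"{to_gib(score_matrix_bytes(n, num_heads=32)):.3f} GiB")
```

This ignores softmax intermediates and any per-tile buffers an efficient kernel still allocates, so it is at best an upper-bound style estimate of what avoiding the full matrix can save.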

Metal shaders for efficient self attention on large sequences

Updated with the requested changes, thank you for the prompt review!