FBGEMM
reduced grid size for insert kernel (training)
Summary:
The lru_cache_insert kernel is latency bound, so its runtime is not sensitive to the number of SMs it occupies. Using fewer SMs avoids a structural hazard with work on the main training stream. https://docs.google.com/document/d/1p3Id8HfVMfyFn4ZcL4e79Rl0ktTSevnW3jXm9PTy0ys/edit#bookmark=id.lyjw9rtmebv0
Since the performance-optimized config uses pipelining, this diff reduces the number of SMs (by capping the grid size) regardless of the pipelining scheme.
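The idea above can be sketched as capping the launch grid at a fixed block budget instead of sizing it to the input. A minimal sketch (the helper name, the cap value, and the grid-stride assumption are illustrative, not FBGEMM's actual code):

```python
def capped_grid_size(num_elements: int, threads_per_block: int, max_blocks: int) -> int:
    """Hypothetical helper: compute a launch grid capped at max_blocks.

    A latency-bound kernel gains little from spanning every SM, so
    capping the grid leaves SMs free for concurrent kernels on the
    main training stream. The kernel is assumed to use a grid-stride
    loop, so a smaller grid still covers all num_elements.
    """
    blocks_needed = (num_elements + threads_per_block - 1) // threads_per_block
    return min(blocks_needed, max_blocks)
```

With a grid-stride loop in the kernel body, correctness is unchanged; only the degree of SM occupancy drops.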
Reviewed By: jspark1105, q10
Differential Revision: D47781958
Deploy Preview for pytorch-fbgemm-docs canceled.
| Name | Link |
|---|---|
| Latest commit | f8948122edd86ed800d508eeef036f650c2ae5d1 |
| Latest deploy log | https://app.netlify.com/sites/pytorch-fbgemm-docs/deploys/64cd60d14cc4d70008182fd5 |
This pull request was exported from Phabricator. Differential Revision: D47781958