nntrainer
restructure the multi-head attention layer
We can optimize the memory consumption of the multi-head attention layer by recomposing it from a combination of layers. Doing so could reduce memory further:
- compute the attention heads one by one (see the sketch after this list).
- re-implement the multi-head attention layer as a backbone layer.
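
As a rough illustration of the first idea, here is a minimal NumPy sketch (not nntrainer code; all names and shapes are assumptions) showing that computing attention one head at a time keeps only a single `(seq_len, seq_len)` score matrix alive, instead of `num_heads` of them at once:

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def mha_head_by_head(q, k, v, num_heads):
    """q, k, v: (seq_len, d_model); d_model must be divisible by num_heads."""
    seq_len, d_model = q.shape
    d_head = d_model // num_heads
    out = np.empty((seq_len, d_model))
    for h in range(num_heads):
        s = slice(h * d_head, (h + 1) * d_head)
        # Only one (seq_len, seq_len) score matrix is live per iteration,
        # instead of num_heads of them in a fully batched implementation.
        scores = softmax(q[:, s] @ k[:, s].T / np.sqrt(d_head))
        out[:, s] = scores @ v[:, s]
    return out
```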
To-do list for option 1:
- [ ] Enhance the split layer to split the input by a given number (the number of heads). #2025
- [ ] Replace the multi-head attention layer by building it as a sub-graph of existing layers (see the sketch below this list).
- [ ] Compare the peak memory consumption and latency before and after the changes
- [ ] Compare the peak memory consumption and latency before and after enabling the swap feature
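
To make the sub-graph idea concrete, here is a hypothetical sketch of the first two items (again plain NumPy, not nntrainer's layer API): the enhanced split layer cuts the feature axis into `num_heads` slices, a single-head attention runs per slice, and a concatenation joins the results:

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def single_head_attention(q, k, v):
    # Scaled dot-product attention on one head's slice.
    return softmax(q @ k.T / np.sqrt(q.shape[-1])) @ v

def mha_as_subgraph(q, k, v, num_heads):
    # "Split layer": cut the feature axis into num_heads slices (first item).
    qs = np.split(q, num_heads, axis=-1)
    ks = np.split(k, num_heads, axis=-1)
    vs = np.split(v, num_heads, axis=-1)
    # One single-head attention per slice, then concat (second item).
    heads = [single_head_attention(*hqkv) for hqkv in zip(qs, ks, vs)]
    return np.concatenate(heads, axis=-1)
```

Both formulations should produce the same output as the monolithic layer; the difference lies in peak intermediate memory and latency, which the last two checklist items would quantify.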