Chen Jie
Results
The first graph compares runs with and without FlashAttention-2 (FA2). The loss appears essentially unchanged with FA2 enabled (yellowish curve).
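One way to put a number on "doesn't change much" is to measure the largest relative gap between the two loss curves. The following is only a minimal sketch: the function name `max_relative_gap` is hypothetical, it assumes both runs logged the loss at the same steps, and the sample values are placeholders, not numbers taken from the graph.

```python
def max_relative_gap(baseline, candidate, eps=1e-12):
    """Return the largest relative difference between two loss curves.

    Both curves are assumed to be logged at the same steps, so they
    must have the same length. `eps` guards against division by zero.
    """
    if len(baseline) != len(candidate):
        raise ValueError("curves must be logged at the same steps")
    if not baseline:
        raise ValueError("curves must not be empty")
    return max(
        abs(b - c) / max(abs(b), eps)
        for b, c in zip(baseline, candidate)
    )


if __name__ == "__main__":
    # Placeholder values for illustration only, not taken from the runs above.
    without_fa2 = [4.0, 2.0, 1.0]
    with_fa2 = [4.0, 2.5, 1.0]
    print(max_relative_gap(without_fa2, with_fa2))
```

A small value over the whole run would support the visual impression that the two curves overlap. What counts as "small" depends on the run-to-run noise of the setup, which is not stated here.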