Chen Jie

The first graph compares training with and without FlashAttention-2 (FA2). The loss appears largely unchanged with FA2 enabled (the yellowish curve).
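To put a number on how close the two curves are, rather than judging by eye, something like the sketch below could be run on the logged losses of both runs. The function names and the 0.05 default tolerance are assumptions made for illustration; they do not come from the training code behind the graph.

```python
def max_loss_gap(baseline, candidate):
    """Return the largest absolute per-step difference between two loss curves.

    Both arguments are sequences of loss values logged at the same steps.
    """
    if len(baseline) != len(candidate):
        raise ValueError("loss curves must cover the same logging steps")
    if not baseline:
        raise ValueError("loss curves must not be empty")
    return max(abs(b - c) for b, c in zip(baseline, candidate))


def curves_match(baseline, candidate, tolerance=0.05):
    """Return True if the two curves never differ by more than `tolerance`.

    The default tolerance is an arbitrary placeholder, not a recommended value.
    """
    return max_loss_gap(baseline, candidate) <= tolerance
```

Passing the per-step losses of the run without FA2 as `baseline` and the run with FA2 as `candidate` gives a single worst-case gap. What counts as an acceptable gap depends on the run-to-run noise of the setup, so the tolerance should be chosen from a repeat of the baseline run where possible.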