Elad Dvash
Results
1
issues of
Elad Dvash
Is gradient accumulation still mathematically possible in this scheme? Because if so we could train 65b on a single 3090 in a day and a half