Elad Dvash

Results 1 issues of Elad Dvash

Is gradient accumulation still mathematically possible in this scheme? Because if so we could train 65b on a single 3090 in a day and a half