torchft
make torchft work for llama3_8b 8x
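As titled.
The streaming transfer is much faster than the baseline: about 30.5 s drops to about 2.4 s with 20 chunks in the test below.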
Test plan:
Testing with 12 GB of 64 MB tensors
baseline: took 30.493701454252005 seconds
With streaming transfer:
0 chunks: took 8.783997897058725 seconds
10 chunks: took 2.8615125976502895 seconds
20 chunks: took 2.433052882552147 seconds
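
For a quick read of these numbers, here is a small sketch that turns the timings into speedups relative to the baseline. The timings are copied from the test plan; the `speedups` helper and the dictionary labels are hypothetical names used only for illustration, and the tensor count assumes binary units (12 GiB of 64 MiB tensors), which the test plan does not state.

```python
# Timings copied from the test plan (seconds). Labels are illustrative only.
timings = {
    "baseline": 30.493701454252005,
    "streaming, 0 chunks": 8.783997897058725,
    "streaming, 10 chunks": 2.8615125976502895,
    "streaming, 20 chunks": 2.433052882552147,
}


def speedups(timings, reference="baseline"):
    """Return each run's speedup relative to the reference run."""
    ref = timings[reference]
    return {name: ref / seconds for name, seconds in timings.items()}


# Assumption: binary units, so 12 GiB / 64 MiB tensors.
NUM_TENSORS = (12 * 1024) // 64

if __name__ == "__main__":
    print(f"tensors in the test payload (assumed): {NUM_TENSORS}")
    for name, factor in speedups(timings).items():
        print(f"{name}: {timings[name]:.2f} s, {factor:.1f}x vs baseline")
```

Most of the gain comes from enabling streaming and going to 10 chunks; going from 10 to 20 chunks saves roughly another 0.4 s in this test.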