Reduce disk usage for mixtral tests
This PR changes the Mixtral parameter-conversion tests to read the checkpoint through gcsfuse instead of from local disk, which lowers VM disk usage.
Also just a high-level question - @rdyro did you notice any slowdown using gcsfuse compared to a local copy of the weights?
I did not check quantitatively, but qualitatively, there wasn't much of a difference. The loading of the existing checkpoint to be converted takes less than 10% of the total run, even after moving to gcsfuse.
If it's important that we measure this (e.g., for use elsewhere), let me know.
Correction: the new strategy via gcsfuse is equivalent to a single-threaded download, which on a high-end VM averages around 80-100 MB/s. That implies a 45-60 min download for a 280 GB checkpoint (8x22b).
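For reference, the estimate above can be reproduced with a quick back-of-the-envelope calculation. This is only a sketch: `download_minutes` is a hypothetical helper (not part of the test code), and it assumes decimal units (1 GB = 1000 MB) and a constant throughput.

```python
def download_minutes(size_gb: float, throughput_mb_s: float) -> float:
    """Estimated transfer time in minutes, assuming 1 GB = 1000 MB."""
    return size_gb * 1000 / throughput_mb_s / 60


checkpoint_gb = 280  # 8x22b checkpoint size quoted above

# Throughput bounds quoted above for a single-threaded download
for mb_s in (80, 100):
    print(f"{mb_s} MB/s -> {download_minutes(checkpoint_gb, mb_s):.0f} min")
```

This gives roughly 58 min at 80 MB/s and 47 min at 100 MB/s, consistent with the 45-60 min range.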
This test was also sometimes (?) running out of RAM without weight_dtype=bfloat16, so I'm adding that change to this PR.