Train in multi-node multi-card environment
Can I use tevatron to train models in a multi-node, multi-card environment? If yes, could you please give script examples showing how to start the job? Thank you.
Hi @Atlantic8, for the PyTorch implementation, unfortunately we haven't had a chance to run and test it in a multi-node environment yet.
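If you still want to experiment with multi-node PyTorch, the usual route is torchrun. The sketch below is untested with tevatron: the module path and training flags are taken from the standard single-node recipe (including --negatives_x_device for sharing in-batch negatives across processes) and may differ in your version, so treat it as a starting point rather than a supported command.
# Run this on every node. NODE_RANK is 0 on the master node and 1..N-1
# on the others; MASTER_ADDR is the IP/hostname of the rank-0 node.
# Entry point and flags are assumptions from the single-node recipe,
# untested on multi-node -- adjust to your tevatron version.
torchrun \
  --nnodes=2 \
  --nproc_per_node=8 \
  --node_rank=$NODE_RANK \
  --master_addr=$MASTER_ADDR \
  --master_port=29500 \
  -m tevatron.driver.train \
  --output_dir model_msmarco \
  --model_name_or_path bert-base-uncased \
  --dataset_name Tevatron/msmarco-passage \
  --fp16 \
  --per_device_train_batch_size 8 \
  --learning_rate 5e-6 \
  --num_train_epochs 3 \
  --negatives_x_device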
On top of that, for JAX it really depends on your cluster. For Cloud TPU:
gcloud compute tpus tpu-vm ssh YOUR_TPU_NAME \
--zone=us-central2-b \
--worker=all \
--command="python -m tevatron.tevax.experimental.mp.train ..."
Adjust this to whatever launch mechanism fits your cluster configuration.
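For instance, on a SLURM GPU cluster the same idea might look like the following (purely a sketch: the resource flags are standard SLURM, but whether multi-host JAX initializes correctly for this script on your hardware is something you'd need to verify):
# One process per node, each seeing all local GPUs; pass the same
# training arguments you would use on a single host.
srun --nodes=4 --ntasks-per-node=1 --gpus-per-node=8 \
  python -m tevatron.tevax.experimental.mp.train ...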