
Train in multi-node multi-card environment

Open Atlantic8 opened this issue 10 months ago • 2 comments

Can I use tevatron to train models in a multi-node, multi-card environment? If so, could you please share example scripts showing how to launch the job? Thank you.

Atlantic8 avatar Mar 27 '24 08:03 Atlantic8

Hi @Atlantic8, for the PyTorch implementation, unfortunately we haven't had a chance to run and test it in a multi-node environment yet.

MXueguang avatar Mar 27 '24 14:03 MXueguang

To add to that: for JAX, it really depends on your cluster. For Cloud TPU, you can run the same command on every worker:

gcloud compute tpus tpu-vm ssh YOUR_TPU_NAME \
    --zone=us-central2-b \
    --worker=all \
    --command="python -m tevatron.tevax.experimental.mp.train ..."

Adjust it to use the launch script that fits your cluster config.
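
If you launch from a script rather than by hand, the command above can be assembled programmatically. This is only a minimal sketch, not something tevatron ships: the helper name `build_tpu_launch_command` is hypothetical, and the training arguments are placeholders standing in for the elided `...`, which you would replace with your own flags.

```python
import shlex


def build_tpu_launch_command(tpu_name, zone, train_args,
                             module="tevatron.tevax.experimental.mp.train"):
    """Build the gcloud argv that runs one training command on all TPU VM workers.

    Hypothetical helper for illustration; train_args is whatever your run needs.
    """
    # Quote the remote command so arguments containing spaces survive the SSH hop.
    remote_command = shlex.join(["python", "-m", module, *train_args])
    return [
        "gcloud", "compute", "tpus", "tpu-vm", "ssh", tpu_name,
        f"--zone={zone}",
        "--worker=all",  # run on every host of the TPU slice
        f"--command={remote_command}",
    ]


if __name__ == "__main__":
    # Placeholder flag and value: substitute your real training arguments.
    argv = build_tpu_launch_command(
        "YOUR_TPU_NAME", "us-central2-b", ["--placeholder_flag", "value"]
    )
    print(shlex.join(argv))
```

The returned list can be passed to `subprocess.run(argv, check=True)`. On other clusters the `gcloud ... ssh` part would be swapped for whatever launcher your scheduler provides.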

luyug avatar Mar 28 '24 14:03 luyug