
Compatible with local 8xH100 instead of cloud?

Open · michaellin99999 opened this issue on Mar 24, 2024 · 4 comments

Hello. I have access to a local 8x H100 GPU cluster and want to try the TinyLlama pretraining tutorial. Is this supported, or do I have to use cloud GPUs?

Thanks

michaellin99999 · Mar 24, 2024

If you have all the dependencies installed, that should be supported. You can check out the tutorials/pretrain_tinyllama.md tutorial in this repo. Let us know what results you get, I'd be curious.
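For a local single-node run, the workflow from the tutorial boils down to installing the package, preparing the data, and launching the pretraining script. The commands below are a rough sketch only; the subcommand names and flags shown here are assumptions, so check tutorials/pretrain_tinyllama.md in the repo for the exact, current invocation.

```shell
# Sketch of a local single-node launch (flag names are placeholders --
# see tutorials/pretrain_tinyllama.md for the verified commands).

# 1. Install litgpt with its optional dependencies.
pip install 'litgpt[all]'

# 2. Prepare the pretraining dataset as described in the tutorial
#    (tokenization + data download steps omitted here).

# 3. Launch pretraining. On a single node, all visible GPUs are used
#    by default; --devices is shown only for explicitness.
litgpt pretrain --devices 8 --config config_hub/pretrain/tinyllama.yaml
```

No cloud setup is involved: as long as the dependencies install cleanly and the data is on local disk, the same commands run on a local 8xH100 box.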

rasbt · Mar 24, 2024

Is this the documentation that can help me set this up? https://lightning.ai/docs/pytorch/stable/clouds/cluster_expert.html

or is there any other documentation suggesting how?

michaellin99999 · Mar 25, 2024

@michaellin99999 On a single H100 node you don't need to set anything up. You can just run the script (provided you followed the tutorial preparation steps) and it will use all GPUs by default.

If you have a cluster of multiple H100 nodes, the steps will depend on your cluster setup. Most likely you have SLURM. Then follow the SLURM guide here: https://lightning.ai/docs/fabric/stable/fundamentals/launch.html#launch-on-a-cluster otherwise follow the "bare bones cluster" guide on that same page.
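For the multi-node SLURM case, the launch pattern from the Fabric guide linked above is: request one task per GPU in the batch script and let `srun` start the processes, since Fabric picks up the process-group wiring from the SLURM environment variables. This is a minimal sketch; the partition, time limit, script name, and node count are placeholders for your cluster, not values from the tutorial.

```shell
#!/bin/bash
# Hypothetical SLURM batch script for a 2-node x 8-GPU pretraining run.
# Adjust nodes, partition, time, and the training command for your setup.
#SBATCH --nodes=2
#SBATCH --ntasks-per-node=8   # one task (process) per GPU
#SBATCH --gres=gpu:8
#SBATCH --time=24:00:00

# srun launches one process per task; Fabric reads SLURM's environment
# variables (rank, world size, node list) to set up distributed training.
srun python pretrain_script.py
```

The key point is that you don't launch the processes yourself: `srun` does, and the number of tasks it starts must match the devices-per-node times node count that the training script expects.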

awaelchli · Mar 26, 2024

Thank you!

michaellin99999 · Mar 27, 2024