
Is Tesla M60 slower than M1?

williambrach opened this issue 1 year ago · 0 comments

Hi,

I'm experimenting with fine-tuning nanoGPT and I have a question about performance. I have a MacBook Pro with an M1 chip and an Azure Compute instance with a Tesla M60. I've run the fine-tuning process three times, and the M1 is consistently much faster than the Tesla M60. Why is this the case?

Here are the commands I'm using:

M1:

python train.py config/finetune_shakespeare.py --device=mps --compile=False

Overriding config with config/finetune_shakespeare.py:
import time

out_dir = "out-shakespeare"
eval_interval = 5
eval_iters = 40
wandb_log = False  # feel free to turn on
wandb_project = "shakespeare"
wandb_run_name = "ft-" + str(time.time())

dataset = "shakespeare"
init_from = "gpt2"  # this is the largest GPT-2 model

# only save checkpoints if the validation loss improves
always_save_checkpoint = False

# the number of examples per iter:
# 1 batch_size * 32 grad_accum * 1024 tokens = 32,768 tokens/iter
# shakespeare has 301,966 tokens, so 1 epoch ~= 9.2 iters
batch_size = 1
gradient_accumulation_steps = 32
max_iters = 20

# finetune at constant LR
learning_rate = 3e-5
decay_lr = False

Overriding: device = mps
Overriding: compile = False
tokens per iteration will be: 32,768
Initializing from OpenAI GPT-2 weights: gpt2
loading weights from pretrained gpt: gpt2
forcing vocab_size=50257, block_size=1024, bias=True
overriding dropout rate to 0.0
number of parameters: 123.65M
/Users/williambrach/miniforge3/envs/nanoGPT/lib/python3.8/site-packages/torch/cuda/amp/grad_scaler.py:120: UserWarning: torch.cuda.amp.GradScaler is enabled, but CUDA is not available.  Disabling.
  warnings.warn("torch.cuda.amp.GradScaler is enabled, but CUDA is not available.  Disabling.")
using fused AdamW: False
step 0: train loss 4.1862, val loss 4.0315
iter 0: loss 4.6551, time 30018.48ms, mfu -100.00%
iter 1: loss 3.6377, time 17735.89ms, mfu -100.00%
iter 2: loss 3.7499, time 17836.98ms, mfu -100.00%
iter 3: loss 3.8432, time 17833.04ms, mfu -100.00%
iter 4: loss 3.7691, time 17440.80ms, mfu -100.00%

Tesla M60:

python train.py config/finetune_shakespeare.py --device=cuda --compile=False

Overriding config with config/finetune_shakespeare.py:
import time

out_dir = "out-shakespeare"
eval_interval = 5
eval_iters = 40
wandb_log = False  # feel free to turn on
wandb_project = "shakespeare"
wandb_run_name = "ft-" + str(time.time())

dataset = "shakespeare"
init_from = "gpt2"  # this is the largest GPT-2 model

# only save checkpoints if the validation loss improves
always_save_checkpoint = False

# the number of examples per iter:
# 1 batch_size * 32 grad_accum * 1024 tokens = 32,768 tokens/iter
# shakespeare has 301,966 tokens, so 1 epoch ~= 9.2 iters
batch_size = 1
gradient_accumulation_steps = 32
max_iters = 20

# finetune at constant LR
learning_rate = 3e-5
decay_lr = False

tokens per iteration will be: 32,768
Initializing from OpenAI GPT-2 weights: gpt2
loading weights from pretrained gpt: gpt2
forcing vocab_size=50257, block_size=1024, bias=True
overriding dropout rate to 0.0
number of parameters: 123.65M
using fused AdamW: True
step 0: train loss 4.1876, val loss 4.0327
iter 0: loss 4.6578, time 95268.65ms, mfu -100.00%
iter 1: loss 3.7266, time 71167.48ms, mfu -100.00%
iter 2: loss 3.8189, time 71159.81ms, mfu -100.00%
iter 3: loss 4.0685, time 71280.82ms, mfu -100.00%
iter 4: loss 4.3062, time 71683.98ms, mfu -100.00%
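
To make the gap concrete, here's a quick sketch (my own calculation, not part of nanoGPT) that turns the logged iteration times above into tokens/sec, skipping the warm-up iter 0:

```python
# Rough throughput comparison from the iteration times in the two logs above.
tokens_per_iter = 1 * 32 * 1024  # batch_size * grad_accum_steps * block_size = 32,768

m1_ms = [17735.89, 17836.98, 17833.04, 17440.80]   # iters 1-4 on the M1 (mps)
m60_ms = [71167.48, 71159.81, 71280.82, 71683.98]  # iters 1-4 on the Tesla M60 (cuda)

def tokens_per_sec(times_ms):
    """Average tokens processed per second over the given iteration times."""
    avg_s = sum(times_ms) / len(times_ms) / 1000.0
    return tokens_per_iter / avg_s

m1_tps, m60_tps = tokens_per_sec(m1_ms), tokens_per_sec(m60_ms)
print(f"M1:  {m1_tps:.0f} tokens/s")   # ~1850 tokens/s
print(f"M60: {m60_tps:.0f} tokens/s")  # ~459 tokens/s
print(f"M1 is ~{m1_tps / m60_tps:.1f}x faster")  # ~4.0x
```

So by these numbers the M1 is roughly 4x faster per iteration, not just marginally.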

I need to run with --compile=False because of this error (CUDA version 11.4):

    raise RuntimeError(
torch._dynamo.exc.BackendCompilerFailed: backend='inductor' raised:
RuntimeError: Found Tesla M60 which is too old to be supported by the triton GPU compiler, which is used as the backend. Triton only supports devices of CUDA Capability >= 7.0, but your device is of CUDA capability 5.2

williambrach, May 01 '23 14:05