Training on M1 "MPS"

Open okpatil4u opened this issue 2 years ago • 46 comments

Most people do not have access to 8x A100 40GB systems, but a single M1 Max laptop with 64 GB of memory could host the training. How difficult would it be to port this code to "MPS"?

okpatil4u avatar Jan 08 '23 18:01 okpatil4u

I take it back. Seems like these are 8 x 40 GB systems.

There is a good paper on cramming: "Cramming: Training a Language Model on a Single GPU in One Day" (https://arxiv.org/abs/2212.14034).

I thought some work along these lines had been done here as well.

okpatil4u avatar Jan 08 '23 18:01 okpatil4u

Actually, I think this issue is great to keep open in case anyone investigates nanoGPT in the MPS context. I haven't tried it yet.

karpathy avatar Jan 08 '23 18:01 karpathy

What is the actual memory requirement? Will a Mac Studio with 128 GB of RAM be sufficient for training?

okpatil4u avatar Jan 08 '23 18:01 okpatil4u

Refining the above comment slightly: do you currently have any estimates (rough is fine) of the relative sizes of the memory footprint for just the model parameters, for the parameters plus the forward activations as a function of batch size, and for the backward graph as a function of batch size, on the 8x A100 40GB configuration? Where does memory usage peak across the server during training?

That might start to inform people about how to lay this out on the resources they have.

jwkirchenbauer avatar Jan 08 '23 19:01 jwkirchenbauer
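For a rough sense of the numbers being asked about, here is a back-of-the-envelope sketch of the static footprint (parameters, gradients, AdamW state); the GPT-2-small size and fp32 precision are assumptions for illustration, and activations, which scale with batch size and block size, are only gestured at:

# rough memory estimate for a GPT-style model; the sizes are illustrative assumptions
n_params = 124e6            # e.g. GPT-2 small (124M parameters)
bytes_per_param = 4         # fp32; roughly halve for fp16/bf16 weights

weights     = n_params * bytes_per_param        # model parameters
grads       = n_params * bytes_per_param        # gradients (same shape as the weights)
adam_states = 2 * n_params * bytes_per_param    # AdamW keeps exp_avg and exp_avg_sq

static_gb = (weights + grads + adam_states) / 1e9
print(f"params + grads + AdamW state: ~{static_gb:.1f} GB")
# forward activations and the backward graph grow roughly with
# batch_size * block_size * n_layer * n_embd and usually dominate at large batch sizes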

Also relevant for inference.

NightMachinery avatar Jan 11 '23 21:01 NightMachinery

I haven't had a chance to do any benchmarking yet, but training starts just fine on an M1 Ultra with --device=mps.

personsg avatar Jan 12 '23 01:01 personsg

I use Google Colab with the smaller model: https://github.com/acheong08/nanoGPT/tree/Colab (notebook: https://colab.research.google.com/github/acheong08/nanoGPT/blob/Colab/nanoGPT.ipynb)

acheong08 avatar Jan 12 '23 07:01 acheong08

I tried out the "I only have a MacBook" example from the README but with --device="mps", and it seems to run faster: with CPU one iteration takes roughly 100 ms, whereas with MPS it takes about 40 ms. My machine is a baseline Mac Studio.

itakafu avatar Jan 16 '23 12:01 itakafu
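For reference, the command behind that README example appears to be the one other commenters use below; the only change for Apple silicon is swapping the device flag, roughly:

python train.py \
  --dataset=shakespeare \
  --n_layer=4 --n_head=4 --n_embd=64 \
  --device=mps --compile=False \
  --eval_iters=1 --block_size=64 --batch_size=8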

That's for training a very small transformer. My machine is an M1 Max with 64 GB of RAM. For a BERT-medium-like architecture, this is how it goes:

Overriding: dataset = shakespeare
Overriding: n_layer = 8
Overriding: n_head = 512
Overriding: n_embd = 512
Overriding: device = mps
Overriding: compile = False
Overriding: eval_iters = 1
Overriding: block_size = 64
Overriding: batch_size = 128

Initializing a new model from scratch
number of parameters: 50.98M
step 0: train loss 10.9816, val loss 10.9783
iter 0: loss 10.9711, time 4613.50ms
iter 1: loss 10.9673, time 5791.48ms
iter 2: loss 10.9647, time 7842.40ms
iter 3: loss 10.9646, time 10196.35ms
iter 4: loss 10.9604, time 11602.34ms
iter 5: loss 10.9495, time 9393.25ms
iter 6: loss 10.9615, time 10373.34ms

okpatil4u avatar Jan 16 '23 12:01 okpatil4u

@itakafu thank you for reporting, i'll add mentions of mps to the readme&code.

karpathy avatar Jan 16 '23 16:01 karpathy

test on MacBook Air M2, without charger:

with mps: roughly 150~200 ms per iteration
without mps: roughly 450~500 ms per iteration

just one reference point

@itakafu thank you for reporting, i'll add mentions of mps to the readme&code.

SiyuanHuang95 avatar Jan 18 '23 11:01 SiyuanHuang95

Confirmed it works great with device='mps'. But make sure to install this version of PyTorch:

$ pip3 install --pre torch torchvision torchaudio --extra-index-url https://download.pytorch.org/whl/nightly/cpu

I'm getting <40ms

Thank you SO MUCH for this

tomeck avatar Jan 20 '23 21:01 tomeck
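Before pointing train.py at MPS, it can help to confirm that the installed PyTorch build actually exposes it; a minimal check using the standard torch.backends.mps flags:

import torch

# True if this PyTorch build was compiled with MPS support
print("MPS built:", torch.backends.mps.is_built())
# True if this machine / macOS version can actually use it
print("MPS available:", torch.backends.mps.is_available())

device = "mps" if torch.backends.mps.is_available() else "cpu"
x = torch.randn(4, 4, device=device)
print(x.device)  # expect mps:0 on a working setup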

@tomeck Weird, I'm getting 300ms on an M2 (MacBook Air, 16GB):

python3 train.py --dataset=shakespeare --n_layer=4 --n_head=4 --n_embd=64 --device='mps' --compile=False --eval_iters=1 --block_size=64 --batch_size=8
Overriding: dataset = shakespeare
Overriding: n_layer = 4
Overriding: n_head = 4
Overriding: n_embd = 64
Overriding: device = mps
Overriding: compile = False
Overriding: eval_iters = 1
Overriding: block_size = 64
Overriding: batch_size = 8
vocab_size not found in data/shakespeare/meta.pkl, using GPT-2 default of 50257
Initializing a new model from scratch
number of parameters: 3.42M
step 0: train loss 10.8177, val loss 10.8162
iter 0: loss 10.8288, time 438.06ms
iter 1: loss 10.8117, time 303.12ms
iter 2: loss 10.8236, time 301.04ms
iter 3: loss 10.8265, time 299.64ms
iter 4: loss 10.8128, time 299.96ms
iter 5: loss 10.8173, time 299.72ms
iter 6: loss 10.8066, time 300.76ms
iter 7: loss 10.8084, time 299.86ms
iter 8: loss 10.8244, time 299.47ms

tombenj avatar Jan 29 '23 16:01 tombenj

Just out of curiosity: I'm getting 17ms with a Ryzen 7 5700X, a 3060 Ti, and 64 GB of RAM. What kind of iteration time does an A100 do? Are they dramatically faster? I have a friend with 2x 3080s and I'm considering doing the big one...

coltac avatar Jan 29 '23 19:01 coltac

Yep, the README documentation doesn't make sense in terms of the ms calculations on A100. It states: "Training on an 8 x A100 40GB node for ~500,000 iters (~1 day) atm gets down to ~3.1"

This would mean 500,000 iters / 86,400 s ≈ 5.787 iters per second, i.e. ~172.8 ms per iter across the whole node. And multiplying that by 8 to get a single A100... doesn't make sense.

tombenj avatar Jan 29 '23 19:01 tombenj

Oh, I'm being stupid: I'm getting 17ms on Shakespeare. I bet it'd be way higher on OpenWebText.

coltac avatar Jan 29 '23 21:01 coltac

Thanks to this thread I got it working on my M2 MacBook Pro - I wrote up some detailed notes here: https://til.simonwillison.net/llms/nanogpt-shakespeare-m2

simonw avatar Feb 01 '23 19:02 simonw

I also built a little tool: copy and paste the log output from training into it to get a chart:

https://observablehq.com/@simonw/plot-loss-from-nanogpt

Example output:

[chart: training loss vs. iteration]

simonw avatar Feb 01 '23 19:02 simonw
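For anyone who prefers to stay local, a minimal matplotlib sketch does the same job: save the training output to a file (log.txt is a hypothetical name) and extract the iter/loss pairs:

import re
import matplotlib.pyplot as plt

# parse lines like "iter 5: loss 10.9495, time 9393.25ms"
iters, losses = [], []
with open("log.txt") as f:   # hypothetical file containing pasted train.py output
    for line in f:
        m = re.match(r"iter (\d+): loss ([\d.]+)", line)
        if m:
            iters.append(int(m.group(1)))
            losses.append(float(m.group(2)))

plt.plot(iters, losses)
plt.xlabel("iteration")
plt.ylabel("train loss")
plt.show()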

I think the mps section of the readme may be inaccurate: my understanding is that mps just utilizes the on-chip GPU. To use the Neural Engine you'd have to port it to CoreML — which may or may not speed up training but should do wonders for inference. See PyTorch announcement here.

strikeroot avatar Feb 10 '23 00:02 strikeroot

For training, you have to use MPS. For inference you can use ANE.

okpatil4u avatar Feb 13 '23 12:02 okpatil4u
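For the inference path, the usual route to the ANE is converting a traced model with coremltools; a minimal sketch with a toy module standing in for the trained GPT (the real export would need the actual model and an example token tensor):

import torch
import coremltools as ct   # assumes coremltools is installed

# toy module standing in for the trained model; shapes are illustrative
model = torch.nn.Sequential(torch.nn.Linear(64, 64), torch.nn.GELU()).eval()
example = torch.randn(1, 64)

traced = torch.jit.trace(model, example)
mlmodel = ct.convert(
    traced,
    inputs=[ct.TensorType(shape=example.shape)],
    convert_to="mlprogram",
    compute_units=ct.ComputeUnit.ALL,   # lets Core ML schedule work on the ANE when possible
)
mlmodel.save("toy.mlpackage")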

Hey @simonw, thanks for sharing the tutorial on your website!

I tried it on my MacBook Air M2 and I'm getting much worse performance:

time python3 train.py \
  --dataset=shakespeare \
  --n_layer=4 \
  --n_head=4 \
  --n_embd=64 \
  --compile=False \
  --eval_iters=1 \
  --block_size=64 \
  --batch_size=8 \
  --device=mps
Overriding: dataset = shakespeare
Overriding: n_layer = 4
Overriding: n_head = 4
Overriding: n_embd = 64
Overriding: compile = False
Overriding: eval_iters = 1
Overriding: block_size = 64
Overriding: batch_size = 8
Overriding: device = mps
Initializing a new model from scratch
defaulting to vocab_size of GPT-2 to 50304 (50257 rounded up for efficiency)
number of parameters: 3.42M
using fused AdamW: False
step 0: train loss 10.8153, val loss 10.8133
iter 0: loss 10.8181, time 5264.63ms, mfu -100.00%
iter 1: loss 10.8291, time 1650.46ms, mfu -100.00%
iter 2: loss 10.8164, time 1651.38ms, mfu -100.00%
iter 3: loss 10.7927, time 1639.94ms, mfu -100.00%
iter 4: loss 10.8212, time 1644.10ms, mfu -100.00%
iter 5: loss 10.8067, time 1639.57ms, mfu 0.08%
iter 6: loss 10.8307, time 1635.84ms, mfu 0.08%
iter 7: loss 10.8345, time 1635.17ms, mfu 0.08%
iter 8: loss 10.8262, time 1637.88ms, mfu 0.08%
iter 9: loss 10.8275, time 1643.70ms, mfu 0.08%
iter 10: loss 10.8100, time 1643.38ms, mfu 0.08%
iter 11: loss 10.8100, time 1641.18ms, mfu 0.08%
iter 12: loss 10.8258, time 1647.17ms, mfu 0.08%
iter 13: loss 10.8169, time 1643.93ms, mfu 0.08%
iter 14: loss 10.8139, time 1645.54ms, mfu 0.08%
iter 15: loss 10.8107, time 1642.27ms, mfu 0.08%
iter 16: loss 10.8114, time 1642.16ms, mfu 0.08%
iter 17: loss 10.7969, time 1641.59ms, mfu 0.08%
iter 18: loss 10.8150, time 1643.31ms, mfu 0.08%

I'm currently on Python 3.11. I spent a couple of hours trying to reinstall everything, but it didn't help. Does anyone have an idea of what could be wrong here?

1234igor avatar Feb 15 '23 11:02 1234igor

MacBook M1 Max results on train_shakespeare_char:

python train.py config/train_shakespeare_char.py
Overriding config with config/train_shakespeare_char.py:
# train a miniature character-level shakespeare model
# good for debugging and playing on macbooks and such

out_dir = 'out-shakespeare-char'
eval_interval = 250 # keep frequent because we'll overfit
eval_iters = 200
log_interval = 10 # don't print too too often

# we expect to overfit on this small dataset, so only save when val improves
always_save_checkpoint = False

wandb_log = False # override via command line if you like
wandb_project = 'shakespeare-char'
wandb_run_name = 'mini-gpt'

dataset = 'shakespeare_char'
batch_size = 16
block_size = 256 # context of up to 256 previous characters

# baby GPT model :)
n_layer = 4
n_head = 4
n_embd = 256
dropout = 0.2

learning_rate = 1e-3 # with baby networks can afford to go a bit higher
max_iters = 5000
lr_decay_iters = 5000 # make equal to max_iters usually
min_lr = 1e-6 # learning_rate / 10 usually
beta2 = 0.999 # make a bit bigger because number of tokens per iter is small

warmup_iters = 100 # not super necessary potentially

# on macbook also add
device = 'mps'  # run on cpu only
compile = False # do not torch compile the model

found vocab_size = 65 (inside data/shakespeare_char/meta.pkl)

step 0: train loss 4.2326, val loss 4.2303
iter 0: loss 4.2329, time 9686.70ms, mfu -100.00%
step 5000: train loss 0.7204, val loss 1.5878
iter 5000: loss 0.9658, time 10224.29ms, mfu 0.48%

python sample.py --out_dir=out-shakespeare-char
Overriding: out_dir = out-shakespeare-char
WARNING: using slow attention. Flash Attention atm needs PyTorch nightly and dropout=0.0
WARNING: using slow attention. Flash Attention atm needs PyTorch nightly and dropout=0.0
WARNING: using slow attention. Flash Attention atm needs PyTorch nightly and dropout=0.0
WARNING: using slow attention. Flash Attention atm needs PyTorch nightly and dropout=0.0
number of parameters: 3.16M
Loading meta from data/shakespeare_char/meta.pkl...

The like order precious soner stout the morning's strength;
The month of his son bounded bones and rough
Since the common people'd courtesy 'gainst their times,
Your brats bear betwixt them away, and nothing
Against the gracious patern of their heads,
For their father is not their silly mouths,
Even in their voices and their loves.

MENENIUS:
You are received;
For they wear them, no more good to bed,
Your people have are endured with them not:
You'll have done as good to them be to brief

iSevenDays avatar Feb 20 '23 15:02 iSevenDays

It appears that after https://github.com/karpathy/nanoGPT/commit/086ebe1822791b775e951b4b562fbb7131d83cc2 was merged, training performance on M1/M2 is significantly slower.

deepaktalwardt avatar Feb 26 '23 19:02 deepaktalwardt
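For context: that commit introduced gradient accumulation, so each reported iteration now runs several forward/backward micro-steps before one optimizer step. A minimal runnable sketch of the pattern (a toy model and random batches stand in for the real training loop; the step count is an illustrative assumption, not the actual train.py default):

import torch

model = torch.nn.Linear(16, 1)
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-3)
gradient_accumulation_steps = 40   # >1 means this many micro-steps per reported "iter"

for it in range(3):
    for micro_step in range(gradient_accumulation_steps):
        X, Y = torch.randn(8, 16), torch.randn(8, 1)
        loss = torch.nn.functional.mse_loss(model(X), Y)
        # scale so the accumulated gradient matches one large batch
        (loss / gradient_accumulation_steps).backward()
    optimizer.step()
    optimizer.zero_grad(set_to_none=True)
    print(f"iter {it}: one optimizer step after {gradient_accumulation_steps} micro-steps")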

Thanks @deepaktalwardt!

I am using the command suggested by @simonw:

time python3 train.py \
  --dataset=shakespeare \
  --n_layer=4 \
  --n_head=4 \
  --n_embd=64 \
  --compile=False \
  --eval_iters=1 \
  --block_size=64 \
  --batch_size=8 \
  --device=mps

After reverting that commit, this is literally flying on my MacBook Pro M2 Max! So just make sure gradient_accumulation_steps is always equal to 1. Without reverting https://github.com/karpathy/nanoGPT/commit/086ebe1822791b775e951b4b562fbb7131d83cc2, it is 800ms per iter.

I stopped training after 10k iters, which took 4 min 18 s.

iter 10139: loss 3.9768, time 25.31ms, mfu 0.13%

KeyboardInterrupt

python train.py --dataset=shakespeare --n_layer=4 --n_head=4 --n_embd=64       232.40s user 72.33s system 117% cpu 4:18.81 total

nirajvenkat avatar Mar 11 '23 13:03 nirajvenkat
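An alternative to reverting the commit is overriding the setting on the command line; train.py accepts --key=value overrides for its config globals, and a later comment in this thread shows gradient_accumulation_steps being overridden this way:

python train.py \
  --dataset=shakespeare \
  --n_layer=4 --n_head=4 --n_embd=64 \
  --compile=False --eval_iters=1 \
  --block_size=64 --batch_size=8 \
  --device=mps \
  --gradient_accumulation_steps=1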

Has anyone tried 'mps' together with 'compile=True' and succeeded?

hanfluid avatar Mar 18 '23 06:03 hanfluid

+1 to reverting https://github.com/karpathy/nanoGPT/commit/086ebe1822791b775e951b4b562fbb7131d83cc2; I went from 1500ms to 70ms per iteration.

bcipolli avatar Mar 28 '23 04:03 bcipolli

indeed, I also made my own fork and reverted 086ebe1, resulting in a dramatic speedup on my Mac mini M1!

rozek avatar Mar 29 '23 06:03 rozek

Thanks to this thread I got it working on my M2 MacBook Pro - I wrote up some detailed notes here: https://til.simonwillison.net/llms/nanogpt-shakespeare-m2

Simon, thank you very much for your walkthrough of installing nanoGPT on Apple silicon. By the way, I just tried to run python sample.py after changing the device to mps, and it seems to work now: the script ~~spits out a few warnings but then~~ generates output without any problems, but it has to be run under macOS 13.x Ventura.

rozek avatar Mar 29 '23 07:03 rozek
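The change rozek describes amounts to overriding the device for sampling; sample.py takes the same --key=value overrides as train.py, so something like:

python sample.py --out_dir=out-shakespeare-char --device=mps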

Has anyone tried 'mps' together with 'compile=True' and succeeded?

Yep, as follows:

Overriding: dataset = shakespeare
Overriding: n_layer = 4
Overriding: n_head = 4
Overriding: n_embd = 128
Overriding: compile = True
Overriding: eval_iters = 20
Overriding: block_size = 64
Overriding: batch_size = 12
Overriding: device = mps
Overriding: log_interval = 1
Overriding: max_iters = 2000
Overriding: lr_decay_iters = 2000
Overriding: dropout = 0.0
Overriding: gradient_accumulation_steps = 1
Initializing a new model from scratch
defaulting to vocab_size of GPT-2 to 50304 (50257 rounded up for efficiency)
number of parameters: 7.23M
using fused AdamW: False
compiling the model... (takes a ~minute)
step 0: train loss 10.8272, val loss 10.8203
iter 0: loss 10.8421, time 2852.64ms, mfu -100.00%
iter 1: loss 10.8099, time 522.30ms, mfu -100.00%
...
iter 2000: loss 2.6286, time 1241.70ms, mfu 0.16%
python train.py config/train_shakespeare_char.py --dataset=shakespeare  420.38s user 105.07s system 49% cpu 17:34.84 total

~/nanoGPT master ± pip list | grep torch
torch          2.1.0.dev20230401
torchaudio     2.1.0.dev20230401
torchvision    0.16.0.dev20230401

~/nanoGPT master ± python --version
Python 3.9.6

Pixxinger avatar Apr 03 '23 00:04 Pixxinger

Reverting commit 086ebe1 or overriding gradient_accumulation_steps to 1 is no longer needed. This seems to have been fixed via the file config/train_shakespeare_char.py with commit 21f9bff. I can confirm 30ms or 775ms iteration times on an M1 Pro with mps, depending on whether I use the "I only have a MacBook" settings or plain python train.py config/train_shakespeare_char.py --device=mps --compile=False.

By the way, I also did not need the nightly PyTorch build for this; the version available on MacPorts did fine. I did have to comment out the code in train.py regarding init_process_group, destroy_process_group, and ddp (parallel processing on multiple GPUs).

0dB avatar Apr 23 '23 10:04 0dB
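Rather than commenting the DDP code out, a guard of the following shape should keep single-GPU/MPS runs untouched; a minimal sketch assuming the usual RANK environment-variable convention used by torchrun:

import os
import torch.distributed as dist

# only touch the process group when launched under torchrun / DDP
ddp = int(os.environ.get("RANK", -1)) != -1
if ddp:
    dist.init_process_group(backend="gloo")   # "nccl" needs CUDA, so assume gloo here

# ... training loop ...

if ddp:
    dist.destroy_process_group()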