
LOSS MISMATCH AT STEP 0: 2.864161 5.270007

dbl001 opened this issue 1 year ago

I'm running on a 27" iMac with macOS 14.4.1, using the MPS backend on an AMD Radeon Pro 5700 XT GPU. Could I get comments on this message: LOSS MISMATCH AT STEP 0: 2.864161 5.270007

% python train_gpt2.py 
/Users/davidlaxer/anaconda3/envs/AI-Feynman/lib/python3.10/site-packages/torch/library.py:168: UserWarning: Warning only once for all operators,  other operators may also be overrided.
  Overriding a previously registered kernel for the same operator and the same dispatch key
  operator: aten::var_mean.correction(Tensor self, int[1]? dim=None, *, Scalar? correction=None, bool keepdim=False) -> (Tensor, Tensor)
    registered at /Users/davidlaxer/pytorch/build/aten/src/ATen/RegisterSchema.cpp:6
  dispatch key: MPS
  previous kernel: registered at /Users/davidlaxer/pytorch/build/aten/src/ATen/RegisterCPU.cpp:31470
       new kernel: registered at /dev/null:1881 (Triggered internally at /Users/davidlaxer/pytorch/aten/src/ATen/core/dispatch/OperatorEntry.cpp:160.)
  self.m.impl(name, dispatch_key if dispatch_key != "" else "CompositeImplicitAutograd", fn)
using device: mps
loading weights from pretrained gpt: gpt2
loading cached tokens in data/TinyStories_val.bin
wrote gpt2_124M.bin
wrote gpt2_124M_debug_state.bin
iteration 0, loss: 2.86417818069458, time: 2833.053ms
iteration 1, loss: 2.071094512939453, time: 224.394ms
iteration 2, loss: 1.5036660432815552, time: 211.509ms
iteration 3, loss: 1.0592901706695557, time: 213.439ms
iteration 4, loss: 0.67463219165802, time: 220.032ms
iteration 5, loss: 0.41782039403915405, time: 212.604ms
iteration 6, loss: 0.23388634622097015, time: 218.274ms
iteration 7, loss: 0.1198703944683075, time: 216.175ms
iteration 8, loss: 0.07279403507709503, time: 214.477ms
iteration 9, loss: 0.05021585151553154, time: 216.127ms
<|endoftext|>Once upon a time, there was a brave little girl named Lily. She loved
---------------
(AI-Feynman) davidlaxer@bluediamond llm.c % ./test_gpt2            
[GPT-2]
max_seq_len: 1024
vocab_size: 50257
num_layers: 12
num_heads: 12
channels: 768
num_parameters: 124439808
[State]
batch_size: 4
seq_len: 64
num_activations: 73323776
-43.431774 -43.431667
-39.836498 -39.836399
-43.066059 -43.065937
OK (LOGITS)
LOSS OK: 2.864161 2.864178
dwte
OK 0.000305 0.000305
OK -0.001153 -0.001153
OK 0.002915 0.002916
OK 0.001172 0.001172
OK 0.001833 0.001833
TENSOR OK
dwpe
OK -0.000876 -0.000873
OK -0.001630 -0.001632
OK 0.000169 0.000171
OK 0.004004 0.004006
OK 0.001349 0.001350
TENSOR OK
dln1w
OK 0.001079 0.001080
OK 0.001543 0.001546
OK 0.005387 0.005396
OK -0.004479 -0.004483
OK 0.002376 0.002377
TENSOR OK
dln1b
OK -0.030615 -0.030608
OK -0.005774 -0.005792
OK 0.007526 0.007541
OK -0.002763 -0.002748
OK -0.005131 -0.005105
TENSOR OK
dqkvw
OK 0.000018 0.000018
OK -0.000016 -0.000016
OK 0.000007 0.000007
OK -0.000046 -0.000046
OK 0.000060 0.000060
TENSOR OK
dqkvb
OK 0.000082 0.000082
OK -0.000057 -0.000057
OK 0.000116 0.000115
OK -0.000353 -0.000353
OK 0.000134 0.000134
TENSOR OK
dattprojw
OK 0.000003 0.000003
OK -0.000052 -0.000052
OK 0.000032 0.000032
OK 0.000001 0.000001
OK 0.000003 0.000003
TENSOR OK
dattprojb
OK -0.000731 -0.000730
OK -0.000462 -0.000462
OK 0.001580 0.001580
OK 0.009938 0.009946
OK -0.010287 -0.010277
TENSOR OK
dln2w
OK 0.001610 0.001621
OK 0.003811 0.003809
OK -0.000234 -0.000234
OK -0.000306 -0.000306
OK 0.001870 0.001871
TENSOR OK
dln2b
OK 0.001294 0.001301
OK 0.002362 0.002358
OK -0.000720 -0.000718
OK 0.007980 0.007986
OK -0.009129 -0.009118
TENSOR OK
dfcw
OK -0.000160 -0.000160
OK -0.000017 -0.000017
OK 0.000180 0.000180
OK 0.000353 0.000354
OK -0.000134 -0.000134
TENSOR OK
dfcb
OK 0.000981 0.000979
OK 0.001020 0.001021
OK 0.000001 0.000001
OK 0.000065 0.000065
OK -0.000360 -0.000360
TENSOR OK
dfcprojw
OK -0.000043 -0.000043
OK 0.000180 0.000180
OK 0.000039 0.000039
OK 0.000073 0.000072
OK 0.000013 0.000013
TENSOR OK
dfcprojb
OK -0.000847 -0.000846
OK -0.000840 -0.000839
OK 0.001708 0.001708
OK 0.001528 0.001530
OK -0.000068 -0.000069
TENSOR OK
dlnfw
OK 0.000495 0.000495
OK 0.000482 0.000482
OK -0.000429 -0.000428
OK -0.000885 -0.000886
OK -0.000293 -0.000293
TENSOR OK
dlnfb
OK -0.002935 -0.002935
OK -0.009005 -0.009004
OK 0.000930 0.000929
OK 0.002993 0.002993
OK 0.007300 0.007300
TENSOR OK
step 0: loss 2.864161 (took 15857.063000 ms)
step 1: loss 2.071029 (took 15620.254000 ms)
step 2: loss 1.503640 (took 15320.660000 ms)
step 3: loss 1.059259 (took 15308.281000 ms)
step 4: loss 0.674684 (took 15283.269000 ms)
step 5: loss 0.418388 (took 15243.285000 ms)
step 6: loss 0.233658 (took 15149.562000 ms)
step 7: loss 0.119678 (took 15595.255000 ms)
step 8: loss 0.072536 (took 15444.902000 ms)
step 9: loss 0.050150 (took 15191.823000 ms)
LOSS MISMATCH AT STEP 0: 2.864161 5.270007
LOSS MISMATCH AT STEP 1: 2.071029 4.059707
LOSS MISMATCH AT STEP 2: 1.503640 3.375123
LOSS MISMATCH AT STEP 3: 1.059259 2.800783
LOSS MISMATCH AT STEP 4: 0.674684 2.315382
LOSS MISMATCH AT STEP 5: 0.418388 1.849029
LOSS MISMATCH AT STEP 6: 0.233658 1.394656
LOSS MISMATCH AT STEP 7: 0.119678 0.999147
LOSS MISMATCH AT STEP 8: 0.072536 0.624080
LOSS MISMATCH AT STEP 9: 0.050150 0.376511
overall okay: 0

dbl001 — Apr 11 '24 16:04

@oxyno-zeta I hope you're not bothered by me requesting reviews for these pull requests! Let me know if there's anything I can do to make things easier for you.

I've already tested this myself in the dev app, but wanted to make sure you're okay with my changes before merging into master. And you also have npm publish rights! 😄

Phanabani — Nov 10 '22 01:11

Hello @Phanabani ,

Thanks a lot for developing on this project. Reviewing your pull requests is always a pleasure.

One quick question: why do you consider the freeze a fix and not a feature? I'm not asking you to change it, since it could be either to me; just wondering.

I'll read it as soon as possible, maybe next week if not today (holidays start tomorrow ;) ).

Oxyno-zeta

oxyno-zeta — Nov 10 '22 06:11

Hi @oxyno-zeta!

I'm happy that you appreciate my work! It means a lot to me.

I considered the freeze a fix rather than a feature because a user mutating keyPath is unintended behavior and produces bad internal side effects, since we retain and reuse a reference to the array. I would only consider it a feature if those side effects weren't involved.
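The rationale can be illustrated with a minimal sketch (a hypothetical `makeKeyPath` helper, not the library's actual code): once the retained array is frozen, an attempted mutation fails instead of silently corrupting the reused reference.

```javascript
// Hypothetical sketch: freeze the retained keyPath array so user
// mutation cannot corrupt the internally reused reference.
function makeKeyPath(parts) {
  return Object.freeze([...parts]); // defensive copy, then freeze
}

const keyPath = makeKeyPath(["users", "0", "name"]);
try {
  // Array.prototype.push throws a TypeError on a frozen array.
  keyPath.push("extra");
} catch (e) {
  // Mutation rejected; the retained reference is unchanged.
}
console.log(keyPath.length); // → 3
```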

I hope you have a good holiday weekend! 😄

Phanabani — Nov 10 '22 18:11