
Enhanced tensor comparison with higher precision

Open anurag12-webster opened this issue 1 year ago • 2 comments

Instead of directly comparing the absolute difference of elements, we can square the difference, (a[i] - b[i]) * (a[i] - b[i]), and compare it against the correspondingly smaller threshold 1e-4 (the square of a 1e-2 absolute tolerance).
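A minimal sketch of the proposed check, assuming a `check_tensor`-style helper similar to the one in test_gpt2.c (the function name and printing here are hypothetical, not the repo's actual code):

```c
#include <stdio.h>

// Sketch of the proposed comparison: instead of testing
// fabsf(a[i] - b[i]) < 1e-2f, square the difference and compare it
// against the squared tolerance 1e-4f. The two tests agree because
// x*x < t*t iff |x| < t for t > 0.
int check_tensor_sq(const float *a, const float *b, int n, const char *label) {
    int ok = 1;
    for (int i = 0; i < n; i++) {
        float d = a[i] - b[i];
        if (d * d >= 1e-4f) {  // 1e-4 = (1e-2)^2
            ok = 0;
            printf("MISMATCH %s[%d]: %f vs %f\n", label, i, a[i], b[i]);
        }
    }
    return ok;
}
```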

[GPT-2]
max_seq_len: 1024
vocab_size: 50257
num_layers: 12
num_heads: 12
channels: 768
num_parameters: 124439808
[State]
batch_size: 4
seq_len: 64
num_activations: 73323776
-43.431686 -43.431732
-39.836414 -39.836441
-43.065941 -43.065971
OK (LOGITS)
LOSS OK: 5.269892 5.269998
dwte
OK -0.002320 -0.002320
OK 0.002072 0.002072
OK 0.003716 0.003717
OK 0.001307 0.001307
OK 0.000631 0.000632
TENSOR OK
dwpe
OK -0.005118 -0.005112
OK -0.000001 -0.000009
OK -0.003267 -0.003263
OK 0.009909 0.009913
OK 0.002155 0.002146
TENSOR OK
dln1w
OK -0.007520 -0.007525
OK 0.008624 0.008638
OK 0.005004 0.005024
OK -0.011099 -0.011095
OK -0.001666 -0.001665
TENSOR OK
dln1b
OK -0.038494 -0.038475
OK -0.030547 -0.030598
OK 0.010189 0.010218
OK 0.080134 0.080188
OK -0.060991 -0.060927
TENSOR OK
dqkvw
OK -0.000031 -0.000031
OK -0.000026 -0.000025
OK -0.000064 -0.000064
OK 0.000074 0.000074
OK 0.000020 0.000020
TENSOR OK
dqkvb
OK -0.000414 -0.000412
OK -0.000410 -0.000411
OK 0.000113 0.000113
OK -0.000564 -0.000565
OK 0.000574 0.000571
TENSOR OK
dattprojw
OK 0.000081 0.000081
OK -0.000005 -0.000005
OK -0.000019 -0.000019
OK 0.000005 0.000004
OK 0.000031 0.000031
TENSOR OK
dattprojb
OK 0.000456 0.000467
OK -0.009969 -0.009975
OK -0.001794 -0.001800
OK 0.037638 0.037609
OK -0.031287 -0.031252
TENSOR OK
dln2w
OK -0.018372 -0.018318
OK 0.004812 0.004814
OK 0.008084 0.008093
OK -0.001465 -0.001469
OK -0.002740 -0.002737
TENSOR OK
dln2b
OK -0.026405 -0.026364
OK -0.016712 -0.016694
OK 0.001067 0.001085
OK 0.034754 0.034732
OK -0.028630 -0.028592
TENSOR OK
dfcw
OK 0.000438 0.000439
OK -0.000000 -0.000000
OK -0.000153 -0.000154
OK -0.000165 -0.000165
OK 0.000404 0.000405
TENSOR OK
dfcb
OK 0.003282 0.003288
OK 0.002038 0.002042
OK -0.001386 -0.001386
OK 0.000381 0.000386
OK 0.001602 0.001604
TENSOR OK
dfcprojw
OK 0.000678 0.000680
OK 0.000073 0.000073
OK -0.000415 -0.000416
OK -0.000059 -0.000060
OK -0.000603 -0.000603
TENSOR OK
dfcprojb
OK 0.003572 0.003579
OK -0.007148 -0.007155
OK -0.001955 -0.001962
OK 0.001466 0.001463
OK 0.001219 0.001214
TENSOR OK
dlnfw
OK -0.000022 -0.000022
OK 0.000811 0.000811
OK 0.001161 0.001161
OK -0.002956 -0.002957
OK 0.001146 0.001145
TENSOR OK
dlnfb
OK -0.011101 -0.011101
OK 0.008007 0.008006
OK -0.004763 -0.004767
OK -0.005903 -0.005905

All the tests are passing.

anurag12-webster avatar Apr 09 '24 11:04 anurag12-webster

What is the advantage?

karpathy avatar Apr 09 '24 12:04 karpathy

What is the advantage?

The square operation is generally faster than the fabs function, since it avoids the extra logic needed to handle the sign, and on top of that the squared difference is a plain multiply that can be further accelerated through SIMD vectorization.

anurag12-webster avatar Apr 09 '24 14:04 anurag12-webster

ok for now i think

karpathy avatar Apr 10 '24 19:04 karpathy