llm.c
Fix the potential int overflow related to B*T*V
With

```c
int B = 8;
int T = 1024;
int V = 50257;
```

the ratio of (8 * 1024 * 50257) to INT_MAX is 0.191715. Therefore, if we make B = 64, the product B * T * V = 3,293,642,752 exceeds INT_MAX = 2,147,483,647 and overflows.
64-bit indexing used to be a lot slower than 32-bit indexing (to the point that PyTorch ships two kernels for some things), so you might also want to check the perf impact. (I don't know if it happens in your case or with more recent GPUs, but I had an encounter many years ago that surprised me a lot.)
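As a rough, CPU-only sketch for sanity-checking this: the array size, the clock()-based timing, and the loop bodies below are all illustrative assumptions, and a GPU kernel would need its own CUDA benchmark.

```c
#include <stdio.h>
#include <stdlib.h>
#include <time.h>

#define N (1u << 24)  /* 16M floats, ~64 MB */

/* Sum with a 32-bit loop counter. */
static float sum_int(const float* a) {
    float s = 0.0f;
    for (int i = 0; i < (int)N; i++) s += a[i];
    return s;
}

/* Same loop with a 64-bit (size_t) counter. */
static float sum_szt(const float* a) {
    float s = 0.0f;
    for (size_t i = 0; i < (size_t)N; i++) s += a[i];
    return s;
}

int main(void) {
    float* a = malloc(N * sizeof(float));
    if (!a) return 1;
    for (size_t i = 0; i < N; i++) a[i] = 1.0f;
    clock_t t0 = clock();
    float s1 = sum_int(a);
    clock_t t1 = clock();
    float s2 = sum_szt(a);
    clock_t t2 = clock();
    printf("int:    sum=%.0f  %.3fs\n", s1, (double)(t1 - t0) / CLOCKS_PER_SEC);
    printf("size_t: sum=%.0f  %.3fs\n", s2, (double)(t2 - t1) / CLOCKS_PER_SEC);
    free(a);
    return 0;
}
```

On current x86-64 compilers at -O2 the two loops usually produce near-identical code; the cost mentioned above was mainly a GPU-kernel concern, where 64-bit address arithmetic uses more registers.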
In lines like `const size_t N = (size_t)(B) * T * V;`, is the explicit cast needed?
Yes, assuming there is an overflow possibility: the values on the right are ints, so you'll want to upcast them to size_t (typically a 64-bit unsigned type). Otherwise the product is computed in int, and signed int overflow is undefined behavior in C. Based on a unit test I just ran, the compiler did NOT upcast and truncated the value on overflow, so the size_t cast on the right is needed.
The only question I have is whether B needs to be (B)? It seems not to matter if you leave it as B.
> In lines like `const size_t N = (size_t)(B) * T * V;`, is the explicit cast needed?
```c
#include <stdio.h>

int main(void) {
    int B = 64;
    int T = 1024;
    int V = 50257;
    size_t N1 = B * T * V;            // wrong: product computed in int, overflows
    size_t N2 = (size_t)(B) * T * V;  // correct version
    size_t N22 = (size_t) B * T * V;  // correct version
    int N3 = B * T * V;               // overflows in int
    size_t N4 = (size_t) N3;          // wrong: cast after the damage is done
    printf("N1: %zu\n", N1);
    printf("N2: %zu\n", N2);
    printf("N22: %zu\n", N22);
    printf("N3: %d\n", N3);
    printf("N4: %zu\n", N4);
    return 0;
}
```
The output:

```
N1: 18446744072708227072
N2: 3293642752
N22: 3293642752
N3: -1001324544
N4: 18446744072708227072
```
If we don't cast explicitly, as in N1, the product is first computed in the int range and then converted to size_t, which is exactly what N4 does; this is wrong. @rosslwheeler since both N2 and N22 are correct, the parentheses don't matter, but I'd like to keep the cast on B explicit.
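To spell out why the parentheses don't matter: the cast binds tighter than `*`, so both correct forms parse identically. A quick illustration (not from the PR):

```c
size_t a = (size_t)(B) * T * V;   /* cast applies to B only */
size_t b = (size_t)B * T * V;     /* identical: parses as ((size_t)B) * T * V */
size_t c = (size_t)(B * T * V);   /* different: product computed in int, then cast */
```

And for anyone checking the printed numbers: on a typical two's-complement machine the 32-bit product wraps to 3,293,642,752 - 2^32 = -1,001,324,544 (N3), and converting that negative int to a 64-bit size_t yields 2^64 - 1,001,324,544 = 18,446,744,072,708,227,072 (N1 and N4). Strictly, the signed overflow is undefined behavior, so even the wrapped value isn't guaranteed.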
@lancerts Is there an inherent speed improvement using ints vs. long ints? (Yes, I saw the comment from t-vi above.) Just curious.
Sorry, I meant the casts look ugly to my eye. Maybe we could make the individual params `size_t` in the function declarations 🤔, so their products will come out `size_t` as well. hmm
Ah, I asked the same thing above and again later. Can we just use `size_t` or `long int` in the variable definitions instead of `int` to get rid of the casting? Or is there a speed impact?
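For concreteness, something like this is what I have in mind (purely illustrative):

```c
size_t B = 64, T = 1024, V = 50257;  /* dims defined as size_t up front */
size_t N = B * T * V;                /* product computed in size_t, no casts */
```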
> Sorry, I meant the casts look ugly to my eye. Maybe we could make the individual params `size_t` in the function declarations 🤔, so their products will come out `size_t` as well. hmm
Agreed, B, T, V themselves are most likely within int range; I don't think they will overflow individually in any meaningful setup, so I didn't change the individual parameters to size_t.
The concern is: if we make them all size_t, does it slow down kernels in actual training? (A simple setup seems OK, but I can't be sure for general settings, i.e., the actual GPT training step.)
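For reference, a minimal sketch of the signature-level alternative discussed above. The function shape is illustrative (loosely modeled on llm.c's loss kernel), not the PR's actual diff:

```c
#include <math.h>
#include <stddef.h>

/* Hypothetical sketch: with the dims declared as size_t, all index math
 * in the body happens in size_t and no explicit casts are needed. */
void crossentropy_forward(float* losses, const float* probs,
                          const int* targets, size_t B, size_t T, size_t V) {
    for (size_t i = 0; i < B * T; i++) {       /* size_t loop bound, no overflow */
        const float* probs_i = probs + i * V;  /* offset also computed in size_t */
        losses[i] = -logf(probs_i[targets[i]]);
    }
}
```

The trade-off is exactly the perf question above: the body now does 64-bit index arithmetic throughout, whereas the cast-at-the-top approach keeps the parameters int and pays for the widening once per product.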