llm.c icon indicating copy to clipboard operation
llm.c copied to clipboard

Fix the potential int overflow related to B*T*V

Open lancerts opened this issue 10 months ago • 8 comments

With

    int B = 8;
    int T = 1024;
    int V = 50257;

The ratio of (8 * 1024 * 52507) to INT_MAX is: 0.200298. Therefore, if we make B = 64, B * T * V overflows.

lancerts avatar Apr 16 '24 20:04 lancerts

64 bit indexing used to be a lot slower than 32 bit indexing (to the point that PyTorch ships two kernels for some things), so you might also want to check the perf impact. (I don't know if it happens in your case/with more recent GPUs, but I had an encounter that surprised me a lot with that many years ago.)

t-vi avatar Apr 17 '24 06:04 t-vi

In lines like const size_t N = (size_t)(B) * T * V; is the explicit cast needed?

karpathy avatar Apr 17 '24 20:04 karpathy

Yes, assuming there is an overflow possibility - the values on the right are int's so you'll want to upcast them to size_t (long int). Otherwise, the behavior isn't specified by the C standard on overflow. Based on a unit test I just ran, the compiler did NOT upcast and truncated the value on overflow so the size_t cast on the right is needed.

The only question I have is whether B needs to be (B)? It seems to not matter if you leave it B?

rosslwheeler avatar Apr 17 '24 21:04 rosslwheeler

In lines like const size_t N = (size_t)(B) * T * V; is the explicit cast needed?

    int B = 64;
    int T = 1024;
    int V = 50257;


    size_t N1 = B * T * V;
    size_t N2 = (size_t)(B) * T * V;       // correct version 
    size_t N22 = (size_t) B * T * V;       // correct version 
    int N3 = B * T * V;
    size_t N4 = (size_t) N3;

    printf("N1: %zu\n", N1);
    printf("N2: %zu\n", N2);
    printf("N22: %zu\n", N22);
    printf("N3: %d\n", N3);
    printf("N4: %zu\n", N4);

N1: 18446744072708227072                                                                                                        
N2: 3293642752                                                                                                                  
N22: 3293642752
N3: -1001324544
N4: 18446744072708227072

If we dont cast explicitly, i.e., N1 is basically first compute in the int range then cast to size_t, which is N4, this is wrong. @rosslwheeler as both N2 and N22 are correct so I think the () does not matter, but I'd like to keep explicit cast on B explicit.

lancerts avatar Apr 17 '24 21:04 lancerts

@lancerts Is there an inherent speed improvement using int's vs. long int? (Yes, I saw the comment from t-vi above) Just curious.

rosslwheeler avatar Apr 17 '24 21:04 rosslwheeler

Sorry I meant the casts look ugly to my eye. Maybe we could make the individual params size_t in the function declarations 🤔, so their products will come out size_t as well. hmm

karpathy avatar Apr 17 '24 22:04 karpathy

Ah - was asking the same above later too. Can we just use size_t or long int in the variable definitions instead of int to get rid of the casting? Or is there speed impact?

rosslwheeler avatar Apr 17 '24 22:04 rosslwheeler

Sorry I meant the casts look ugly to my eye. Maybe we could make the individual params size_t in the function declarations 🤔, so their products will come out size_t as well. hmm

Agreed, B, T, V themselves are most likely within int range, don't think they will overflow individually in meaningful setup so didn't change to size_t for individual parameter.

The concern is, if we make them all size_t, does it slow down kernels in actual training (simple setup seems OK but I cannot be sure for general settings, i.e., the actual GPT training step)..

lancerts avatar Apr 17 '24 22:04 lancerts