
Using Accelerate for vector scale

Open philipturner opened this issue 1 year ago • 8 comments

We could use Accelerate to scale the vector here, similarly to how add and exp use Accelerate.

https://github.com/ggerganov/ggml/blob/2992df03010bb6afe399f13378f20ed45b0758c8/src/ggml.c#L3250-L3277

https://developer.apple.com/documentation/accelerate/1450020-vdsp_vsmul

philipturner avatar May 24 '23 20:05 philipturner
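For reference, the proposed change amounts to something like the following sketch. The function name and the `GGML_USE_ACCELERATE` guard mirror ggml's existing pattern for add and exp; the plain loop is a portable stand-in for the SIMD fallback, not ggml's actual code:

```c
#include <stddef.h>

#if defined(GGML_USE_ACCELERATE)
#include <Accelerate/Accelerate.h>
#endif

/* Sketch: scale n floats in place by v.
 * On Apple platforms, vDSP_vsmul(input, stride, &scalar, output, stride, n)
 * does this in one call; elsewhere, fall back to a plain loop. */
static void vec_scale_f32(const int n, float * y, const float v) {
#if defined(GGML_USE_ACCELERATE)
    vDSP_vsmul(y, 1, &v, y, 1, n);
#else
    for (int i = 0; i < n; ++i) {
        y[i] *= v;
    }
#endif
}
```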

I naively thought adding an `#if defined` at the top and setting the type correctly for `vDSP_vsmul` would solve the problem easily. But when I modify the code like the following, I get a segmentation fault. What do you think is the problem?

```c
inline static void ggml_vec_scale_f32(const int n, float * y, const float v) {
#if defined(GGML_USE_ACCELERATE)
    vDSP_vsmul(y, 1, y, (float*) &v, 1, n);
#elif defined(GGML_SIMD)
    // ... code below unchanged
```

jaeminSon avatar May 25 '23 23:05 jaeminSon

Try narrowing that into a standalone C++ or Swift program. Does the fault still happen?

Btw I think you could vastly improve the softmax part by writing vectorized code that fuses each kernel call. Calling into Accelerate this way makes it memory bound with most time spent reading and writing everything from L1.

philipturner avatar May 26 '23 05:05 philipturner

I don’t know why, but llama.cpp is much slower than it should theoretically be. Going by @ggerganov’s CPU bandwidth figure (200 GB/s), the CPU cores should eat the entire 6.7B-q4.5 model in 16 ms. But for some reason the token latency is 43 ms.

That’s a 2-3x speedup we could have by redesigning the code, not just an incremental improvement.

philipturner avatar May 26 '23 06:05 philipturner
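As a back-of-envelope check on those numbers (assuming "6.7B-q4.5" means roughly 4.5 bits per weight; at exactly 4 bits per weight the figure lands near the quoted 16 ms):

```c
/* Back-of-envelope: ideal per-token latency if every weight is
 * streamed from memory exactly once per token. Both the
 * bits-per-weight and the bandwidth figure are assumptions taken
 * from the discussion above. */
static double ideal_ms_per_token(double params, double bits_per_weight,
                                 double bytes_per_second) {
    double bytes = params * bits_per_weight / 8.0;
    return bytes / bytes_per_second * 1e3;
}

/* Example: ideal_ms_per_token(6.7e9, 4.5, 200e9) is about 18.8 ms;
 * at 4.0 bits/weight it drops to about 16.8 ms. */
```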

> Going by @ggerganov’s CPU bandwidth (200 GB/s)

This number is Apple's claim for the memory bandwidth of the M1 Pro, if I remember correctly. I haven't been able to reproduce this speed. The best I've seen is ~80-90 GB/s: https://github.com/ggerganov/llama.cpp/issues/34#issuecomment-1538750779

And regarding a single thread, it's no more than 40 GB/s.

ggerganov avatar May 26 '23 06:05 ggerganov

A single thread has reached 100 GB/s in some benchmarks. When it's occupied with other work, or the code is improperly written, it can't utilize all of that. But then there are 8 cores in total to harness that bandwidth.

On the GPU (M1 Max), I have achieved 378 GB/s out of 400 GB/s in a custom Metal blit command. It requires careful tuning - aligning the data structure to 64B boundaries. From what I can tell, llama.cpp's data is not aligned.

https://github.com/philipturner/metal-usm/blob/23e9f324fd3e4ecb1078cf6b211bd25753de718b/BlitEncoderAlternative/MainFile.swift#L31-L60

Going so far as to shuffle data around in threadgroup memory, just so whatever it eats and spits out is 64B aligned:

https://github.com/philipturner/metal-usm/blob/23e9f324fd3e4ecb1078cf6b211bd25753de718b/BlitEncoderAlternative/Kernels.metal#L177-L200

philipturner avatar May 26 '23 06:05 philipturner
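On the CPU side, a 64-byte-aligned buffer can be requested explicitly with C11's `aligned_alloc`. A minimal sketch, not ggml's actual allocator:

```c
#include <stdlib.h>
#include <stdint.h>

/* C11 aligned_alloc requires the size to be a multiple of the
 * alignment, so round up. 64 bytes matches a typical cache line
 * (and the alignment the Metal blit experiment above tunes for). */
static float * alloc_aligned_f32(size_t count) {
    size_t bytes = count * sizeof(float);
    size_t rounded = (bytes + 63) & ~(size_t)63; /* round up to 64B */
    return (float *)aligned_alloc(64, rounded);
}
```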

> Try narrowing that into a standalone C++ or Swift program. Does the fault still happen?

Shame! It should be

```c
vDSP_vsmul(y, 1, (float*) &v, y, 1, n);
```

No segmentation fault anymore!

jaeminSon avatar May 26 '23 11:05 jaeminSon
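The crash makes sense given `vDSP_vsmul`'s parameter order: input vector and its stride first, then a pointer to the scalar, then the output vector and its stride, then the count. The original call passed `y` where the scalar pointer belongs and `&v` where the output belongs, so the routine wrote results through `&v`. A portable reference with the same parameter order, for illustration only:

```c
#include <stddef.h>

/* Portable stand-in with vDSP_vsmul's parameter order:
 * A = input vector, IA = its stride, B = pointer to the scalar,
 * C = output vector, IC = its stride, N = element count.
 * The crashing call had B and C swapped, so vDSP treated the
 * address of a single float as the output vector. */
static void vsmul_ref(const float *A, ptrdiff_t IA,
                      const float *B,
                      float *C, ptrdiff_t IC, size_t N) {
    for (size_t i = 0; i < N; ++i) {
        C[i * IC] = A[i * IA] * (*B);
    }
}
```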

I ran it several times, but GGML_SIMD tends to be faster.

hardware: MacBook Pro (Retina, 13-inch, Early 2015), 2.7 GHz dual core Intel Core i5, 8GB 1867 MHz DDR3, Intel Iris Graphics 6100 1536 MB
os: Mac OS Monterey (v12.6)
gpt-model: Cerebras-GPT-111M

Output with GGML_SIMD:

```
this is a tokenization test, but the user is getting the response.
I'm trying to test for a method called response:
         if (user == null)
             return response.getJSON().text()

        response.getJSON().text()

        response.getJSON().text()

This is the code that works. The exception is the method that I use to get the response.
I would appreciate any help, in any event.

A:

You're getting the response.
This is the tokenization test

It's probably the first time you've used it, but you're not exactly sure how to do it.
There are many things you can do to improve the way you are able to work with this code. It's the only way you can change a tokenization test for the model,

main: mem per token =  1712332 bytes
main:     load time =   715.21 ms
main:   sample time =    59.12 ms
main:  predict time =  9944.77 ms / 48.51 ms per token
main:    total time = 12506.17 ms
```

Output using vDSP_vsmul:

```
main: prompt: 'this is a tokenization test'
main: number of tokens in prompt = 6, first 8 tokens: 5661 318 257 11241 1634 1332 

this is a tokenization test with this method and this method has a user-defined tokenizer.

main: mem per token =  1712332 bytes
main:     load time =   842.32 ms
main:   sample time =    61.45 ms
main:  predict time = 11836.78 ms / 57.74 ms per token
main:    total time = 15593.00 ms
```

jaeminSon avatar May 26 '23 11:05 jaeminSon

Try replacing some other Accelerate calls with vectorized code. Bonus if you can fuse two elementwise operations of the softmax without writing the element back to memory in between.

```swift
// NOTE: Softmax is expected to consume the most time, due to the latency of
// each function call and inability to keep the elements in registers.
// Consider writing vectorized Swift code for a fairer comparison to GPU.

// Pseudocode for softmax operation:
// (1) find maximum element in each row
// (2) subtract the maximum from all elements
// (3) apply the exponential operator to all elements
// (4) find the sum of each row
// (5) divide all elements by the sum
for i in 0..<UInt(NQ) {
  // The elements to operate on.
  let n = UInt(NKV)
  let row = _QK + Int(i * n)
  
  // (1)
  var maxValue: Float = 0
  vDSP_maxv(row, 1, &maxValue, n)
  assert(maxValue != 0)
  
  // (2)
  maxValue = -maxValue
  vDSP_vsadd(row, 1, &maxValue, row, 1, n)
  
  // (3)
  vvexpf(row, row, &NKV)
  
  // (4)
  var sumValue: Float = 0
  vDSP_sve(row, 1, &sumValue, n)
  
  // (5)
  sumValue = simd_precise_recip(sumValue)
  vDSP_vsmul(row, 1, &sumValue, row, 1, n)
}
```

Becomes

```swift
for i in 0..<UInt(NQ) {
  // The elements to operate on.
  let n = UInt(NKV)
  let row = _QK + Int(i * n)
  
  // (1)
  var maxValue: Float = 0
  vDSP_maxv(row, 1, &maxValue, n)
  assert(maxValue != 0)

  // PSEUDOCODE STARTS
  typealias Vector = SIMD16<Float> // Try multiple vector lengths.
  var sumValueVec: Vector = .zero
  for i in 0..<n / Vector.elementCount { // TODO: Handle the last iteration carefully.
    let i_amp = i * Vector.elementCount
    let pointer = (row + i_amp).reinterpret_cast(Vector.self)

    // (2)
    // (3)
    let value = exp(pointer.pointee - maxValue)
    pointer.pointee = value

    // (4)
    sumValueVec += value
  }
  var sumValue: Float = sumValueVec.sum()
  // PSEUDOCODE ENDS
  
  // (5)
  sumValue = simd_precise_recip(sumValue)
  vDSP_vsmul(row, 1, &sumValue, row, 1, n)
}
```

philipturner avatar May 26 '23 12:05 philipturner

> Try narrowing that into a standalone C++ or Swift program. Does the fault still happen?

> Shame! It should be
>
> ```c
> vDSP_vsmul(y, 1, (float*) &v, y, 1, n);
> ```
>
> No segmentation fault anymore!

Why are you casting? It seems redundant.

nullhook avatar Jul 05 '23 22:07 nullhook