Using Accelerate for vector scale
We could use Accelerate to scale the vector here, similarly to how `add` and `exp` use Accelerate.
https://github.com/ggerganov/ggml/blob/2992df03010bb6afe399f13378f20ed45b0758c8/src/ggml.c#L3250-L3277
https://developer.apple.com/documentation/accelerate/1450020-vdsp_vsmul
I naively thought that adding an `#if defined` at the top and setting the type correctly for `vDSP_vsmul` would solve the problem easily. But when I modify the code as follows, I get a segmentation fault. What do you think is the problem?
```c
inline static void ggml_vec_scale_f32(const int n, float * y, const float v) {
#if defined(GGML_USE_ACCELERATE)
    vDSP_vsmul(y, 1, y, (float*) &v, 1, n);
#elif defined(GGML_SIMD)
    // ... code below left intact
```
Try narrowing that into a standalone C++ or Swift program. Does the fault still happen?
Btw, I think you could vastly improve the softmax part by writing vectorized code that fuses the kernel calls. Calling into Accelerate this way makes it memory bound, with most of the time spent reading and writing everything from L1.
I don’t know why, but LLaMa.cpp is much slower than it should theoretically be. Going by @ggerganov’s CPU bandwidth (200 GB/s), the CPU cores should eat the entire 6.7B-q4.5 model in 16 ms. But for some reason the token latency is 43 ms.
That’s a 2-3x speed up we could have by redesigning the code, not just an incremental speed up.
Going by @ggerganov’s CPU bandwidth (200 GB/s)
This number is Apple's claim for the memory bandwidth of M1 Pro if I remember correctly. I haven't been able to reproduce this speed. The best I've seen is ~80-90 GB/s: https://github.com/ggerganov/llama.cpp/issues/34#issuecomment-1538750779
And regarding single thread, it's no more than 40GB/s
A single thread has reached 100 GB/s in some benchmarks. When it's occupied with other work, or the code is poorly written, it can't utilize all of that. But then there are 8 cores in total to harness that bandwidth.
On the GPU (M1 Max), I have achieved 378 GB/s out of 400 GB/s in a custom Metal blit command. It requires careful tuning: aligning the data structure to 64B boundaries. From what I can tell, LLaMa.cpp is not aligned.
https://github.com/philipturner/metal-usm/blob/23e9f324fd3e4ecb1078cf6b211bd25753de718b/BlitEncoderAlternative/MainFile.swift#L31-L60
Going so far as to shuffle data around in threadgroup memory, just so whatever it eats and spits out is 64B aligned:
https://github.com/philipturner/metal-usm/blob/23e9f324fd3e4ecb1078cf6b211bd25753de718b/BlitEncoderAlternative/Kernels.metal#L177-L200
Try narrowing that into a standalone C++ or Swift program. Does the fault still happen?
Shame on me! It should be:

```c
vDSP_vsmul(y, 1, (float*) &v, y, 1, n);
```

No segmentation fault anymore!
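To sanity-check the fix outside ggml, along the lines of the suggestion above, a minimal standalone C program could look like the following sketch (file name, compile command, and test values are just illustrative):

```c
// scale_test.c - standalone check of the corrected vDSP_vsmul call (illustrative sketch).
// Build on macOS with: clang scale_test.c -framework Accelerate -o scale_test
#include <Accelerate/Accelerate.h>
#include <stdio.h>

static inline void vec_scale_f32(const int n, float * y, const float v) {
    // vDSP_vsmul(input, input stride, pointer to scalar, output, output stride, count)
    vDSP_vsmul(y, 1, &v, y, 1, (vDSP_Length) n);
}

int main(void) {
    float y[4] = { 1.0f, 2.0f, 3.0f, 4.0f };
    vec_scale_f32(4, y, 0.5f);
    for (int i = 0; i < 4; i++) {
        printf("%f\n", y[i]); // expect 0.500000 1.000000 1.500000 2.000000
    }
    return 0;
}
```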
I ran it several times, but GGML_SIMD tends to be faster.
hardware: MacBook Pro (Retina, 13-inch, Early 2015), 2.7 GHz dual core Intel Core i5, 8GB 1867 MHz DDR3, Intel Iris Graphics 6100 1536 MB
os: Mac OS Monterey (v12.6)
gpt-model: Cerebras-GPT-111M
Output with GGML_SIMD:

```
this is a tokenization test, but the user is getting the response.
I'm trying to test for a method called response:
if (user == null)
return response.getJSON().text()
response.getJSON().text()
response.getJSON().text()
This is the code that works. The exception is the method that I use to get the response.
I would appreciate any help, in any event.
A:
You're getting the response.
This is the tokenization test
It's probably the first time you've used it, but you're not exactly sure how to do it.
There are many things you can do to improve the way you are able to work with this code. It's the only way you can change a tokenization test for the model,
main: mem per token = 1712332 bytes
main: load time = 715.21 ms
main: sample time = 59.12 ms
main: predict time = 9944.77 ms / 48.51 ms per token
main: total time = 12506.17 ms
```
Output using vDSP_vsmul:

```
main: prompt: 'this is a tokenization test'
main: number of tokens in prompt = 6, first 8 tokens: 5661 318 257 11241 1634 1332
this is a tokenization test with this method and this method has a user-defined tokenizer.
main: mem per token = 1712332 bytes
main: load time = 842.32 ms
main: sample time = 61.45 ms
main: predict time = 11836.78 ms / 57.74 ms per token
main: total time = 15593.00 ms
```
Try replacing some other Accelerate calls with vectorized code. Bonus if you can fuse two elementwise operations of the softmax without writing the element back to memory in between.
```swift
// NOTE: Softmax is expected to consume the most time, due to the latency of
// each function call and inability to keep the elements in registers.
// Consider writing vectorized Swift code for a fairer comparison to GPU.
// Pseudocode for softmax operation:
// (1) find maximum element in each row
// (2) subtract the maximum from all elements
// (3) apply the exponential operator to all elements
// (4) find the sum of each row
// (5) divide all elements by the sum
for i in 0..<UInt(NQ) {
    // The elements to operate on.
    let n = UInt(NKV)
    let row = _QK + Int(i * n)
    // (1)
    var maxValue: Float = 0
    vDSP_maxv(row, 1, &maxValue, n)
    assert(maxValue != 0)
    // (2)
    maxValue = -maxValue
    vDSP_vsadd(row, 1, &maxValue, row, 1, n)
    // (3)
    vvexpf(row, row, &NKV)
    // (4)
    var sumValue: Float = 0
    vDSP_sve(row, 1, &sumValue, n)
    // (5)
    sumValue = simd_precise_recip(sumValue)
    vDSP_vsmul(row, 1, &sumValue, row, 1, n)
}
```
Becomes
```swift
for i in 0..<UInt(NQ) {
    // The elements to operate on.
    let n = UInt(NKV)
    let row = _QK + Int(i * n)
    // (1)
    var maxValue: Float = 0
    vDSP_maxv(row, 1, &maxValue, n)
    assert(maxValue != 0)
    // PSEUDOCODE STARTS
    typealias Vector = SIMD16<Float> // Try multiple vector lengths.
    var sumValueVec: Vector = .zero
    for i in 0..<n / Vector.elementCount { // TODO: Handle the last iteration carefully.
        let i_amp = i * Vector.elementCount
        let pointer = (row + i_amp).reinterpret_cast(Vector.self)
        // (2)
        // (3)
        let value = exp(pointer.pointee - maxValue)
        pointer.pointee = value
        // (4)
        sumValueVec += value
    }
    var sumValue: Float = sumValueVec.sum()
    // PSEUDOCODE ENDS
    // (5)
    sumValue = simd_precise_recip(sumValue)
    vDSP_vsmul(row, 1, &sumValue, row, 1, n)
}
```
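As a rough plain-C sketch of the same idea (my own illustration, with made-up names, not ggml's implementation): steps (2), (3), and (4) are folded into a single pass over the row, so each element is read and written only once between finding the maximum and the final scale, instead of making a separate trip through memory for every Accelerate call.

```c
#include <math.h>

// Illustrative fused softmax over one row (names and structure are assumptions).
static void softmax_row_fused(float * row, const int n) {
    // (1) find the maximum element
    float max_value = row[0];
    for (int i = 1; i < n; i++) {
        if (row[i] > max_value) max_value = row[i];
    }
    // (2)+(3)+(4) subtract the max, exponentiate, and accumulate the sum
    // in one pass, so the intermediate value stays in a register
    float sum = 0.0f;
    for (int i = 0; i < n; i++) {
        const float e = expf(row[i] - max_value);
        row[i] = e;
        sum += e;
    }
    // (5) divide all elements by the sum
    const float inv_sum = 1.0f / sum;
    for (int i = 0; i < n; i++) {
        row[i] *= inv_sum;
    }
}
```

With optimization enabled the compiler can usually auto-vectorize these loops; explicit NEON or GGML_SIMD intrinsics could be layered on top if profiling shows it is worth it.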
Shame on me! It should be:

```c
vDSP_vsmul(y, 1, (float*) &v, y, 1, n);
```

No segmentation fault anymore!
Why are you casting? It seems redundant.
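For reference, the declaration in the Accelerate headers (paraphrased from vDSP.h; see the documentation link above) already takes the scalar through a pointer to const float, so `&v` should compile without the cast:

```c
// Paraphrased from <Accelerate/Accelerate.h> (vDSP.h):
// C[i] = A[i] * B[0], for i in 0..N-1
extern void vDSP_vsmul(const float *__A, vDSP_Stride __IA,
                       const float *__B,
                       float       *__C, vDSP_Stride __IC,
                       vDSP_Length  __N);
```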