ggml
[2GPU] Memcpy2D of matrixXmatrix -- src size (and form)
Hello @ggerganov, thank you for the awesome project.
```cpp
ggml_cuda_op_mul_mat() {
    ...
    // copy dst to host or other device if necessary
    if (!dst_on_device) {
        void * dst_off_device;
        cudaMemcpyKind kind;
        if (dst->backend == GGML_BACKEND_CPU) {
            dst_off_device = dst->data;
            kind = cudaMemcpyDeviceToHost;
        } else if (dst->backend == GGML_BACKEND_GPU) {
            dst_off_device = dst_extra->data_device[g_main_device];
            kind = cudaMemcpyDeviceToDevice;
        } else {
            GGML_ASSERT(false);
        }
        if (split) {
            // src0 = weight matrix is saved as a transposed matrix for better memory layout.
            // dst is NOT transposed.
            // The outputs of matrix matrix multiplications can therefore NOT simply be concatenated for >1 GPU.
            // Instead they need to be copied to the correct slice in ne0 = dst row index.
            // If dst is a vector with ne0 == 1 then you don't have to do this but it still produces correct results.
            float * dhf_dst_i = (float *) ((char *) dst_off_device + i02*nb2 + i03*nb3);
            GGML_ASSERT(dst->nb[1] == ne0*sizeof(float));
            dhf_dst_i += src1_col_0*ne0 + row_low[id];
            CUDA_CHECK(cudaMemcpy2DAsync(dhf_dst_i, ne0*sizeof(float), dst_dd_i, row_diff*sizeof(float),
                                         row_diff*sizeof(float), src1_ncols, kind, stream));
        } else {
            float * dhf_dst_i = (float *) ((char *) dst_off_device + i02*nb2 + i03*nb3);
            GGML_ASSERT(dst->nb[1] == ne0*sizeof(float));
            dhf_dst_i += src1_col_0*ne0;
            CUDA_CHECK(cudaMemcpyAsync(dhf_dst_i, dst_dd_i, src1_ncols*ne0*sizeof(float), kind, stream));
        }
    }

    // add event for the main device to wait on until other device is done
    if (split && (id != g_main_device || is != 0)) {
        CUDA_CHECK(cudaEventRecord(src0_extra->events[id][is], stream));
    }
    ...
}
```
I'm trying to understand how this code is supposed to work.
What I have established so far:
- When `split` is true, the call to `cudaMemcpy2DAsync()` results in SIGSEGV, because of a nullptr dereference deep inside `libhipamd64.so`.
- The total src size (W×H) is half of `src1_ncols*ne0*sizeof(float)` when running on two GPUs.
- The dst pointer, memory, memory size, and memory contents (checked under rocgdb at runtime) seem to be OK.
- `src mem + (W×H - 1)` points to the last usable byte of the weights matrix, which also seems to be OK. Here W×H is `row_diff*sizeof(float)*src1_ncols`.
- If I change the 2D copy to a classic async memcpy, the crash is gone, but the results are wrong.
What I'm asking for a hint on, or an explanation of, please:
1. How and why should dpitch and spitch be formed?
2. Why is the src pitch so large?
3. Why is the src height so small?
4. Do 2 and 3 perform the un-transposing of the result?
5. Why does it crash? W×H is OK, but W×H*pitch is out of bounds?
Thank you 💯
From the cudaMemcpy2D documentation:
> [in] width — Width of matrix transfer (columns in bytes)
> [in] height — Height of matrix transfer (rows)
Hey, it seems I found a bug.
`CUDA_CHECK(hipMemPtrGetInfo(dst_dd_i, &srcSize));` says that the src size is `0x9a00` in my case.
`row_diff*sizeof(float)*src1_ncols == 0x9200` in my case, but srcPitch is `0x2480`.
If I understand right, to finish the Memcpy2D we need at least M = (N / (Y / 2)) * Y.
Output from gdb confirms this:
```
pwndbg> hex SRC
+0000 0x7ff60d413000 c0 1b 52 bf 98 e9 b8 3f aa 2c d7 bf 38 59 3e bf │..R.│...?│.,..│8Y>.│
+0010 0x7ff60d413010 76 71 8d bd 49 1c 58 bf ab 2e 26 bf 97 ad 12 3f │vq..│I.X.│..&.│...?│
+0020 0x7ff60d413020 4b 2f 32 3f 17 30 2b 3e 3e 97 03 c0 b4 9f de bf │K/2?│.0+>│>...│....│
+0030 0x7ff60d413030 70 5c 26 be da 86 3d be 8e 6d 5f 3f b0 2a 09 3f │p\&.│..=.│.m_?│.*.?│
pwndbg> hex SRC+0x91f0
+0000 0x7ff60d41c1f0 e6 9c 32 3e 76 3a 41 be 0c 26 5b 3c 78 ba 3b 3e │..2>│v:A.│.&[<│x.;>│
+0010 0x7ff60d41c200 be be be be be be be be be be be be be be be be │....│....│....│....│
...
+9820 0x7ff60d41d000 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 │....│....│....│....│
```
@ggerganov, what do you think about this?
And there may be more issues around P2P memory access, but I'm not sure yet.
I'm afraid it will be difficult for me to help here, because I don't have a multi-GPU system to test with and I am not very familiar with this code. In general, multi-GPU support in ggml
is in a poor state and probably some things don't work and can be improved.
If you think you've found a bug, please provide a proposed fix and steps to reproduce.
I'm trying to do this.
At the moment I can't understand why we can use the result of src0 × src1 as-is on a single GPU, but can't on several GPUs.
For hipBLAS we can't use memcpy2D to copy from gpu1 to gpu0, because memcpy2D doesn't support P2P memory access. We should copy by hand in a loop using memcpyDtoD... at least for GPUs which don't support P2P access.
As for a twin-GPU setup, I'm lucky to have access to a system with that hardware at work.
Upd: I implemented a loop to copy the data from src to dst with the correct pitches, using DtoD copies. This avoids the crash, but the results are incorrect (the model just generates nonsense).
Could somebody explain what we expect to be in dst after the call to ggml_cuda_op_mul_mat?
Upd2: dst_on_device is true when running on a single GPU, so this if-case is simply skipped and then everything works great. split is true as well when running on a single GPU. Both src0 and src1 are non-transposed when running on a single GPU or on multiple GPUs.