Preconditioner overhead
All Krylov solvers are designed to take a preconditioner. If no preconditioner is present, an explicit copy of the Krylov vector is performed instead. We see this can introduce significant overhead: e.g., for the thermal2 test case, it accounts for 33% of the BiCGSTAB runtime.
We should think about a better solution. Maybe point to the un-preconditioned vector?
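To make the overhead concrete, here is a minimal standalone model of the situation (illustrative names only, not Ginkgo code): the identity fallback turns every preconditioner application into a plain copy of the Krylov vector.

```cpp
// Minimal standalone model of the problem (illustrative names, not Ginkgo code).
#include <cstddef>
#include <vector>

struct LinOp {
    virtual void apply(const std::vector<double>& b, std::vector<double>& x) const = 0;
    virtual ~LinOp() = default;
};

// Fallback used when no preconditioner is given: x = I * b, i.e. a plain copy.
struct IdentityPrecond : LinOp {
    void apply(const std::vector<double>& b, std::vector<double>& x) const override {
        x = b;  // copies the whole Krylov vector, every iteration
    }
};

int main() {
    const std::size_t n = 1228045;  // thermal2-sized vector (9824360 B / 8 B, from the log below)
    std::vector<double> r(n), z(n);
    IdentityPrecond precond;
    // In every solver iteration this moves 2 * n * 8 bytes through memory
    // without doing any numerical work - the overhead discussed above.
    precond.apply(r, z);
}
```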
Checking the GPU preconditioner application in more detail, it consists of an allocation, a copy, and a free. I can understand the first two parts, but where does the free come from? If it invokes the identity matrix copy, there should be no free...
[LOG] >>> allocation started on Executor[gko::CudaExecutor,0x757ba0] with Bytes[9824360]
[LOG] >>> allocation completed on Executor[gko::CudaExecutor,0x757ba0] at Location[0x7fffcf400000] with Bytes[9824360]
[LOG] >>> copy started from Executor[gko::CudaExecutor,0x757ba0] to Executor[gko::CudaExecutor,0x757ba0] from Location[0x7fffbce00000] to Location[0x7fffcf400000] with Bytes[9824360]
[LOG] >>> copy completed from Executor[gko::CudaExecutor,0x757ba0] to Executor[gko::CudaExecutor,0x757ba0] from Location[0x7fffbce00000] to Location[0x7fffcf400000] with Bytes[9824360]
[LOG] >>> free started on Executor[gko::CudaExecutor,0x757ba0] at Location[0x7fffa8c00000]
[LOG] >>> free completed on Executor[gko::CudaExecutor,0x757ba0] at Location[0x7fffa8c00000]
There should be only a single copy - no allocation and no free; I can't see where they are coming from right now.
I think I found the problem. Here is what is happening:

- Identity matrix's apply calls `copy_from`: https://github.com/ginkgo-project/ginkgo/blob/develop/core/matrix/identity.cpp#L49
- `copy_from` is generated through the `EnableAbstractPolymorphicObject` mixin (included through the `EnableLinOp` => `EnablePolymorphicObject` => `EnableAbstractPolymorphicObject` chain), and calls the virtual `copy_from_impl` method: https://github.com/ginkgo-project/ginkgo/blob/develop/core/base/polymorphic_object.hpp#L317
  - NOTE: we have a bug with the logger events here - we do not log the `polymorphic_object_copy_started` and `[...]_completed` events in this case (we only log them in the base `PolymorphicObject` version, but not in the mixin ones). Though this is not the cause of the behavior we see here.
- `copy_from_impl` is implemented by `EnablePolymorphicObject`, and since our result is a `Dense` vector, it will try to use the `convert_to(Dense *)` method from the `ConvertibleTo<Dense>` interface of the source operator: https://github.com/ginkgo-project/ginkgo/blob/develop/core/base/polymorphic_object.hpp#L461
- Since the source is a `Dense` vector, its `ConvertibleTo<Dense>` interface is implemented by the `EnablePolymorphicAssignment` mixin (included through the `EnableLinOp` => `EnablePolymorphicAssignment` chain), which calls the `Dense` assignment operator `*result = static_cast<Dense>(*this)`. And I think the `static_cast` here is the problem - it will first call the `Dense` copy constructor to perform the cast, creating a temporary object (the allocation and the copy in the log), and then move-assign that temporary into the result (presumably the free of the result's previous buffer that we see in the log). https://github.com/ginkgo-project/ginkgo/blob/develop/core/base/polymorphic_object.hpp#L502
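The effect of that `static_cast` can be reproduced in isolation. A standalone sketch (not Ginkgo code) showing that casting to the type by value goes through the copy constructor and a temporary, while casting to a const reference does not:

```cpp
// Standalone illustration of why the cast-by-value in the assignment path
// causes an extra allocation + free: it creates a temporary copy, which is
// then moved into the result.
#include <cstdio>

struct Dense {
    Dense() = default;
    Dense(const Dense&) { std::puts("copy ctor: new allocation + device copy"); }
    Dense& operator=(Dense&&) noexcept {
        std::puts("move assign: result's old buffer is released");
        return *this;
    }
    Dense& operator=(const Dense&) {
        std::puts("copy assign: single device copy, no extra alloc/free");
        return *this;
    }
};

int main() {
    Dense source, result;
    // What the mixin effectively does: temporary copy, then move-assignment.
    result = static_cast<Dense>(source);         // copy ctor + move assign
    // What we would want instead: assign through a reference, no temporary.
    result = static_cast<const Dense&>(source);  // copy assign only
}
```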
I'll try to implement this a bit differently and see if we still encounter the same problem.
So as a summary, what is left after the fix of #166 is that we still have a copy of the full residual vector even when the method has no preconditioner (due to using an "Identity" preconditioner). We want a way to avoid any preconditioner call in such cases, but this implies a rather different algorithm from our current one for all solvers.
I would first determine whether that copy is even a bottleneck. What we were measuring before is allocation + copy + free. If you look at the timeline in #166, the copy itself looks negligible, but for some reason there is a gap between the call to the copy and when it actually happens. We should try to figure out why that is, and then fix the root cause.
In general, I wouldn't blindly complicate the implementation just because we think something is the bottleneck without having evidence for it. This also applies to #161.
That's why I would recommend using a profiling tool (either nvvp, or paraver if we need more flexibility), which can give us more details about what is really happening.
I uploaded here a trace taken on a V100 on Juwels to figure out the copy time for BiCGSTAB on thermal2.
We extracted the timing by hand with @hartwiganzt for one iteration of BiCGSTAB (technically a "half loop iteration") and compared the preconditioner apply time to all the other kernel calls. Here is what we see:
+ residual check: >100 us (to fix)
+ step1: 47 us
+ preconditioner copy: 30 us
+ spmv: 200 us
+ dot: 25-30 us, two times
+ step2: 70 us
If we remove the residual check, which is still "bugged" in this version, the preconditioner copy is roughly 7% of one half iteration (30 us out of about 47 + 30 + 200 + 2*27 + 70 ≈ 400 us). The other half iteration should be similar; for reference, the step3 kernel takes 90 us, there are again two dots, one spmv, etc.
From this information, we can conclude two things:
- This 7% is rather small and might be even smaller for other (denser) matrices. We could consider it acceptable.
- If we find a better way of handling the preconditioner copy which does not imply code duplication, then we could nonetheless gain an extra <7%.
step1: 3n reads + 1n writes -> 47 us (778 GB/s)
precond. copy: 1n reads + 1n writes -> 30 us (609 GB/s)
spmv (csr?): 3/2(nnz + n) reads + 1n writes -> 200 us (593 GB/s)
dot: 2n reads + some sync -> 25-30 us (609-731 GB/s)
step2: 2n reads + 1n writes -> 70 us (392 GB/s)
step3: 5n reads + 2n writes -> 90 us (711 GB/s)
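These figures can be reproduced from the access counts above. A small sketch, assuming n = 9824360 B / 8 B = 1228045 (taken from the allocation size in the log), nnz ≈ 8580313 for thermal2, 8-byte values, and reading "GB/s" as GiB/s (2^30 B/s), which is what matches the listed numbers:

```cpp
// Effective bandwidth = bytes moved / kernel time. Assumed inputs: n derived
// from the allocation size logged above (9824360 / 8), thermal2 nnz ~ 8580313,
// 8-byte doubles; "GB/s" interpreted as GiB/s to match the figures in the list.
#include <cstdio>

int main() {
    const double n = 1228045, nnz = 8580313, B = 8, GiB = 1u << 30;
    auto bw = [&](double elems, double time_us) {
        return elems * B / (time_us * 1e-6) / GiB;
    };
    std::printf("step1        %6.0f GiB/s\n", bw(4 * n, 47));                 // reported above: 778
    std::printf("precond copy %6.0f GiB/s\n", bw(2 * n, 30));                 // reported above: 609
    std::printf("spmv         %6.0f GiB/s\n", bw(1.5 * (nnz + n) + n, 200));  // reported above: 593
    std::printf("dot          %6.0f-%.0f GiB/s\n", bw(2 * n, 30), bw(2 * n, 25));  // 609-731
    std::printf("step2        %6.0f GiB/s\n", bw(3 * n, 70));                 // reported above: 392
    std::printf("step3        %6.0f GiB/s\n", bw(7 * n, 90));                 // reported above: 711
}
```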
Notes:
- The theoretical peak BW of the V100 is 900 GB/s.
- The `bandwidthTest` CUDA sample assumes that `cudaMemcpy` has a total of 2n memory transfers (1 read, 1 write), so 609 GB/s should be the correct number.
- I modified CUDA's `0_Simple/vectorAdd` sample and got the counters from `nvprof`. A version that computes `C = A + B` on 128MB vectors reads 256MB from DRAM to L2 and writes 128MB from L2 to DRAM. A version that computes `C = C + A + B` reads 384MB from DRAM to L2 and writes 128MB from L2 to DRAM. Thus, it seems that GPUs (at least my GTX965M) use the no-write-allocate policy, and output-only data does not have to be counted in the total reads (I think this is different from the default CPU behavior).
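For reference, a hypothetical reconstruction of that experiment (the exact modified sample is not attached here); the only difference between the two kernels is whether C is read in addition to being written, which can be checked against the DRAM read/write counters reported by nvprof:

```cuda
// Hypothetical reconstruction of the modified vectorAdd experiment.
#include <cuda_runtime.h>

__global__ void add2(const float* a, const float* b, float* c, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) c[i] = a[i] + b[i];          // C is write-only here
}

__global__ void add3(const float* a, const float* b, float* c, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) c[i] = c[i] + a[i] + b[i];   // C is read and written
}

int main() {
    const int n = 32 * 1024 * 1024;          // 128 MB per float vector
    float *a, *b, *c;
    cudaMalloc(&a, n * sizeof(float));
    cudaMalloc(&b, n * sizeof(float));
    cudaMalloc(&c, n * sizeof(float));
    add2<<<(n + 255) / 256, 256>>>(a, b, c, n);  // measured: ~256 MB read, ~128 MB written
    add3<<<(n + 255) / 256, 256>>>(a, b, c, n);  // measured: ~384 MB read, ~128 MB written
    cudaDeviceSynchronize();
    cudaFree(a); cudaFree(b); cudaFree(c);
}
```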
Questions remain:
- Why is a simple copy, which just uses the CUDA runtime API, as slow as (or even slower than) a dot product?
- Why is step2 so slow?
> If we find a better way of handling the preconditioner copy which does not imply code duplication, then we could nonetheless gain an extra <7%.
At least in the BiCGSTAB case, the preconditioner ties the quantities `p` and `y`, and `s` and `z` (note that this discussion would be a lot easier if we used meaningful names, as per #29).

`y` is used in the first spmv, finalize, and step3. Between the preconditioner application that generates `y` and these operations, `p` is not modified (it is only modified in step1 of the subsequent iteration). Thus, making `y` a reference to `p` when the preconditioner is the identity would remove the need for this data copy.

`z` is used in the second spmv and step3, and `s` is only modified in step2 of the subsequent iteration. Thus, `z` can reference the same memory as `s` if the preconditioner is an identity. A sketch of this aliasing idea follows below.
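A minimal sketch of the aliasing idea (illustrative only, not Ginkgo's actual solver code; a gko::matrix::Dense would take the place of the plain vector used here):

```cpp
// When the preconditioner is the identity, y can share storage with p and
// z with s, removing the per-iteration copy entirely. This is valid because
// p (resp. s) is not modified between the preconditioner application and the
// last use of y (resp. z) in that iteration.
#include <cassert>
#include <cstddef>
#include <memory>
#include <vector>

using Vec = std::vector<double>;  // stand-in for a gko::matrix::Dense column

struct BicgstabWorkspace {
    std::shared_ptr<Vec> p, s;  // Krylov vectors
    std::shared_ptr<Vec> y, z;  // "preconditioned" vectors

    BicgstabWorkspace(std::size_t n, bool has_real_preconditioner)
        : p(std::make_shared<Vec>(n)), s(std::make_shared<Vec>(n))
    {
        if (has_real_preconditioner) {
            y = std::make_shared<Vec>(n);  // separate buffers, filled by apply()
            z = std::make_shared<Vec>(n);
        } else {
            y = p;  // identity preconditioner: alias instead of copy
            z = s;
        }
    }
};

int main() {
    BicgstabWorkspace with_precond(1000, true), without(1000, false);
    assert(with_precond.y != with_precond.p);  // distinct buffers
    assert(without.y == without.p);            // shared storage, no copy needed
}
```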
Just ran `1_Utilities/p2pBandwidthLatencyTest` on Juwels. Intra-device copy should achieve around 750 GB/s. Note, though, that we're using `cudaMemcpyPeer` while the CUDA sample uses `cudaMemcpyPeerAsync` to do the copy.
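For clarity, a minimal sketch of the two copy paths (illustrative, not Ginkgo's CudaExecutor code); per the CUDA documentation, cudaMemcpyPeer serializes with other pending work on the involved devices, while cudaMemcpyPeerAsync is enqueued into a stream and avoids that synchronization:

```cpp
#include <cuda_runtime.h>

int main() {
    const size_t bytes = 9824360;  // same size as the residual vector in the log above
    void *src, *dst;
    cudaSetDevice(0);
    cudaMalloc(&src, bytes);
    cudaMalloc(&dst, bytes);

    // What we currently use: peer copy that serializes with other device work
    // (here device 0 to device 0, i.e. an intra-device copy).
    cudaMemcpyPeer(dst, 0, src, 0, bytes);

    // What the CUDA sample uses: asynchronous peer copy issued into a stream.
    cudaStream_t stream;
    cudaStreamCreate(&stream);
    cudaMemcpyPeerAsync(dst, 0, src, 0, bytes, stream);
    cudaStreamSynchronize(stream);

    cudaStreamDestroy(stream);
    cudaFree(src);
    cudaFree(dst);
}
```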
Detailed results
[P2P (Peer-to-Peer) GPU Bandwidth Latency Test]
Device: 0, Tesla V100-SXM2-16GB, pciBusID: 5d, pciDeviceID: 0, pciDomainID:0
Device: 1, Tesla V100-SXM2-16GB, pciBusID: 5e, pciDeviceID: 0, pciDomainID:0
Device: 2, Tesla V100-SXM2-16GB, pciBusID: 88, pciDeviceID: 0, pciDomainID:0
Device: 3, Tesla V100-SXM2-16GB, pciBusID: 89, pciDeviceID: 0, pciDomainID:0
Device=0 CAN Access Peer Device=1
Device=0 CAN Access Peer Device=2
Device=0 CAN Access Peer Device=3
Device=1 CAN Access Peer Device=0
Device=1 CAN Access Peer Device=2
Device=1 CAN Access Peer Device=3
Device=2 CAN Access Peer Device=0
Device=2 CAN Access Peer Device=1
Device=2 CAN Access Peer Device=3
Device=3 CAN Access Peer Device=0
Device=3 CAN Access Peer Device=1
Device=3 CAN Access Peer Device=2
***NOTE: In case a device doesn't have P2P access to other one, it falls back to normal memcopy procedure.
So you can see lesser Bandwidth (GB/s) and unstable Latency (us) in those cases.
P2P Connectivity Matrix
D\D 0 1 2 3
0 1 1 1 1
1 1 1 1 1
2 1 1 1 1
3 1 1 1 1
Unidirectional P2P=Disabled Bandwidth Matrix (GB/s)
D\D 0 1 2 3
0 741.22 9.78 11.01 11.00
1 9.86 749.76 11.02 10.99
2 11.03 10.97 751.20 9.74
3 11.04 10.98 9.78 749.76
Unidirectional P2P=Enabled Bandwidth (P2P Writes) Matrix (GB/s)
D\D 0 1 2 3
0 741.22 48.60 48.57 48.58
1 48.58 749.76 48.58 48.58
2 48.60 48.60 751.20 48.60
3 48.57 48.58 48.57 752.65
Bidirectional P2P=Disabled Bandwidth Matrix (GB/s)
D\D 0 1 2 3
0 757.76 10.39 17.80 18.95
1 10.37 755.56 17.24 17.89
2 17.82 17.33 759.23 10.40
3 18.88 17.73 10.42 759.97
Bidirectional P2P=Enabled Bandwidth Matrix (GB/s)
D\D 0 1 2 3
0 753.38 97.01 96.94 96.99
1 97.01 757.03 97.03 97.03
2 97.04 97.06 756.29 97.04
3 97.06 97.04 97.04 754.83
P2P=Disabled Latency Matrix (us)
GPU 0 1 2 3
0 2.11 17.87 18.66 18.62
1 17.52 2.16 18.74 18.66
2 18.53 18.51 2.15 17.57
3 18.05 17.97 17.52 2.18
CPU 0 1 2 3
0 3.73 9.35 9.44 9.59
1 8.98 3.52 9.25 9.46
2 9.32 9.18 3.75 9.75
3 9.38 9.20 9.71 3.81
P2P=Enabled Latency (P2P Writes) Matrix (us)
GPU 0 1 2 3
0 2.14 2.09 2.09 2.10
1 2.17 2.17 2.17 2.17
2 2.11 2.10 2.15 2.12
3 2.11 2.10 2.11 2.20
CPU 0 1 2 3
0 4.00 2.85 2.67 2.67
1 2.65 3.61 2.74 2.64
2 2.82 2.86 3.72 2.81
3 2.86 2.87 2.83 3.87
NOTE: The CUDA Samples are not meant for performance measurements. Results may vary when GPU Boost is enabled.