Small maximum single-buffer memory allocation on GPUs limits particle size
Current OpenCL drivers from Nvidia limit the maximum size of a single memory object to 1/4 of the device memory. This is a big restriction for adda, as the major memory consumer is a single buffer (the Dmatrix). On recent Nvidia consumer GPUs, adda can thus only use about 1/3 of the total device memory. AMD seems to default to 1/2 of the total device memory for a single buffer, so the problem is less severe on AMD cards. One possible improvement would be #119, which would enable doubling the particle sizes by using smaller memory allocations. Another possibility might be to keep the Dmatrix in host memory and access it directly from the GPU via a host pointer, which would probably come with a performance decrease.
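For illustration, the host-pointer variant could look roughly like this in plain OpenCL (a minimal sketch with hypothetical names, not ADDA's actual code):

```c
/* Sketch only: back the buffer with host RAM via CL_MEM_USE_HOST_PTR
 * instead of a device-side allocation. Whether the driver truly keeps
 * the data in host memory (zero-copy) is implementation-dependent; on
 * discrete GPUs every access then goes over PCIe, hence the expected
 * performance decrease. */
#include <CL/cl.h>
#include <stdlib.h>

cl_mem create_host_backed_dmatrix(cl_context ctx, size_t bytes, cl_int *err)
{
    void *host_dmatrix = malloc(bytes); /* stands in for ADDA's Dmatrix */
    if (host_dmatrix == NULL) { *err = CL_OUT_OF_HOST_MEMORY; return NULL; }
    return clCreateBuffer(ctx, CL_MEM_READ_ONLY | CL_MEM_USE_HOST_PTR,
                          bytes, host_dmatrix, err);
}
```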
Is it possible to split the Dmatrix into several parts? It has six independent components anyway, doesn't it? I even think that long ago those six components were split into separate variables in the main sequential (and MPI) code, but that was probably before we started the source control. Such splitting may make access to GPU memory less optimal (I am not sure), but then you can consider splitting into three parts instead of six.
Splitting again into the independent components is a good idea for the OpenCL version, to get around the allocation problem. Three parts would probably be enough for that; using all six components would add five more memory buffers as kernel arguments. However, the maximum allowed number of buffers as kernel arguments is device-dependent and on the order of ~10, so this could be a problem for some devices. The splitting could even be faster, since the indexing into the entire Dmatrix is currently not very linear, so caching is not very efficient at the moment.
Only the `arith3` and `arith3_surface` kernels should be affected by this change, if I see it correctly.
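For illustration, on the host side the three-way split could look roughly like this (a sketch; the kernel and buffer names are hypothetical, and the pairing of the six tensor components xx, xy, xz, yy, yz, zz into three buffers is just one possible choice):

```c
#include <CL/cl.h>

/* Sketch only: pass three Dmatrix parts as separate kernel arguments
 * instead of one monolithic buffer, so no single allocation has to hold
 * the whole Dmatrix. Each part would hold two of the six components. */
static cl_int set_split_dmatrix_args(cl_kernel arith3_kernel,
                                     cl_mem part0, cl_mem part1, cl_mem part2)
{
    cl_int err;
    err = clSetKernelArg(arith3_kernel, 0, sizeof(cl_mem), &part0);
    if (err != CL_SUCCESS) return err;
    err = clSetKernelArg(arith3_kernel, 1, sizeof(cl_mem), &part1);
    if (err != CL_SUCCESS) return err;
    return clSetKernelArg(arith3_kernel, 2, sizeof(cl_mem), &part2);
}
```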
I'll see if I find the time to play around a bit with that later this week.
Concerning the maximum allowed number of buffers as kernel arguments: is there some number that is always available, as mandated by the OpenCL standard? If yes, it is best to try to fit into that, since tuning for particular devices is definitely not what we want to do.
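For reference, the standard does not limit the number of __global buffer arguments directly. What the OpenCL 1.2 full profile mandates is CL_DEVICE_MAX_PARAMETER_SIZE of at least 1024 bytes for the total size of all kernel arguments (which leaves room for dozens of buffer handles), and CL_DEVICE_MAX_CONSTANT_ARGS of at least 8, which only counts __constant-qualified arguments. A quick way to check a given device (sketch):

```c
#include <CL/cl.h>
#include <stdio.h>

/* Print the two device limits relevant to adding more buffer arguments. */
void print_kernel_arg_limits(cl_device_id dev)
{
    size_t max_param = 0;       /* total bytes of all kernel arguments */
    cl_uint max_const_args = 0; /* __constant-qualified arguments only */
    clGetDeviceInfo(dev, CL_DEVICE_MAX_PARAMETER_SIZE,
                    sizeof(max_param), &max_param, NULL);
    clGetDeviceInfo(dev, CL_DEVICE_MAX_CONSTANT_ARGS,
                    sizeof(max_const_args), &max_const_args, NULL);
    printf("max parameter size: %zu bytes, max __constant args: %u\n",
           max_param, (unsigned)max_const_args);
}
```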
I experimented with splitting the Dmatrix into the same chunks as the innermost loop, which is in principle the thickness of the slices in the x-direction. Instead of handing the whole Dmatrix over to the `arith3` kernel, the kernel argument points only to the relevant part of the Dmatrix, so that only that chunk can be used inside the kernel. The number of chunks is then the same as the number of slices in the central loop around `arith3` in `matvec`. This, however, breaks the current `transposed` keyword, as the kernel cannot access those indices of the Dmatrix. The `arith3_surface` kernel should work well with this modification, but I did not test it. Maybe there is an elegant solution for the transposed indexing, but it will require some changes in `matvec`.
The version without working `transposed` is here: https://github.com/mapclyps/adda/commit/2899c72efb60494b44b1c8b3115f66283adc9b13
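For illustration, one way to express "the kernel argument points only to the relevant part" that also removes the single big allocation is to create one buffer per chunk up front and pass the current chunk to the kernel inside the loop (a sketch with hypothetical names; not necessarily how the linked commit does it):

```c
#include <CL/cl.h>
#include <stdlib.h>

/* Sketch only: allocate the Dmatrix as nchunks separate buffers, so the
 * largest single allocation is total_bytes/nchunks. Error handling is
 * reduced to a minimum; total_bytes is assumed divisible by nchunks. */
cl_mem *alloc_dmatrix_chunks(cl_context ctx, size_t total_bytes,
                             size_t nchunks, cl_int *err)
{
    cl_mem *chunks = malloc(nchunks * sizeof(cl_mem));
    size_t chunk_bytes = total_bytes / nchunks;
    for (size_t i = 0; i < nchunks; i++) {
        chunks[i] = clCreateBuffer(ctx, CL_MEM_READ_ONLY, chunk_bytes,
                                   NULL, err);
        if (*err != CL_SUCCESS) return NULL;
    }
    return chunks;
}

/* In the central loop in matvec, only the current slice's chunk would be
 * set as the (hypothetical) Dmatrix argument of arith3:
 *   clSetKernelArg(clarith3, dmat_arg_index, sizeof(cl_mem), &chunks[islice]);
 */
```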
@myurkin @mapclypes ... is this limitation of particle size still present in the current OpenCL code?
Could the memory constraints be overcome if a good programmer put in 1-2 months of developer time?
I'd appreciate it if you could give quick feedback on this; there might be an opportunity to contribute.
@mapclyps see previous comment
@vondele Yes, unfortunately it is still present in the current version of the code. Some experiments with my local modifications showed that the actual difference discussed in this issue can be pretty small. It is not about a general limit on particle size due to the total GPU memory, but about an artificial allocation limit for a single buffer in GPU memory. That means, for example, that at most 50% of the GPU memory can be used, while in theory only a slightly bigger particle could be calculated anyway.
For the bigger goal, the removal of the GPU memory limit, one could consider whether it is worth transferring parts of the Dmatrix back and forth on each iteration in order to make the host memory, i.e. RAM, the limiting factor. However, memory transfers between GPU and host memory are quite expensive, an overhead that is avoided at the moment. It would require some investigation whether this is worth the effort.
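For illustration, that staging variant could look roughly like this (a sketch with hypothetical names; a blocking write is used for simplicity, while a real implementation would probably double-buffer to overlap transfers with kernel execution):

```c
#include <CL/cl.h>

/* Sketch only: keep the full Dmatrix in host RAM and, on each iteration,
 * copy just the slice needed by the next kernel launch into a small,
 * reused device buffer. The per-iteration PCIe transfer is exactly the
 * overhead discussed above. */
cl_int stage_dmatrix_slice(cl_command_queue queue, cl_mem dev_slice,
                           const void *host_dmatrix, size_t slice_bytes,
                           size_t slice_index)
{
    const char *src = (const char *)host_dmatrix + slice_index * slice_bytes;
    return clEnqueueWriteBuffer(queue, dev_slice, CL_TRUE /* blocking */,
                                0, slice_bytes, src, 0, NULL, NULL);
}
```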
If the latter is your concern, @vondele, we should maybe open another issue for a detailed discussion. What do you think, @myurkin?
OK, I see. Thanks.
I agree with @mapclyps that the problem is still present, but probably not so severe. I have just run a few simulations with the current version of the code, compiled under Windows 10 64-bit, with an Nvidia GeForce GTX 1050 (in addition to the Intel video card). The following finishes well, although the maximum allocated object is 50% larger than declared possible by the driver:
adda_ocl -ntheta 10 -size 8 -m 1.05 0 -gpu 1 -grid 160
...
Using OpenCL device GeForce GTX 1050, based on NVIDIA CUDA.
Device memory: total - 2048 MB, maximum object - 512 MB
...
OpenCL memory usage: peak total - 1645.1 MB, maximum object - 759.4 MB
...
(The first three arguments limit the number of iterations and the time for calculation of the scattered fields, and `-gpu 1` chooses the Nvidia video card. The memory usage is mostly determined by `-grid ...`.)
But a bit larger simulation fails:
adda_ocl -ntheta 10 -size 8 -m 1.05 0 -gpu 1 -grid 160 160 180
...
OpenCL memory usage: peak total - 1823.8 MB, maximum object - 853.7 MB
...
ERROR: (../fft.c:294) CL error code -4: Memory object allocation failure
Then I tried to save some memory with `-opt mem`, resulting in
adda_ocl -ntheta 10 -size 8 -m 1.05 0 -gpu 1 -grid 160 160 180 -opt mem
...
OpenCL memory usage: peak total - 1496.8 MB, maximum object - 853.7 MB
...
finishes OK, but
adda_ocl -ntheta 10 -size 8 -m 1.05 0 -gpu 1 -grid 160 180 180 -opt mem
...
OpenCL memory usage: peak total - 1656.9 MB, maximum object - 959.8 MB
...
ERROR: (../oclmatvec.c:158) CL error code -4: Memory object allocation failure
A few additional notes:
- Both errors above propagate not from allocation calls but from the execution of OpenCL kernels. That is a known issue of delayed error reporting, which complicates the analysis.
- I have also tried `-iter bicg` to decrease the memory used by the iterative solver, but that seems to affect neither the peak nor the single-object memory.
- Using `-prognosis` should be done with caution in this case due to #250.
To summarize the data above, Nvidia definitely allows larger single objects than it declares. The total memory may be the real limiting factor, e.g., at the level of 3/4 of the declared 2 GB. It probably also depends on other programs that are currently running, including the desktop itself. Anyway, it seems that the original claim of being able to use only 1/3 of the memory is not valid for the particular GPU that I used.
So, the first task for optimization is to actually understand what can be allocated on a particular GPU (or a range of GPUs), how it relates to the driver declarations, and whether anything is mandated by the OpenCL standard. That is not directly related to the ADDA code.
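As a starting point for such an analysis, here is a standalone probe (a sketch, not ADDA code) that bisects the largest buffer a device will actually give out. The buffer is touched with clEnqueueFillBuffer + clFinish after creation, because due to the delayed error reporting mentioned above a successful clCreateBuffer alone proves little. (clEnqueueFillBuffer requires OpenCL 1.2; a blocking clEnqueueWriteBuffer would do on older platforms.)

```c
#include <CL/cl.h>

/* Return 1 if a buffer of the given size can be allocated AND touched. */
static int try_alloc(cl_context ctx, cl_command_queue q, size_t bytes)
{
    cl_int err;
    cl_mem buf = clCreateBuffer(ctx, CL_MEM_READ_WRITE, bytes, NULL, &err);
    if (err != CL_SUCCESS) return 0;
    const cl_uchar zero = 0;
    /* force a real allocation by writing to the whole buffer */
    err = clEnqueueFillBuffer(q, buf, &zero, 1, 0, bytes, 0, NULL, NULL);
    if (err == CL_SUCCESS) err = clFinish(q);
    clReleaseMemObject(buf);
    return err == CL_SUCCESS;
}

/* Bisect the largest working buffer size in (0, hi]. */
size_t probe_max_alloc(cl_context ctx, cl_command_queue q, size_t hi)
{
    size_t lo = 0; /* lower bound only; never actually allocated */
    while (lo < hi) {
        size_t mid = lo + (hi - lo + 1) / 2;
        if (try_alloc(ctx, q, mid)) lo = mid; else hi = mid - 1;
    }
    return lo;
}
```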
If that analysis shows that decreasing the size of a single object (buffer) is beneficial, then the best option, in my opinion, is to separate the Dmatrix into several components, as described at the beginning of this issue. And this can definitely be done rather quickly by a good programmer.
That is indeed interesting. I tried just now with recent drivers under Linux on a GTX 1070 (not used for the desktop). I was actually able to fill the memory completely, with a maximum memory object size of more than half the entire GPU memory:
$ adda_ocl -grid 276 -gpu 1
...
Device memory: total - 8120 MB, maximum object - 2030 MB
...
OpenCL memory usage: peak total - 8006.9 MB, maximum object - 4404.4 MB
The nvidia-smi output for the GPU during this adda run looks like
| 1 GeForce GTX 1070 Off | 00000000:08:00.0 Off | N/A |
| 0% 50C P2 103W / 151W | 8115MiB / 8119MiB | 100% Default |
I just tried a GTX 1080 Ti (11 GiB) with `-grid 300`; that also worked, with peak total - 9892.1 MB, maximum object - 4976.9 MB.
Even though it sort of obsoletes this issue, it is still disturbing that the Nvidia driver returns wrong values for CL_DEVICE_MAX_MEM_ALLOC_SIZE. But it is good that we try to allocate the big buffer anyway, regardless of what sizes the driver reports as supported. I have no AMD GPU at hand to try it, so I cannot update my statement in the first post of this issue.
It might work on Nvidia GPUs right now, but it seems to be driver-dependent. It would probably still be good to split the Dmatrix, to avoid potential issues with the maximum allocation right away.
This thread suggests that Nvidia developers (and sometimes others too) just take the simplest route and report CL_DEVICE_MAX_MEM_ALLOC_SIZE as the minimum value mandated by the standard (which is the larger of 1/4 of the global memory and 128 MB), while modern drivers seem to be more forgiving in this respect. Then, I guess, we can safely ignore the declared limit (you can never be really certain about the GPU memory anyway).
So, to raise the priority of this issue, we need a practical example of a demanding application on a large GPU that suffers from it. On the other hand, if anyone fixes the issue, it will be good for stability (including lower dependence on the driver version).