caldgemm
Portable and Flexible DGEMM Library for GPUs (OpenCL, CUDA, CAL) with special support for HPL
//////////////////////////////////////////////////////////////////////////////////////////////////////////////// Caldgemm Readme, Command Line Options, Performance Optimization Guide, and examples ////////////////////////////////////////////////////////////////////////////////////////////////////////////////
Command Line Options of dgemm_bench: The parameters here are those of DGEMM bench, and the defaults are valid for DGEMM bench. Most parameters translate directly to a CALDGEMM setting; in that case, the relevant CALDGEMM setting with its CALDGEMM default is listed.
Some CALDGEMM settings are only valid for HPL-GPU. In that case, there is usually still a DGEMM bench option to test the parameter. These parameters are marked (HPL-GPU Setting).
CALDGEMM provides 4 backends: CAL, OpenCL, CUDA, and CPU. Some parameters are valid for only one or some of the backends. This is noted as e.g. (CAL Runtime and OpenCL Runtime only).
CALDGEMM has two DMA frameworks: one keeps the C matrix on the GPU (GPU_C = 1), the other keeps the C matrix on the host (GPU_C = 0). This is selected with the -Oc switch. Some parameters are only valid for one or the other case. This is noted as e.g. (GPU_C = 1 only). The CAL Runtime will always use GPU_C = 0, CUDA will always use GPU_C = 1, OpenCL supports both, and for the CPU backend this setting is ignored. In general, GPU_C = 1 should be favored when the GPU is much faster than the CPU (i.e. with a multi-GPU system); GPU_C = 0 is better when GPU and CPU performance do not differ by more than a factor of 4. The GPU_C = 0 option requires preprocessing (DivideBuffer) and postprocessing (MergeBuffer) on the host. Compared to GPU_C = 0, GPU_C = 1 requires half the global host memory bandwidth, but it requires full-duplex DMA transfers instead of the half-duplex transfers of GPU_C = 0.
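For illustration, the two frameworks can be selected directly in DGEMM bench via the -Oc switch. A minimal sketch, assuming the OpenCL backend (which supports both modes) and the benchmark flags used in the examples further below:
# GPU_C = 1: keep the C matrix on the GPU (favored for multi-GPU / fast GPUs)
./dgemm_bench -O 1 -Oc 1 -g -z -p -A -m 40960 -n 40960
# GPU_C = 0: keep the C matrix on the host (favored when CPU and GPU speed are close)
./dgemm_bench -O 1 -Oc 0 -g -z -p -A -m 40960 -n 40960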
-? Display help on command line options.
-e (default: disabled) Verify Computational Correctness. The matrix is copied at the beginning of the computation. Sufficient memory must be available. See -7 for verification of large matrices.
-q (default: disabled) Suppress Display Output in caldgemm. Output from dgemm_bench is still active. See -5 to suppress this.
-a (default: disabled) (CAL Runtime Only) Print the disassembled kernel image
-i (default: disabled) (CAL Runtime and OpenCL Runtime Only) Print IL Kernel used
-if
-o <c|g> (default: 'c') Specify the output location of the kernel: c = CPU, g = GPU. If 'g' is specified, the GPU writes to GPU global memory and an additional DMA transfer fetches the data to the host. In general, 'c' is the faster option. On some systems DMA is slow and 'g' achieves better kernel performance. See -I in combination with the 'g' option!
-I (default: -1 = autodetect) (CAL Runtime Only) Force implicit driver sync. A bug in some AMD drivers prohibits DMA transfers and concurrent kernel execution in certain situations. This slows down caldgemm. A workaround is available that relies on a specific driver behavior and might result in wrong results with newer drivers. It is automatically detected whether your driver suffers from the bug and whether the workaround can be applied. This check does not work for newer driver versions though. -I forces the workaround to be enabled.
-^
-h
-H
-w
-W
-l (default: disabled) Automatically select the tile size for good performance. The -h parameter defines the maximum possible size; -l will use smaller tiles for smaller matrices. Activating this is generally a good idea.
-m
-n
-v (default: disabled) Verbose Synchronous Timing for Single Kernels / Transfers. This disables all asynchronous transfers in caldgemm. Overall performance will be poor. This can be used for directly measuring kernel performance, DMA performance, and pre-/postprocessing performance on the CPU (pre-/postprocessing is only used for some operating modes).
-k (default: disabled) (GPU_C = 0 Only) Print Timing of Asynchronous DGEMM Operation. Used for internal testing.
-r
-R
-y
-Y
-bb
-d (default: disabled) Print lots of debug output
-z (default: disabled) Enable Multithreading. You definitely want to activate this. For some internal reasons, this is a prerequisite for using multiple GPUs. Multithreading means asynchronous processing of pre-/postprocessing (required if GPU_C = 0, see -Oc parameter). In addition, it is required for asynchronous factorization, broadcast, etc. in HPL-GPU.
-Z (default: disabled) Enable Multithreading for DivideBuffer as well. Requires -z. Only valid for multiple GPUs. Use -Gx to set the CPUs for GPU pre-/postprocessing!
-b (default: disabled) Enable internal benchmarking mode. Used for internal testing.
-c (default: disabled) Use CPU for DGEMM. You can supply -g as well to use both CPU and GPU. Supplying neither of them will use GPU only.
-g (default: enabled if and only if -c is disabled) Use GPU for DGEMM. You can supply -c as well to use both CPU and GPU. Supplying neither of them will use GPU only.
-f (default: disabled) Fast Init (Empty Matrices). The matrices are filled with zeros instead of using a random number generator. Initialization is faster. Use for optimization and benchmarking only. The verification does not work with this initialization method, and with newer GPUs the benchmark results are not representative either: multiplication with zeros draws less power, hence the GPU will run in turbo mode constantly, which is not the case for standard random numbers.
-j
-jf
-jm
-jt
-js
-jl
-jp
-jq
-s (default: disabled) Dynamic CPU / GPU scheduling. Do not use only the fixed ratio specified by -j but use a dynamic CPU/GPU workload scheduling. This includes work-stealing, etc. The value provided by -j is the basis for the scheduling.
-M (default: disabled) Disable third phase in dynamic scheduling
-N (default: disabled) Disable second phase in dynamic scheduling
-rr (default: disabled) (HPL-GPU Setting) Rereserve Linpack CPU: HPL-GPU requires one CPU core for the broadcast. This core is not available for CPU DGEMM. CALDGEMM can estimate the broadcast time and then try to split the DGEMM into two parts: one part runs in parallel to the broadcast with one core less, and a second part runs after the broadcast with all cores. This makes sense when you are not GPU dominated and when you do not have too many CPU cores.
-p (default: disabled) Interleaving Memory Policy. GotoBLAS usually activates memory interleaving. This leads to a problem with the CAL library: interleaving should be activated only after memory for the CAL library has been allocated. Thus it is recommended to disable interleaving in GotoBLAS (apply the patch provided with caldgemm and set NO_MEMINTERLEAVE in the GotoBLAS Make.rule) and use -p.
-u (default: disabled) Dump Test Matrix. Used for internal testing only.
-1 (default: disabled) Transpose A Matrix. Provide a transposed input A matrix.
-2 (default: disabled) Transpose B Matrix. Provide a transposed input B matrix.
-3 (default: disabled) Set alpha parameter to 1.0 to test optimized kernel.
-# (default: disabled) Set beta parameter to 0.0 to test optimized memcpy.
-5 (default: disabled) Quiet Benchmark mode (different from quiet caldgemm mode -q). This suppresses output of dgemm_bench. Output of caldgemm is not suppressed. See -q for this.
-6
-4
-7 (default: disabled) Verification for large matrices. Compared to -e this does not require the matrix to be copied. However, the output is less elaborate and it only tells you whether the DGEMM succeeded.
-8 (default: initial run enabled) No initial run to negate cache effects. The first run is usually slower as the kernel must be copied to the GPU, etc. Thus, for benchmarks, an initial run is performed before the actual benchmark run is started. The -8 option omits this initial run. The initial run is automatically deactivated if the -d option or certain others are given. This option is primarily used for debugging.
-9 (default: disabled) Output a table with timing information
-0 (default: disabled) (CAL Runtime only) Write the output of the divideBuffers function directly to the GPU instead of using a separate DMA transfer. This option turned out to not perform well. Better leave it deactivated.
-A (default: disabled) Do the DMA transfer to GPU asynchronously. If you are not debugging, always enable this.
-L (default: disabled) Memory Organisation like in HPL (LINPACK). Do not pack the A, B, C matrices together but use a memory organisation like in HPL, where the matrices are stored in an interleaved fashion.
-C (default: disabled) Call fake LINPACK callback functions. This is used to test the HPL callback implementation. For internal testing only.
-Ca
-P
-T (default: disabled) Allocate memory using huge pages. Turned out not to perform well for some reason. Better leave it deactivated. To activate this feature, shared memory segments with huge pages must be provided.
-B (default: disabled) (CAL Runtime only) Keep DMA Buffers mapped during kernel execution. The Driver Hack is needed for this option. It is only relevant when using "-o c" which, however, is the default value.
-x
--
-t
-ts (default: disabled) Visualize the Thread affinities.
-tr
-K
-Gx
-Ux
-UAx
-UBx
-V
-S (default: not used) Set slow CPU option (see below)
-X (default: disabled) Do not use a round-robin scheduler for multi-GPU but split the matrix along the not-favored direction and process each part by a distinct GPU. This saves BBuffers and is usually faster. This is mandatory for very large matrices.
-Xb
-E (default: 0) Define random seed to use for matrix initialization. Use 0 for time.
-O (default: 0) Define the backend to use. Available options are: 0: CAL, 1: OpenCL, 2: CUDA, 3: CPU only
-Oc
-Ol
-Oe (default: disabled) (OpenCL Runtime only) Do not allow multiple concurrent OpenCL kernels. Some OpenCL devices are slower when they execute multiple DGEMM kernels at the same time. This setting uses OpenCL events to enforce serialization of OpenCL kernels. It does not work well and should not be used. It is better to enforce serialization on the driver side, e.g. on AMD cards via the GPU_NUM_COMPUTE_RINGS=1 env variable.
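As a sketch of the driver-side alternative mentioned above, the environment variable can be set for a single run (the benchmark flags are just the usual baseline from the examples below):
GPU_NUM_COMPUTE_RINGS=1 ./dgemm_bench -O 1 -g -z -p -A -m 40960 -n 40960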
-Oq (default: disabled) (OpenCL Runtime only, CUDA support planned) Use simple GPU queuing for OpenCL. This comes with less overhead, so it is generally better for the GPU, but it is incompatible with GPU_C = 0 (-Oc option). If you use -Oc 1, you should also enable this. This enforces the improved scheduler (-X option).
-Op
-Oa (default: disabled) (OpenCL Runtime Only, CUDA support planned) (HPL-GPU Setting) CALDGEMM can run asynchronous side queues on the GPU to offload other tasks concurrently with DGEMM execution. If this is set, DGEMM bench creates an async side queue and uses this queue to test a single-tile DGEMM.
-Ox (default: disabled) (OpenCL Runtime Only) Do not put the CPU in the OpenCL context. This can save some OpenCL-internal buffer space. Some OpenCL runtimes fail to allocate the large buffers required for -Oc 1. You should try whether it works; if yes, fine, if not, disable it.
-Ot (default: disabled) Use the 3rdPartyTranspose kernel for matrix transposition, which is provided by a 3rd-party external library (see -Ol setting).
-F (default: 0) Define OpenCL Platform ID to use.
-J
-Q (default: disabled) Wait for pressing a key before exiting
-! (default: disabled) Do not use page locked memory
-_ (default: disabled) (OpenCL Runtime and CUDA Runtime only) Allocate memory using the GPU runtime library (e.g. OpenCL) instead of malloc. This is required for using GPU_C = 1 (-Oc 1 option) in combination with -o c. In general, it is usually faster with GPU_C = 1 regardless of whether -o g or -o c is used. Some drivers do not support this properly.
-=
-% (default: disabled) Skip CPU Pre- and Postprocessing. Leads to incorrect results. For internal testing only
-@ (default: disabled)
Comma- or semicolon-separated list of CPU cores to exclude. This is useful if you run something in parallel
to CALDGEMM, or if you have a Bulldozer or HyperThreading CPU and you want to disable all even- or all
odd-numbered cores. In general, it is a good idea to disable HyperThreading for CALDGEMM.
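For example, on a hypothetical CPU with 16 logical cores where the odd-numbered logical cores are the HyperThreading siblings (the numbering is an assumption and differs between systems), the odd cores could be excluded like this:
./dgemm_bench -c -g -z -s -p -A -m 40960 -n 40960 -@ 1,3,5,7,9,11,13,15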
-. (default: disabled) (CAL Runtime only) Repin Main Thread During Active Wait for GPU Event. This is a workaround required for the CAL Runtime on Sandy-Bridge-E CPUs. It costs performance, so only enable when needed.
-~ (default: disabled) Always repin main thread. This is an alternate workaround for Sandy-Bridge-E CPUs (see -. option)
-,
-: (default: disabled) Enable NUMA Pinning. This tries to distribute all employed CPU threads evenly among the NUMA nodes. It has little effect and does not always work, but practically never has a negative effect.
-/ (default: disabled)
Comma- or semicolon-separated list of GPU devices to use (replaces -y for multiple devices).
Usually, -Y 3 will use GPU devices 0, 1, and 2, while -y 3 will use only GPU device 3.
This gives more fine-grained control over which GPU devices to use. On NUMA systems, it can be beneficial to
interleave devices on different NUMA nodes.
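For example, on a hypothetical dual-socket system with GPUs 0 and 1 attached to NUMA node 0 and GPUs 2 and 3 attached to node 1 (the mapping is an assumption; compare the quad-GPU examples below), the devices can be interleaved across the nodes:
./dgemm_bench -g -z -p -A -Y 4 -/ 0,2,1,3 -m 40960 -n 40960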
-*
-[
-]
Other CALDGEMM Options:
The CALDGEMM config allows the SlowCPU option, which should be used when the CPU is comparatively slow compared to the GPU. It deactivates 2nd and 3rd phase runs and adjusts the tiling size to minimize the 1st phase CPU run.
////////////////////////////////////////////////////////////////////////////////////////////////////////////////
Performance Optimization Guide:
To achieve good performance, multiple steps should be performed:
0. Update the settings for the GPU used.
1. Optimize kernel performance.
2. Optimize system performance of GPU DGEMM (including DMA transfer, post-/preprocessing).
3. Optimize combined GPU/CPU performance.
4. Optimize multi-GPU performance.
If you have multiple GPUs, it is better to do the following with a single GPU first and try multi-GPU afterwards (step 4). Until then, add -y 0 to each of the following command lines.
In principle, you should try to achieve the following performance: The kernel performance dictates the final performance. Kernel performance is usually 80%-90% of the theoretical peak performance of the GPU. The CAL kernel should achieve 574 GFLOPS with a 5870 GPU, 623 GFLOPS with a 6970 GPU, and 805 GFLOPS with a 7970 GPU, to give a rough overview.
Going from single-GPU kernel performance to single-GPU system performance, you should expect a loss of 1%-3%. Scaling to multi-GPU should be almost perfect for 2 GPUs (less than 2% loss), and for 4 GPUs you should expect less than 4% loss.
If you then go to HPL, a rough guideline is that HPL should achieve 7%-15% less GFLOPS than DGEMM, while multi-node HPL will encounter an additional 5%-10% loss.
The following procedure is mostly for CAL. Additional suggestions for OpenCL and CUDA follow later. Still, many aspects of the CAL guide are also valid for OpenCL / CUDA.
Some general remarks at the beginning: CALDGEMM by default uses pinned host memory, which cannot be swapped. It might be necessary to set ulimits accordingly: ulimit -m unlimited; ulimit -l unlimited; ulimit -v unlimited;
Some GPUs throttle themselves during DGEMM execution. For AMD GPUs, you can use the "atitweak" python utility to modify the GPU PowerTune setting (atitweak -p) to overcome this. Keep in mind that this might run the GPU out of specs, so it can damage your hardware if done incorrectly. This is at your own risk. You should at least monitor the temperature constantly if doing so.
////////////////////////////////////////////////////////////////////////////////////////////////////////////////
Step 0: Different GPUs require different settings for optimal performance.
Especially the splitting ratio calculation may not work correctly. Always keep an eye on the GPU time and the CPU time. If one of them is higher than the other, adjust the -j ratio. This is also relevant for the 5000 series due to different clock speeds.
CALDGEMM comes with Assembler GPU DGEMM kernels for the CAL runtime. Depending on the particular GPU used, the options in caldgemm_config.h should be adjusted for optimal DGEMM performance.
For the 5xxx series, the following is suggested: Enable exactly CALDGEMM_TRANSPOSED_B and CALDGEMM_44 as DGEMM kernel settings in caldgemm_config.h. For the 5xxx series, h can be chosen almost arbitrarily but is suggested to be at least 1024. The 5xxx series works well both with -o g and -o c.
For the 6xxx series, the following configuration is suggested: Enable CALDGEMM_TRANSPOSED_B and CALDGEMM_44. It is best to enable the CALDGEMM_44_BT_64 and CALDGEMM_44_BT_64_CONVERT options in caldgemm_config.h. h = 2304 performs best. Use -o c in any case! Make sure that implicit driver sync works (-I), or use the DMA fetch queue (-^).
For the 7xxx series, please enable the following settings (default): CALDGEMM_TRANSPOSED_A, CALDGEMM_44, CALDGEMM_DUAL_ENTRY, CALDGEMM_LATE_EXIT_CONDITION, CALDGEMM_SHIFT_TEXTURE 1. h = 3072 works well. -o g usually works better than -o c.
In general, it is no longer suggested to use CAL. OpenCL and CUDA are the better options. OpenCL comes only with a reference kernel, but it supports loading an optimized kernel from a 3rd-party library; this is the suggested way. CUDA also comes only with a reference kernel so far; this should be changed to CUBLAS in the future.
////////////////////////////////////////////////////////////////////////////////////////////////////////////////
Step 1: The kernel performance should be good out of the box. Most kernel parameters cannot be changed via the command line but only at compile time in caldgemm_config.h. Usually the parameters are fine as they are.
Run a "./dgemm_bench -v" to check the kernel performance. The kernel will usually write its output to host memory.
Some systems have poor DMA performance. You can try to direct the output to GPU memory and see whether kernel performance gets better. Run "./dgemm_bench -o g -v" for this. If the second option is better, always use "-o g". For OpenCL and for the 7xxx AMD series and above, -o g is suggested in general.
////////////////////////////////////////////////////////////////////////////////////////////////////////////////
Step 2: Optimize System performance
First check whether DMA is working well. Run "./dgemm_bench -o g -v" and look at the copy speeds from and to the device. (-o g is required here to measure the PCIe speed.) Anything above 5 GB/s should be fine. If the speed is lower, the GPU threads are probably pinned to a wrong CPU core on NUMA architectures. You can alter the CPU core with the -t option. Try "./dgemm_bench -o g -v -t 0", "./dgemm_bench -o g -v -t 1", etc. to find the best CPU core. Using a CPU core other than zero can lead to problems when using GPU/CPU combined DGEMM.
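Instead of typing each pinning by hand, a small shell loop can sweep the candidate cores (a sketch; adjust the core range to your CPU and compare the reported transfer speeds):
# try each candidate core for the GPU thread and note the copy speeds from/to the device
for core in 0 1 2 3 4 5 6 7; do
    ./dgemm_bench -o g -v -t $core
done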
Test your system's GPU DGEMM performance. The parameters you definitely want to have are: -z (multithreading), -p (memory interleaving), -A (asynchronous DMA transfer). Run "./dgemm_bench -z -p -A -m 40960 -n 40960"
This part is only relevant if you found you want to use "-o g" in Step 1: There is a DMA problem in the AMD driver that can be overcome by a workaround. Usually it is autodetected whether the workaround can and must be applied. Still, you had better recheck by hand. You can force the workaround using the -I parameter. Rerun the above test: "./dgemm_bench -z -p -A -m 40960 -n 40960 -o g -I". If the performance is better, you have to check whether the results are correct. The workaround will only work with some drivers and might produce false results with others. To verify, run: "./dgemm_bench -z -p -A -m 40960 -n 40960 -o g -I -e"
This part is only relevant if you found you want to use "-o c" in Step 1: Use the AMD driver hack. Apply the hack and then use the "-B" parameter. Run "./dgemm_bench -z -p -A -B -m 40960 -n 40960". You'll see a warning if the hack was not applied correctly. Performance is not necessarily better than without "-B" but the CPU load is decreased. You'll see the difference when using combined CPU/GPU DGEMM.
If you have an Ivy-Bridge system with CAL runtime, add -. option.
On Intel systems, you can usually restrict to one output thread with the -= 1 option.
If you have much more GPU power than CPU power, -J 1 is suggested, and perhaps disable dynamic CPU/GPU scheduling (no -s).
You can interleave GPUs among NUMA nodes with the -/ setting (see the quad-GPU 7xxx series example below).
////////////////////////////////////////////////////////////////////////////////////////////////////////////////
Step 3: Optimize Overall performance.
First check the possible CPU performance: "./dgemm_bench -c -z -p -m 40960 -n 40960". Then do a combined CPU/GPU run: "./dgemm_bench -c -g -l -s -p -z -A -m 40960 -n 40960". Use the "-o g", "-I", and "-B" parameters as determined in steps 1 and 2. The performance should be better than in step 2.
You can alter the CPU/GPU ratio using the "-j" parameter. Try to tune it such that the GPU and CPU DGEMM times are equal. It is better to set -j rather high, as the dynamic scheduler will compensate this with a work-stealing algorithm. If you see many 3rd-phase runs in the caldgemm output, then "-j" is possibly too big.
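A simple way to narrow down the ratio is to sweep a few -j values and compare the reported GPU and CPU DGEMM times (a sketch; the flag set is the combined run from above, and the values are only starting points):
# decrease -j step by step; stop when GPU and CPU times are roughly equal
for j in 1.0 0.97 0.94 0.91; do
    ./dgemm_bench -c -g -l -s -p -z -A -m 40960 -n 40960 -j $j
done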
If the AMD driver hack is not available, you might get better combined performance by using "-o g" (follow the appropriate instructions in step 2 as well).
////////////////////////////////////////////////////////////////////////////////////////////////////////////////
Step 4: There is little you can do to optimize multi-GPU performance. You have to determine the CPU core for each GPU independently. Repeat this part of step 2. Use -y 0, -y 1, -y 2, etc. to optimize each GPU. Finally use -G0 ? -G1 ? -G2 ? and insert the optimal CPU core you obtained for each GPU.
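A possible sequence for a hypothetical dual-GPU system (the cores 0 and 12 used in the last command are placeholders for the cores found in the first step):
# determine the best DMA core for each GPU separately (repeat with -t 1, -t 2, ...)
./dgemm_bench -y 0 -o g -v -t 0
./dgemm_bench -y 1 -o g -v -t 0
# then combine the per-GPU results in the multi-GPU run
./dgemm_bench -g -z -p -A -m 89088 -n 89088 -G0 0 -G1 12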
The next step is tuning the -Ux settings. Try whether Parallel DMA mode and grouped DMA mode yield a benefit.
First try to run without CPU. From now on omit the "-y 0". The performance should scale almost linearly with multi-GPU.
You can try the -X and -Z options. They usually increase performance for 3 GPUs or more. You might also want to increase w. w = 1536 or w = 2048 can achieve good performance. For larger w a smaller h is suggested. Try h = 3072 for instance.
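For example, the effect of a larger w combined with the smaller h suggested above can be checked with two direct runs (a sketch using the flags and matrix sizes from the multi-GPU example below):
./dgemm_bench -g -z -p -A -X -Z -w 1536 -h 3072 -m 89088 -n 89088
./dgemm_bench -g -z -p -A -X -Z -w 2048 -h 3072 -m 89088 -n 89088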
If you have good multi-GPU performance, try to use the CPU as well. You might need to change the -j value. It is best to start with -j 1 (almost all work is done by the GPU) and then decrease j step by step until you see optimal performance (-j 0.97 ... -j 0.94 ... -j 0.91).
////////////////////////////////////////////////////////////////////////////////////////////////////////////////
Guidelines for OpenCL / CUDA: The CUDA part is not fully implemented yet. This guide is written as if it were fully integrated; feel free to implement the missing features for CUDA yourself :|.
The most important thing for OpenCL is the 3rd-party library for the DGEMM kernel. CALDGEMM itself comes only with an unoptimized reference implementation. There is a sample 3rd-party library with a template that shows how such a library has to work. In caldgemm_config.h there are also some options to tweak the integrated OpenCL kernels' performance. Important aspects here are ENABLE_TILED_KERNEL and DISABLE_SIMPLE_BUFFERS, but performance will anyway be much lower than with a proper 3rd-party kernel.
In general, you should try to use OpenCL with GPU_C = 1. It is almost always better. Only in the case of a comparatively fast CPU (like 2 * 12 core CPUs + a slow GPU like the 5870) is the GPU_C = 0 option possibly faster. In general, GPU_C = 0 works better with CAL, which is usually around 5% faster than OpenCL. So if you want to test this, CAL is probably the way to go (although no longer supported for newer GPUs).
OpenCL with GPU_C = 0 setting has almost identical behavior as CAL, so please follow the above guide. The following refers to OpenCL with GPU_C = 1 and to CUDA.
OpenCL with GPU_C = 1 will transfer tiles of the C matrix completely to the GPU using strided submatrix transfers. There are no intermediate host buffers.
Therefore, there are no pre-/postprocessing threads on the CPU.
Due to the nature of GPU_C = 1, the GPU pinning has practically no influence (except perhaps for device-API-internal buffers, which can be pinned to one or the other NUMA node). Hence, it makes sense to set the -UAx settings as described above for the -Gx settings, as it comes at zero cost. But it is not really necessary; it works well without.
In general, you will want to use device-runtime-allocated memory. It is usually much faster than plain malloc. The -o c setting enforces device-runtime-allocated memory for OpenCL in any case. For this you need the -_ option. Be aware that some OpenCL drivers have problems allocating the large buffers required. If this leads to memory allocation problems, you should first try to fix this driver issue before you start to disable device-runtime-allocated memory.
The baseline for OpenCL will thus be something like
./dgemm_bench -O 1 -Oc 1 -o g -_ -Ol my_opencl_3rd_party_lib.so -w 1920 -h 3072 -UAx... -A -c -z -X -p -m ... -n ...
The most relevant optimization settings are: -Oq (enable simple queuing, almost always faster), -J 1 (enable small tiles), -bb ? (choose the correct number of BBuffers), -Op ? (choose the correct preallocation setting), -Ox (exclude the CPU from the context), -Ot (improved transposition kernel), -Xb 1/2 (improved scheduler balancing).
Of course you do not need -X -Xb if you use only a single GPU
////////////////////////////////////////////////////////////////////////////////////////////////////////////////
Examples:
Measure kernel and PCIe performance:
./dgemm_bench -o g -v
Run GPU-only DGEMM:
./dgemm_bench -z -p -A -B -m 40960 -n 40960
./dgemm_bench -z -p -A -o g -I -m 40960 -n 40960
Run CPU/GPU DGEMM:
./dgemm_bench -c -g -z -s -l -p -A -B -y -1 -j -1 -m 40960 -n 40960
./dgemm_bench -c -g -z -s -l -p -A -o g -I -y -1 -j -1 -m 40960 -n 40960
Run Multi-GPU/CPU DGEMM:
./dgemm_bench -c -g -z -s -l -p -A -B -m 89088 -n 89088 -X -Z -w 1536 -h 3072 -G0 0 -G1 0 -G2 12 -j 0.91
Example of quad-GPU without CPU, 6xxx series (2 * 12 core AMD Magny-Cours system, GPUs 0,1 connected to NUMA node 0 (cores 0-11)):
./dgemm_bench -g -A -Z -X -p -w 2048 -h 2304 -o g -I 1 -4 123000 -G0 0 -G1 12 -G2 1 -G3 13 -U0 2 -U1 14 -U2 4 -U3 16 -UA0 0 -UA1 12 -UA2 0 -UA3 12 -K 0 -z -= 2 -/ 0,2,1,3 -J 1
Example of quad-GPU+CPU 7xxx series DGEMM (2 * 8 core NUMA Ivy-Bridge system, GPUs 0,1 connected to NUMA node 0 (cores 0-7)):
./dgemm_bench -g -A -Z -X -p -w 1920 -h 3072 -o g -I 1 -4 123000 -G0 0 -G1 8 -G2 0 -G3 8 -U0 1 -U1 9 -U2 2 -U3 10 -UA0 0 -UA1 8 -UA2 0 -UA3 8 -K 0 -z -= 1 -. -/ 0,2,1,3 -c -j 0.955 -J 1
Example as above, with Parallel DMA mode:
./dgemm_bench -g -A -Z -X -p -w 1920 -h 3072 -o g -I 1 -4 123000 -G0 0 -G1 8 -G2 0 -G3 8 -U0 1 -U1 9 -U2 2 -U3 10 -UA0 0 -UA1 8 -UA2 0 -UA3 8 -UB1 8 -UB2 4 -UB3 12 -K 0 -z -= 1 -. -/ 0,2,1,3 -* 1000000 -c -j 0.955 -J 1
Example as above, with grouped DMA mode:
./dgemm_bench -g -A -Z -X -p -w 1920 -h 3072 -o g -I 1 -4 123000 -G0 0 -G1 8 -G2 0 -G3 8 -U0 1 -U1 9 -U2 2 -U3 10 -UA0 0 -UA1 8 -UA2 0 -UA3 8 -UB1 8 -UB2 4 -UB3 12 -K 0 -z -= 1 -. -/ 0,2,1,3 -* 1000000 -[ 1000000 -c -j 0.955 -J 1
GPU-only torture test for device 0:
./dgemm_bench -y 0 -- 100
GPU/CPU torture test:
./dgemm_bench -- 100 -c -g -s -z -p
Single GPU with OpenCL:
./dgemm_bench -O 1 -w 1920 -h 2976 -_ -Ol amddgemm_hawai.so -A -z -p -g -o g -6 20
Single GPU with OpenCL and advanced options:
./dgemm_bench -O 1 -w 1920 -h 2976 -_ -Ol amddgemm_hawai.so -A -z -p -g -J 1 -: -o g -6 20 -Oq -bb 15 -Op 20 -Ox -Ot
Multi-GPU with OpenCL (2 * 10 core NUMA system):
./dgemm_bench -O 1 -w 1920 -h 2976 -_ -Ol amddgemm_hawai.so -A -z -p -X -Xb 2 -g -J 1 -: -UA0 0 -UA1 10 -UA2 0 -UA3 10 -K 0 -/ 0,2,1,3 -o g -6 58 -Oq -bb 15 -Op 60 -Ox -Ot -Y 4
Multi-GPU + CPU with OpenCL:
./dgemm_bench -O 1 -w 1920 -h 2976 -_ -Ol amddgemm_hawai.so -A -z -p -X -Xb 2 -g -J 1 -: -UA0 0 -UA1 10 -UA2 0 -UA3 10 -K 0 -/ 0,2,1,3 -o g -6 58 -Oq -bb 15 -Op 60 -Ox -Ot -Y 4 -c -j 0.972
Since the full list of parameters can be a bit overwhelming, here follows a list of common parameters required for good performance:
General Parameters: -? -e -o -h -w -l -m -n -v -R -y -Y -bb -d -z -c -g -f -j -p -1 -2 -4 -6 -K -X -Xb -O -J -: -/ -@ -]
Parameters for CAL / GPU_C = 0: -I -^ -Z -s -M -N -rr -B -Gx -Ux -UAx -UBx -. -= -* -[
Parameters for OpenCL / CUDA / GPU_C = 1: -tr -UAx -Oc -Oq -Ol -Op -Ox -Ot -F -_
Parameters for HPL: -Ca -Oa
////////////////////////////////////////////////////////////////////////////////////////////////////////////////