npbench
FPGA benchmark support
Currently, NPBench seems to only support CPU and GPU. Is there any support for FPGA? Thanks
There is no support for automatically compiling the DaCe versions for FPGA through the run_framework and run_benchmark scripts. However, the capability exists in DaCe if you have the necessary toolchains installed. I will check if we can add experimental support in NPBench and get back to you.
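For reference, here is a minimal sketch (not part of NPBench) of how a DaCe program can be retargeted to FPGA with the FPGATransformSDFG transformation, assuming a supported FPGA toolchain (e.g., Xilinx Vitis or Intel FPGA OpenCL) is installed and configured; the axpy kernel is only a placeholder:

```python
import numpy as np
import dace
from dace.transformation.interstate import FPGATransformSDFG

N = dace.symbol('N')


@dace.program
def axpy(alpha: dace.float64, x: dace.float64[N], y: dace.float64[N]):
    y[:] = alpha * x + y


if __name__ == "__main__":
    sdfg = axpy.to_sdfg()
    # Offload the entire SDFG to the FPGA; DaCe also offers finer-grained FPGA transformations
    sdfg.apply_transformations(FPGATransformSDFG)
    binary = sdfg.compile()  # triggers HLS synthesis (emulation or hardware, per your DaCe config)

    x = np.random.rand(1024)
    y = np.random.rand(1024)
    binary(alpha=2.0, x=x, y=y, N=1024)
```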
thanks
@alexnick83 Is there any experimental code that I can use to reproduce some of the FPGA performance data shown in the paper? Thanks
Yes, apart from the paper's artifact, there are tests in the DaCe repository. In the paper, the samples under polybench were run. Note that the FPGA tests may have some new transformations compared to the paper, but I suppose you are looking for the latest developments.
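As a rough sketch (assuming the Polybench test file exposes a run_* entry point that takes a dace.dtypes.DeviceType, as the cholesky_test.py referenced below does), switching one of those tests to the FPGA target only requires passing the FPGA device type:

```python
import dace
# run_cholesky is defined in DaCe's Polybench cholesky test; the import path
# depends on where the test file sits in your checkout of the DaCe repository.
from cholesky_test import run_cholesky

# Builds, transforms, and validates the benchmark for the FPGA target;
# requires an installed FPGA toolchain.
sdfg = run_cholesky(dace.dtypes.DeviceType.FPGA)
```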
@alexnick83 Thanks for the info. However, when I tried to benchmark the tests under polybench, dace_cpu and dace_gpu turned out to be much slower than NumPy (8-10x slower), for example with the following code (cholesky_test.py). I have precompiled the SDFG with sdfg.compile(). Do you have any suggestions? Thanks very much.
```python
import argparse
import time

import numpy as np
import dace

# run_cholesky, init_data, sizes, and ground_truth are defined in DaCe's Polybench cholesky test

if __name__ == "__main__":
    parser = argparse.ArgumentParser()
    parser.add_argument("-t", "--target", default='cpu', choices=['cpu', 'gpu', 'fpga'],
                        help='Target platform')
    args = vars(parser.parse_args())
    target = args["target"]

    sdfg = None
    if target == "cpu":
        sdfg = run_cholesky(dace.dtypes.DeviceType.CPU)
    elif target == "gpu":
        sdfg = run_cholesky(dace.dtypes.DeviceType.GPU)
    elif target == "fpga":
        sdfg = run_cholesky(dace.dtypes.DeviceType.FPGA)

    N = sizes["medium"]
    A = init_data(N)
    gt_A = np.copy(A)
    sdfg_binary = sdfg.compile()

    # Time 10 calls of the compiled SDFG; report the average per call in ms
    start = time.time()
    for i in range(10):
        sdfg_binary(A=A, N=N)
    end = time.time()
    print("accelerator", target, "time is", (end - start) * 100, "ms")

    # Time 10 calls of the NumPy reference implementation
    start = time.time()
    for i in range(10):
        ground_truth(N, gt_A)
    end = time.time()
    print("numpy time is", (end - start) * 100, "ms")
```
For performance runs on CPU and GPU, I would use NPBench and not the DaCe tests. Their purpose is to keep track of functional regressions in the auto-optimizer for parameters controlled by the CI (for example, whether to use simplify when generating the initial SDFG). Furthermore, the tests use, by default, a very small dataset size so that they finish quickly, so you may be measuring library overheads on some of them. Still, the CPU being 8-10x slower seems strange. From the latest NPBench data (latest master branch), DaCe CPU is 14.3x faster than NumPy, and GPU is 4.6x slower than NumPy, on the same hardware and dataset as in the paper, which matches more or less the results shown.

Another thing to note is that you must have optimized BLAS libraries installed for CPU execution. For example, if a test contains a matrix multiplication but DaCe cannot find MKL (or OpenBLAS), it will generate the equivalent of the naive algorithm, which will run painfully slowly.
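If you do want to keep timing the compiled SDFG directly, a minimal sketch along these lines (reusing the sdfg, init_data, sizes, and ground_truth names from your snippet) separates a one-off warm-up call from the timed loop, so library loading and first-call overheads are not included in the measurement:

```python
import time
import numpy as np

def median_ms(fn, setup, reps=10):
    """Median per-call time in ms; `setup` builds fresh inputs outside the timed region."""
    times = []
    for _ in range(reps):
        args = setup()
        start = time.perf_counter()
        fn(*args)
        times.append((time.perf_counter() - start) * 1000.0)
    return float(np.median(times))

N = sizes["medium"]
A0 = init_data(N)
sdfg_binary = sdfg.compile()
sdfg_binary(A=np.copy(A0), N=N)  # warm-up call: loads libraries, excluded from timing

dace_ms = median_ms(lambda A: sdfg_binary(A=A, N=N), lambda: (np.copy(A0),))
numpy_ms = median_ms(lambda A: ground_truth(N, A), lambda: (np.copy(A0),))
print(f"DaCe: {dace_ms:.3f} ms   NumPy: {numpy_ms:.3f} ms")
```

Using a median over repetitions also makes the numbers less sensitive to OS jitter than a single aggregated wall-clock interval.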
I just ran the modified Cholesky test you posted on my main machine (i7-7700). This is what I got for CPU and different parameters:
automatic_simplification=False
accelerator cpu time is 2.161860466003418 ms
numpy time is 202.2195816040039 ms
automatic_simplification=False, OMP_NUM_THREADS=4
accelerator cpu time is 2.292203903198242 ms
numpy time is 137.15152740478516 ms
automatic_simplification=True
accelerator cpu time is 1.9000768661499023 ms
numpy time is 134.70430374145508 ms
automatic_simplification=True, OMP_NUM_THREADS=4
accelerator cpu time is 1.9898653030395508 ms
numpy time is 136.15012168884277 ms