npbench
FPGA benchmark support
Currently, NPBench seems to only support CPU and GPU. Is there any support for FPGA? Thanks
There is no support for automatically compiling the DaCe versions for FPGA through the run_framework and run_benchmark scripts. However, the capability exists in DaCe if you have the necessary toolchains installed. I will check if we can add experimental support in NPBench and get back to you.
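For reference, here is a minimal sketch (not part of NPBench) of how a DaCe program can be retargeted to FPGA with the FPGATransformSDFG transformation, assuming a supported FPGA toolchain (e.g., Xilinx Vitis or Intel FPGA OpenCL) is installed and configured; the axpy kernel is only a placeholder:

```python
import numpy as np
import dace
from dace.transformation.interstate import FPGATransformSDFG

N = dace.symbol('N')


@dace.program
def axpy(alpha: dace.float64, x: dace.float64[N], y: dace.float64[N]):
    y[:] = alpha * x + y


if __name__ == "__main__":
    sdfg = axpy.to_sdfg()
    # Offload the entire SDFG to the FPGA; DaCe also offers finer-grained FPGA transformations
    sdfg.apply_transformations(FPGATransformSDFG)
    binary = sdfg.compile()  # triggers HLS synthesis (emulation or hardware, per your DaCe config)

    x = np.random.rand(1024)
    y = np.random.rand(1024)
    binary(alpha=2.0, x=x, y=y, N=1024)
```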
thanks
@alexnick83 Is there any experimental code that I can use to reproduce some of the FPGA performance data shown in the paper? Thanks
Yes, apart from the paper's artifact, there are tests in the DaCe repository. In the paper, the samples under polybench were run. Note that the FPGA tests may have some new transformations compared to the paper, but I suppose you are looking for the latest developments.
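As a rough sketch (assuming the Polybench test file exposes a run_* entry point that takes a dace.dtypes.DeviceType, as the cholesky_test.py referenced below does), switching one of those tests to the FPGA target only requires passing the FPGA device type:

```python
import dace
# run_cholesky is defined in DaCe's Polybench cholesky test; the import path
# depends on where the test file sits in your checkout of the DaCe repository.
from cholesky_test import run_cholesky

# Builds, transforms, and validates the benchmark for the FPGA target;
# requires an installed FPGA toolchain.
sdfg = run_cholesky(dace.dtypes.DeviceType.FPGA)
```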
@alexnick83 Thanks for the info. However, when I tried to benchmark the tests under polybench, dace_cpu and dace_gpu turned out to be much slower than NumPy (8-10x slower), for example with the following code (cholesky_test.py). I have precompiled the SDFG with sdfg.compile(). Do you have any suggestions? Thanks very much.
```python
import argparse
import time

import numpy as np
import dace

# run_cholesky, init_data, sizes, and ground_truth are defined in DaCe's Polybench cholesky test

if __name__ == "__main__":
    parser = argparse.ArgumentParser()
    parser.add_argument("-t", "--target", default='cpu', choices=['cpu', 'gpu', 'fpga'],
                        help='Target platform')
    args = vars(parser.parse_args())
    target = args["target"]

    sdfg = None
    if target == "cpu":
        sdfg = run_cholesky(dace.dtypes.DeviceType.CPU)
    elif target == "gpu":
        sdfg = run_cholesky(dace.dtypes.DeviceType.GPU)
    elif target == "fpga":
        sdfg = run_cholesky(dace.dtypes.DeviceType.FPGA)

    N = sizes["medium"]
    A = init_data(N)
    gt_A = np.copy(A)
    sdfg_binary = sdfg.compile()

    # Time 10 calls of the compiled SDFG; report the average per call in ms
    start = time.time()
    for i in range(10):
        sdfg_binary(A=A, N=N)
    end = time.time()
    print("accelerator", target, "time is", (end - start) * 100, "ms")

    # Time 10 calls of the NumPy reference implementation
    start = time.time()
    for i in range(10):
        ground_truth(N, gt_A)
    end = time.time()
    print("numpy time is", (end - start) * 100, "ms")
```
For performance runs on CPU and GPU, I would use NPBench and not the DaCe tests. Their purpose is to keep track of functional regressions in the auto-optimizer for parameters controlled by the CI (for example, whether to use simplify when generating the initial SDFG). Furthermore, the tests use, by default, a very small dataset size so that they finish quickly, so you may be measuring library overheads on some of them. Still, the CPU being 8-10x slower seems strange. From the latest NPBench data (latest master branch), DaCe CPU is 14.3x faster than NumPy, and GPU is 4.6x slower than NumPy, on the same hardware and dataset as in the paper, which matches more or less the results shown.

Another thing to note is that you must have optimized BLAS libraries installed for CPU execution. For example, if a test contains a matrix multiplication but DaCe cannot find MKL (or OpenBLAS), it will generate the equivalent of the naive algorithm, which will run painfully slowly.
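If you do want to keep timing the compiled SDFG directly, a minimal sketch along these lines (reusing the sdfg, init_data, sizes, and ground_truth names from your snippet) separates a one-off warm-up call from the timed loop, so library loading and first-call overheads are not included in the measurement:

```python
import time
import numpy as np

def median_ms(fn, setup, reps=10):
    """Median per-call time in ms; `setup` builds fresh inputs outside the timed region."""
    times = []
    for _ in range(reps):
        args = setup()
        start = time.perf_counter()
        fn(*args)
        times.append((time.perf_counter() - start) * 1000.0)
    return float(np.median(times))

N = sizes["medium"]
A0 = init_data(N)
sdfg_binary = sdfg.compile()
sdfg_binary(A=np.copy(A0), N=N)  # warm-up call: loads libraries, excluded from timing

dace_ms = median_ms(lambda A: sdfg_binary(A=A, N=N), lambda: (np.copy(A0),))
numpy_ms = median_ms(lambda A: ground_truth(N, A), lambda: (np.copy(A0),))
print(f"DaCe: {dace_ms:.3f} ms   NumPy: {numpy_ms:.3f} ms")
```

Using a median over repetitions also makes the numbers less sensitive to OS jitter than a single aggregated wall-clock interval.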
I just ran the modified Cholesky test you posted on my main machine (i7-7700). This is what I got for CPU and different parameters:
automatic_simplification=False
accelerator cpu time is 2.161860466003418 ms
numpy time is 202.2195816040039 ms
automatic_simplification=False, OMP_NUM_THREADS=4
accelerator cpu time is 2.292203903198242 ms
numpy time is 137.15152740478516 ms
automatic_simplification=True
accelerator cpu time is 1.9000768661499023 ms
numpy time is 134.70430374145508 ms
automatic_simplification=True, OMP_NUM_THREADS=4
accelerator cpu time is 1.9898653030395508 ms
numpy time is 136.15012168884277 ms