Performance drop on GPU when casting to float

Open dionhaefner opened this issue 8 years ago • 6 comments

Consider the following benchmark:

import time

import numpy as np

a = np.random.rand(20000, 20000)

def bench():
    b = a ** 2
    c = a * b
    d = c * np.sum(a, axis=1)[:, None] - a ** 3 + 17.2
    e = np.sum(a + b + c + d)
    return e

while True:
    start = time.time()
    res = bench()
    try:
        np.flush()
    except AttributeError:
        pass
    end = time.time()
    print("result: {:.2e}; time: {:.2f}s".format(float(res), end-start))

With the OpenCL backend, I get a throughput of up to 100 GFLOPS. However, if I change the definition to

def bench():
    b = a ** 2
    c = a * b
    d = c * np.sum(a, axis=1)[:, None] - a ** 3 + 17.2
    e = np.sum(a + b + c + d)
    return float(e)

performance degrades to a mere 5 GFLOPS without any additional warnings.

dionhaefner avatar Oct 26 '17 21:10 dionhaefner

The difference in terms of kernels and traces here is that in the top version some of the intermediate arrays (a4, a5, and a10) are correctly freed within Trace 0, whereas in the bottom version they are only freed in Trace 1.

We will have to look into why this is happening, as the arrays are clearly temporary in both cases and should thus be freed.

TOP

Trace 0 (syncs: a0):
BH_RANDOM a1[0:40000:1] {.start = 0, .key = 1109971011}
BH_IDENTITY a2[0:200:200,0:200:1] a1[0:200:200,0:200:1]
BH_DIVIDE a3[0:200:200,0:200:1] a2[0:200:200,0:200:1] 1.84467440737095516e+19
BH_FREE a2[0:40000:1]
BH_FREE a1[0:40000:1]
BH_MULTIPLY a4[0:200:200,0:200:1] a3[0:200:200,0:200:1] a3[0:200:200,0:200:1]
BH_MULTIPLY a5[0:200:200,0:200:1] a3[0:200:200,0:200:1] a4[0:200:200,0:200:1]
BH_ADD_REDUCE a6[0:200:1] a3[0:200:200,0:200:1] 1
BH_MULTIPLY a7[0:200:200,0:200:1] a5[0:200:200,0:200:1] a6[0:200:1,0:200:0]
BH_FREE a6[0:200:1]
BH_POWER a8[0:200:200,0:200:1] a3[0:200:200,0:200:1] 3.00000000000000000e+00
BH_SUBTRACT a9[0:200:200,0:200:1] a7[0:200:200,0:200:1] a8[0:200:200,0:200:1]
BH_FREE a7[0:40000:1]
BH_FREE a8[0:40000:1]
BH_ADD a10[0:200:200,0:200:1] a9[0:200:200,0:200:1] 1.71999999999999993e+01
BH_FREE a9[0:40000:1]
BH_ADD a11[0:200:200,0:200:1] a3[0:200:200,0:200:1] a4[0:200:200,0:200:1]
BH_ADD a12[0:200:200,0:200:1] a11[0:200:200,0:200:1] a5[0:200:200,0:200:1]
BH_FREE a11[0:40000:1]
BH_ADD a13[0:200:200,0:200:1] a12[0:200:200,0:200:1] a10[0:200:200,0:200:1]
BH_FREE a12[0:40000:1]
BH_ADD_REDUCE a14[0:200:1] a13[0:200:200,0:200:1] 1
BH_ADD_REDUCE a0[0:1:1] a14[0:200:1] 0
BH_FREE a14[0:200:1]
BH_FREE a13[0:40000:1]
BH_FREE a4[0:40000:1]
BH_FREE a5[0:40000:1]
BH_FREE a10[0:40000:1]

Trace 1 (syncs:):
BH_FREE a0[0:1:1]
BH_FREE a3[0:40000:1]

BOTTOM

Trace 0 (syncs: a0):
BH_RANDOM a1[0:40000:1] {.start = 0, .key = 2282337539}
BH_IDENTITY a2[0:200:200,0:200:1] a1[0:200:200,0:200:1]
BH_DIVIDE a3[0:200:200,0:200:1] a2[0:200:200,0:200:1] 1.84467440737095516e+19
BH_FREE a2[0:40000:1]
BH_FREE a1[0:40000:1]
BH_MULTIPLY a4[0:200:200,0:200:1] a3[0:200:200,0:200:1] a3[0:200:200,0:200:1]
BH_MULTIPLY a5[0:200:200,0:200:1] a3[0:200:200,0:200:1] a4[0:200:200,0:200:1]
BH_ADD_REDUCE a6[0:200:1] a3[0:200:200,0:200:1] 1
BH_MULTIPLY a7[0:200:200,0:200:1] a5[0:200:200,0:200:1] a6[0:200:1,0:200:0]
BH_FREE a6[0:200:1]
BH_POWER a8[0:200:200,0:200:1] a3[0:200:200,0:200:1] 3.00000000000000000e+00
BH_SUBTRACT a9[0:200:200,0:200:1] a7[0:200:200,0:200:1] a8[0:200:200,0:200:1]
BH_FREE a7[0:40000:1]
BH_FREE a8[0:40000:1]
BH_ADD a10[0:200:200,0:200:1] a9[0:200:200,0:200:1] 1.71999999999999993e+01
BH_FREE a9[0:40000:1]
BH_ADD a11[0:200:200,0:200:1] a3[0:200:200,0:200:1] a4[0:200:200,0:200:1]
BH_ADD a12[0:200:200,0:200:1] a11[0:200:200,0:200:1] a5[0:200:200,0:200:1]
BH_FREE a11[0:40000:1]
BH_ADD a13[0:200:200,0:200:1] a12[0:200:200,0:200:1] a10[0:200:200,0:200:1]
BH_FREE a12[0:40000:1]
BH_ADD_REDUCE a14[0:200:1] a13[0:200:200,0:200:1] 1
BH_ADD_REDUCE a0[0:1:1] a14[0:200:1] 0
BH_FREE a14[0:200:1]
BH_FREE a13[0:40000:1]

Trace 1 (syncs:):
BH_FREE a0[0:1:1]
BH_FREE a4[0:40000:1]
BH_FREE a5[0:40000:1]
BH_FREE a10[0:40000:1]
BH_FREE a3[0:40000:1]

omegahm avatar Oct 26 '17 21:10 omegahm

It is not really a bug.

When calling float(e) you are syncing e to numpy, but the other arrays b, c, d are still in scope at that point, which makes them impossible to temporary-array-eliminate. Notice that this is not a problem in the first version, since b, c, d are deleted when the function returns.

Unless we do some code transformation magic, we cannot really fix this problem :/
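To illustrate the lifetime argument above, here is a minimal sketch in plain Python. It does not use Bohrium at all: `FakeArray`, `alive`, and `bench_sync_inside` are hypothetical names made up for this illustration, and the result relies on CPython's reference counting. It only shows which local names are still referenced at the moment a sync would happen.

```python
import weakref


class FakeArray:
    """Hypothetical stand-in for a lazily evaluated array; not a Bohrium class."""


def alive(refs):
    """Names of intermediates that still have a live Python reference."""
    return [name for name, ref in refs.items() if ref() is not None]


def bench(refs):
    # Mirrors the first version: the conversion happens after returning.
    b, c, d = FakeArray(), FakeArray(), FakeArray()
    refs.update(b=weakref.ref(b), c=weakref.ref(c), d=weakref.ref(d))
    e = FakeArray()
    return e


def bench_sync_inside(refs):
    # Mirrors the second version: the sync point (float(e)) is reached
    # while b, c, d are still local variables of the function.
    b, c, d = FakeArray(), FakeArray(), FakeArray()
    refs.update(b=weakref.ref(b), c=weakref.ref(c), d=weakref.ref(d))
    e = FakeArray()
    return alive(refs)  # stand-in for the moment float(e) would sync


refs_top = {}
res = bench(refs_top)
alive_top = alive(refs_top)  # sync would happen here, after bench() returned

refs_bottom = {}
alive_bottom = bench_sync_inside(refs_bottom)

print("still referenced at sync, conversion outside:", alive_top)
print("still referenced at sync, conversion inside: ", alive_bottom)
```

In the first case nothing but the result is referenced when the sync happens, so a lazy runtime is free to treat the intermediates as temporaries. In the second case all three are still reachable.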

madsbk avatar Oct 30 '17 15:10 madsbk

Why this dramatic performance drop, though? I am eventually syncing e to numpy either way, so it shouldn't matter when that happens, right?

dionhaefner avatar Oct 30 '17 16:10 dionhaefner

In the first case, b, c, and d are never allocated because Bohrium detects that they are temporary arrays within bench(). That makes a huge difference.

madsbk avatar Oct 31 '17 11:10 madsbk

I see. I think there should be a guide on how to write efficient code with Bohrium - seems that even after a year of usage there are still many surprises.

dionhaefner avatar Nov 02 '17 21:11 dionhaefner

Sorry for reviving an old thread, I don't know if it's still relevant.

I discussed a similar issue with @madsbk today. I think the current solution is to convert to float after you've returned from the scope of bench(). Then Bohrium will still detect the temporary arrays. Alternatively, call del on the four temp arrays to make it clear that they are temporary.
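A minimal sketch of both suggestions, based on the benchmark from the original post. The function names `bench_convert_outside` and `bench_del_first` are made up for illustration, and the array is kept small so the snippet runs quickly. It assumes `np` is backed by Bohrium as in the original benchmark; with plain NumPy the two variants simply compute the same value. Only b, c, and d are deleted here, since e is still needed for the conversion. Whether the `del` variant recovers the full throughput has not been measured in this thread.

```python
import numpy as np

a = np.random.rand(200, 200)


def bench_convert_outside():
    b = a ** 2
    c = a * b
    d = c * np.sum(a, axis=1)[:, None] - a ** 3 + 17.2
    # Return the array result as-is; b, c, d go out of scope on return.
    return np.sum(a + b + c + d)


def bench_del_first():
    b = a ** 2
    c = a * b
    d = c * np.sum(a, axis=1)[:, None] - a ** 3 + 17.2
    e = np.sum(a + b + c + d)
    # Drop the named intermediates before the conversion forces a sync.
    del b, c, d
    return float(e)


# Option 1: convert only after bench has returned.
res_outside = float(bench_convert_outside())

# Option 2: convert inside, but after deleting the intermediates.
res_del = bench_del_first()

print("result: {:.2e} / {:.2e}".format(res_outside, res_del))
```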

Baekalfen avatar Nov 13 '18 12:11 Baekalfen