BladeDISC
Issues with the Quickstart for TensorFlow Users tutorial
Hello, I'm trying the Quickstart for TensorFlow Users at https://github.com/alibaba/BladeDISC/blob/main/docs/quickstart.md#quickstart-for-tensorflow-users. There seems to be an obvious mistake in the simple demo:
```python
import numpy as np
import tensorflow as tf

g = tf.Graph()
with g.as_default():
    # reduce_sum((x + y) * c) ^ 2
    x = tf.placeholder(shape=[None, None], dtype=tf.float32, name="x")
    y = tf.placeholder(shape=[None, None], dtype=tf.float32, name="y")
    c = tf.constant([[1], [2]], dtype=tf.float32, shape=(2, 1), name="c")
    t1 = x + y
    t2 = tf.matmul(t1, c)
    t3 = tf.reduce_sum(t2)
    ret = t3 * t3
    with tf.Session() as s:
        np_x = np.ones([10, 10]).astype(np.float32)
        np_y = np.ones([10, 10]).astype(np.float32)
        r = s.run(ret, {x: np_x, y: np_y})
        print("x.shape={}, y.shape={}, ret={}".format(np_x.shape, np_y.shape, r))
```
Running it fails with:

```
The sizes of the inputs of the matmul op are incompatible: In[0]: [10,10], In[1]: [2,1]
```

A second, more important problem I ran into is that this demo runs very slowly in the BladeDISC docker container. I built the image following https://github.com/alibaba/BladeDISC/blob/main/docs/install_with_docker.md; the tag is `latest-runtime-tensorflow1.15`. In this container, the demo takes more than 3 minutes to finish with DISC disabled, and more than 6 minutes with DISC enabled; that is, it is slower with DISC enabled. However, in another docker container on the same machine, also an environment with TensorFlow 1.15 and CUDA 11.0, the demo finishes within 5 seconds with DISC disabled. Has anyone else encountered a similar problem with this docker image?
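The shape error is easy to reproduce outside TensorFlow, since `tf.matmul` follows the same inner-dimension rule as NumPy's `@` operator. A minimal sketch (not taken from the tutorial) with the same shapes as the demo:

```python
import numpy as np

# Same shapes as the demo: (x + y) is [10, 10], c is [2, 1].
t1 = np.ones([10, 10], np.float32) + np.ones([10, 10], np.float32)
c = np.array([[1], [2]], np.float32)  # shape (2, 1)

try:
    t1 @ c  # inner dimensions 10 and 2 do not match
except ValueError as e:
    print("matmul rejected:", e)
```

A [10, 10] matrix can only be multiplied on the right by a matrix with 10 rows, so the (2, 1) constant is incompatible regardless of backend.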
1. Oh, it's indeed a bug. Thanks! Could you send a PR as well?
2. As for your second question: do you use the same TF Python package in the two containers? The official TF 1.15 package does not seem to support CUDA 11.0. Did you build it from source yourself? If so, what compute-capability setting (e.g. SM_75/SM_80) did you use when building TF? And which CUDA device do you run the test case on (e.g. T4, A10)? Is the compute capability of that device included in the build setting? If not, TF will JIT-compile all the CUDA kernels for the pre-defined ops at runtime, which can take a lot of time.
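A quick way to check the device side of this mismatch is to query the GPU's compute capability; this is a diagnostic sketch, assuming a driver recent enough to support the `compute_cap` query field (the TF library path in the second command is purely illustrative):

```shell
# Compute capability of the installed GPU (e.g. "8.0" for A100).
nvidia-smi --query-gpu=name,compute_cap --format=csv,noheader

# With the CUDA toolkit installed, cuobjdump can list which SM
# architectures a binary actually embeds (path is illustrative):
# cuobjdump --list-elf <path-to-tensorflow-shared-library>
```

If the device's capability is not among the architectures the TF build embeds, the slow first run is consistent with runtime PTX JIT compilation.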
Thanks for your reply! The TF in both containers comes from the docker images themselves. The TF in the BladeDISC docker image is version 1.15.0, and the one in the other image, which has normal performance, is version 1.15.4+nv. My CUDA device is an A100. Maybe the TF in the BladeDISC docker image was not built for SM_80?