
BERT on TensorFlow fails

coppock opened this issue on Nov 2, 2022 · 5 comments

On the r2.1 branch, the run inside the Docker container fails as shown:

(mlperf) $ python3 run.py --backend=tf --scenario=Offline
.
.
.
Running LoadGen test...
2022-11-02 14:19:55.493183: I tensorflow/stream_executor/platform/default/dso_loader.cc:42] Successfully opened dynamic library libcublas.so.10
2022-11-02 14:30:24.704249: E tensorflow/stream_executor/cuda/cuda_blas.cc:440] failed to run cuBLAS routine: CUBLAS_STATUS_NOT_SUPPORTED
2022-11-02 14:30:24.704299: E tensorflow/stream_executor/cuda/cuda_blas.cc:2453] Internal: failed BLAS call, see log for details                               
Traceback (most recent call last):
  File "/usr/local/lib/python3.6/dist-packages/tensorflow/python/client/session.py", line 1356, in _do_call
    return fn(*args)                                                            
  File "/usr/local/lib/python3.6/dist-packages/tensorflow/python/client/session.py", line 1341, in _run_fn
    options, feed_dict, fetch_list, target_list, run_metadata)
  File "/usr/local/lib/python3.6/dist-packages/tensorflow/python/client/session.py", line 1429, in _call_tf_sessionrun
    run_metadata)                       
tensorflow.python.framework.errors_impl.InternalError: 2 root error(s) found.
  (0) Internal: Blas xGEMMBatched launch failed : a.shape=[16,384,64], b.shape=[16,384,64], m=384, n=384, k=64, batch_size=16 
         [[{{node bert/encoder/layer_0/attention/self/MatMul}}]]
  (1) Internal: Blas xGEMMBatched launch failed : a.shape=[16,384,64], b.shape=[16,384,64], m=384, n=384, k=64, batch_size=16                
         [[{{node bert/encoder/layer_0/attention/self/MatMul}}]]
         [[logits/_11]]
0 successful operations.
0 derived errors ignored.

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "run.py", line 120, in <module>
    main()
  File "run.py", line 102, in main
    lg.StartTestWithLogSettings(sut.sut, sut.qsl.qsl, settings, log_settings)
  File "/workspace/tf_SUT.py", line 64, in issue_queries
    result = self.sess.run(["logits:0"], feed_dict=feeds)
  File "/usr/local/lib/python3.6/dist-packages/tensorflow/python/client/session.py", line 950, in run
    run_metadata_ptr)
  File "/usr/local/lib/python3.6/dist-packages/tensorflow/python/client/session.py", line 1173, in _run
    feed_dict_tensor, options, run_metadata)
  File "/usr/local/lib/python3.6/dist-packages/tensorflow/python/client/session.py", line 1350, in _do_run
    run_metadata)
  File "/usr/local/lib/python3.6/dist-packages/tensorflow/python/client/session.py", line 1370, in _do_call
    raise type(e)(node_def, op, message) 
tensorflow.python.framework.errors_impl.InternalError: 2 root error(s) found.
  (0) Internal: Blas xGEMMBatched launch failed : a.shape=[16,384,64], b.shape=[16,384,64], m=384, n=384, k=64, batch_size=16
         [[node bert/encoder/layer_0/attention/self/MatMul (defined at /workspace/tf_SUT.py:45) ]]
  (1) Internal: Blas xGEMMBatched launch failed : a.shape=[16,384,64], b.shape=[16,384,64], m=384, n=384, k=64, batch_size=16
         [[node bert/encoder/layer_0/attention/self/MatMul (defined at /workspace/tf_SUT.py:45) ]]
         [[logits/_11]]
0 successful operations.
0 derived errors ignored.

Original stack trace for 'bert/encoder/layer_0/attention/self/MatMul':
  File "run.py", line 120, in <module>
    main()
  File "run.py", line 68, in main
    sut = get_tf_sut(args)
  File "/workspace/tf_SUT.py", line 79, in get_tf_sut
    return BERT_TF_SUT(args)
  File "/workspace/tf_SUT.py", line 45, in __init__
    tf.import_graph_def(graph_def, name='')
  File "/usr/local/lib/python3.6/dist-packages/tensorflow/python/util/deprecation.py", line 507, in new_func
    return func(*args, **kwargs)
  File "/usr/local/lib/python3.6/dist-packages/tensorflow/python/framework/importer.py", line 443, in import_graph_def
    _ProcessNewOps(graph)
  File "/usr/local/lib/python3.6/dist-packages/tensorflow/python/framework/importer.py", line 236, in _ProcessNewOps
    for new_op in graph._add_new_tf_operations(compute_devices=False):  # pylint: disable=protected-access
  File "/usr/local/lib/python3.6/dist-packages/tensorflow/python/framework/ops.py", line 3751, in _add_new_tf_operations 
    for c_op in c_api_util.new_tf_operations(self)
  File "/usr/local/lib/python3.6/dist-packages/tensorflow/python/framework/ops.py", line 3751, in <listcomp>
    for c_op in c_api_util.new_tf_operations(self)
  File "/usr/local/lib/python3.6/dist-packages/tensorflow/python/framework/ops.py", line 3641, in _create_op_from_tf_operation
    ret = Operation(c_op, self)
  File "/usr/local/lib/python3.6/dist-packages/tensorflow/python/framework/ops.py", line 2005, in __init__
    self._traceback = tf_stack.extract_stack()

Segmentation fault (core dumped)

Looking into this, I suspected an out-of-memory condition on my GPU, but I'm using an NVIDIA A30 with 24 GB of memory, which I would think is plenty. In case it's helpful, the host is running Ubuntu 20.04 with NVIDIA driver version 520.61.05 and CUDA version 11.8.
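One thing I considered to test the out-of-memory theory is forcing TensorFlow to allocate GPU memory on demand rather than reserving nearly all of it up front, since batched cuBLAS GEMMs need workspace memory of their own. A minimal sketch, assuming tf_SUT.py builds a plain TF 1.x tf.Session (I haven't confirmed how the SUT actually constructs its session, so treat the names below as illustrative):

import tensorflow as tf

# Allocate GPU memory on demand instead of reserving (nearly) all of it up front.
config = tf.ConfigProto()
config.gpu_options.allow_growth = True
# Alternatively, cap the fraction of GPU memory TensorFlow may claim:
# config.gpu_options.per_process_gpu_memory_fraction = 0.8

sess = tf.Session(config=config)  # pass this config wherever the SUT creates its session

If the failure disappears with allow_growth (or a lower memory fraction), that would point to cuBLAS being starved of workspace memory rather than a genuine capacity limit on the card.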

coppock · Nov 02 '22 19:11