
Stuck after printing 'Successfully opened dynamic library libcublas.so.10.0'

Open zhez6 opened this issue 6 years ago • 45 comments

Description

I run this command:

t2t-trainer --problem=librispeech --model=transformer --data_dir=~/dataset/t2t/librispeech/ --output_dir=. --hparams_set=transformer_librispeech --worker_gpu=1

and it gets stuck after printing 'Successfully opened dynamic library libcublas.so.10.0'. With TF_CPP_MIN_VLOG_LEVEL=2 set, it keeps printing:

tensorflow/core/framework/model.cc:440] Starting optimization of tunable parameters
tensorflow/core/framework/model.cc:482] Number of tunable parameters: 0
tensorflow/core/kernels/data/model_dataset_op.cc:172] Waiting for 20480 ms.
tensorflow/core/framework/model.cc:440] Starting optimization of tunable parameters
tensorflow/core/framework/model.cc:482] Number of tunable parameters: 0
tensorflow/core/kernels/data/model_dataset_op.cc:172] Waiting for 40960 ms.
tensorflow/core/framework/model.cc:440] Starting optimization of tunable parameters
tensorflow/core/framework/model.cc:482] Number of tunable parameters: 0
tensorflow/core/kernels/data/model_dataset_op.cc:172] Waiting for 60000 ms.
tensorflow/core/framework/model.cc:440] Starting optimization of tunable parameters
tensorflow/core/framework/model.cc:482] Number of tunable parameters: 0
tensorflow/core/kernels/data/model_dataset_op.cc:172] Waiting for 60000 ms.

...

Environment information

OS: <your answer here>

$ pip freeze | grep tensor
mesh-tensorflow==0.0.5
tensor2tensor==1.13.4
tensorboard==1.14.0
tensorflow-datasets==1.0.2
tensorflow-estimator==1.14.0
tensorflow-gpu==1.14.0
tensorflow-metadata==0.14.0
tensorflow-probability==0.7.0
$ python -V
Python 3.6.5 :: Anaconda, Inc.

For bugs: reproduction and error logs

# Steps to reproduce:
# Error logs:
See the description above.

zhez6 avatar Jul 24 '19 16:07 zhez6

Description

I am having the same issue with both TensorFlow 1.14.0 and 1.14.1 builds that were compiled against CUDA 10.1.

Environment information

OS: Ubuntu 16.04.6 LTS

$ pip freeze | grep tensor
mesh-tensorflow==0.0.5
tensor2tensor==1.13.4
tensorboard==1.14.0
tensorflow==1.14.0
tensorflow-estimator==1.14.0
tensorflow-metadata==0.14.0
tensorflow-probability==0.7.0


$ python -V
Python 2.7.12

For bugs: reproduction and error logs

# Steps to reproduce:
t2t-trainer --problem=librispeech --model=transformer --data_dir=~/datasets/t2t/librispeech/ --output_dir=~/trainoutput/librispeech/ --hparams_set=transformer_librispeech --worker_gpu=1
# Error logs:
[...]
session_manager.py:500] Running local_init_op.
session_manager.py:502] Done running local_init_op.
basic_session_run_hooks.py:606] Saving checkpoints for 0 into ~/trainoutput/librispeech/model.ckpt.
tensorflow/stream_executor/platform/default/dso_loader.cc:42] Successfully opened dynamic library libcublas.so.10
tensorflow/core/framework/model.cc:440] Starting optimization of tunable parameters
tensorflow/core/framework/model.cc:482] Number of tunable parameters: 0
tensorflow/core/kernels/data/model_dataset_op.cc:172] Waiting for 20480 ms.
tensorflow/core/framework/model.cc:440] Starting optimization of tunable parameters
tensorflow/core/framework/model.cc:482] Number of tunable parameters: 0
tensorflow/core/kernels/data/model_dataset_op.cc:172] Waiting for 40960 ms.
tensorflow/core/framework/model.cc:440] Starting optimization of tunable parameters
tensorflow/core/framework/model.cc:482] Number of tunable parameters: 0
tensorflow/core/kernels/data/model_dataset_op.cc:172] Waiting for 60000 ms.
tensorflow/core/framework/model.cc:440] Starting optimization of tunable parameters
tensorflow/core/framework/model.cc:482] Number of tunable parameters: 0
tensorflow/core/kernels/data/model_dataset_op.cc:172] Waiting for 60000 ms.

cantwbr avatar Aug 02 '19 12:08 cantwbr

I get stuck at the same step when trying to run LibriSpeechCleanSmall + Transformer, and I never even see a log line like 'Starting optimization of tunable parameters'. It just gets stuck, stops logging, and keeps occupying the GPU without crashing.

Environment:

OS: Ubuntu 16.04.5 LTS

$ pip freeze | grep tensor
mesh-tensorflow==0.0.5
tensor2tensor==1.13.4
tensorboard==1.14.0
tensorflow-datasets==1.1.0
tensorflow-estimator==1.14.0rc1
tensorflow-gpu==1.14.0rc1
tensorflow-hub==0.4.0
tensorflow-metadata==0.14.0
tensorflow-probability==0.7.0

$ python -V
Python 3.5.2

$ cat /usr/local/cuda/version.txt
CUDA Version 10.0.130

For bugs: reproduction and error logs

# Steps to reproduce:
t2t-trainer --problem=librispeech_clean_small --model=transformer --hparams_set=transformer_librispeech --data_dir=/data/tensor2tensor/data/librispeech_clean_small --output_dir=/data/tensor2tensor/exp --train_steps=1000 --eval_steps=100 --verbosity=0
# Error logs:
[...]
2019-08-06 06:28:42.670483: I tensorflow/stream_executor/platform/default/dso_loader.cc:42] Successfully opened dynamic library libcudart.so.10.0
2019-08-06 06:28:43.090613: I tensorflow/stream_executor/platform/default/dso_loader.cc:42] Successfully opened dynamic library libcublas.so.10.0
2019-08-06 06:28:43.266243: I tensorflow/stream_executor/platform/default/dso_loader.cc:42] Successfully opened dynamic library libcufft.so.10.0
2019-08-06 06:28:43.328620: I tensorflow/stream_executor/platform/default/dso_loader.cc:42] Successfully opened dynamic library libcurand.so.10.0
2019-08-06 06:28:43.830164: I tensorflow/stream_executor/platform/default/dso_loader.cc:42] Successfully opened dynamic library libcusolver.so.10.0
2019-08-06 06:28:44.081212: I tensorflow/stream_executor/platform/default/dso_loader.cc:42] Successfully opened dynamic library libcusparse.so.10.0
2019-08-06 06:28:44.819371: I tensorflow/stream_executor/platform/default/dso_loader.cc:42] Successfully opened dynamic library libcudnn.so.7
2019-08-06 06:28:44.826114: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1763] Adding visible gpu devices: 0
2019-08-06 06:28:44.834796: I tensorflow/stream_executor/platform/default/dso_loader.cc:42] Successfully opened dynamic library libcudart.so.10.0
2019-08-06 06:28:44.839837: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1181] Device interconnect StreamExecutor with strength 1 edge matrix:
2019-08-06 06:28:44.839860: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1187]      0 
2019-08-06 06:28:44.839867: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1200] 0:   N 
2019-08-06 06:28:44.854920: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1326] Created TensorFlow device (/job:localhost/replica:0/task:0/device:GPU:0 with 11596 MB memory) -> physical GPU (device: 0, name: Tesla K80, pci bus id: 0000:8c:00.0, compute capability: 3.7)
2019-08-06 06:28:49.295630: W tensorflow/compiler/jit/mark_for_compilation_pass.cc:1412] (One-time warning): Not using XLA:CPU for cluster because envvar TF_XLA_FLAGS=--tf_xla_cpu_global_jit was not set.  If you want XLA:CPU, either set that envvar, or use experimental_jit_scope to enable XLA:CPU.  To confirm that XLA is active, pass --vmodule=xla_compilation_cache=1 (as a proper command-line flag, not via TF_XLA_FLAGS) or set the envvar XLA_FLAGS=--xla_hlo_profile.
I0806 06:28:56.309820 139654260696832 session_manager.py:500] Running local_init_op.
I0806 06:28:56.744457 139654260696832 session_manager.py:502] Done running local_init_op.
I0806 06:30:06.414134 139654260696832 basic_session_run_hooks.py:606] Saving checkpoints for 0 into /data/tensor2tensor/exp/librispeech_transformer_clean_small/model.ckpt.
2019-08-06 06:30:50.528524: I tensorflow/stream_executor/platform/default/dso_loader.cc:42] Successfully opened dynamic library libcublas.so.10.0

tenghaha avatar Aug 06 '19 07:08 tenghaha

Did anyone find a fix for this? I'm getting the exact same problem using t2t-decoder: depending on the model, it either hangs on 'Successfully opened dynamic library libcublas.so.10.0' or on 'Successfully opened dynamic library libcudnn.so.7'. This was not happening for me 2-3 weeks ago, and I'm not sure what has changed.

amin-nejad avatar Aug 21 '19 18:08 amin-nejad

Description

I also have the same issue. After "Successfully opened dynamic library libcublas.so.10.0", nothing happens, even after 3 days.

Environment

OS: Ubuntu 18.04.2 LTS

mesh-tensorflow==0.0.5
tensor2tensor==1.14.0
tensorboard==1.14.0
tensorflow-datasets==1.2.0
tensorflow-estimator==1.14.0
tensorflow-gan==1.0.0.dev0
tensorflow-gpu==1.14.0
tensorflow-metadata==0.14.0
tensorflow-probability==0.7.0

Python 3.6.8

CUDA Version 10.0.130

For bugs: reproduction and error logs

# Steps to reproduce:
t2t-trainer --worker_gpu=4 --model=transformer --hparams="batch_size=32" --hparams_set=transformer_librispeech_v1 --problem=librispeech_clean_small --train_steps=100000 --eval_steps=100 --local_eval_frequency=1000 --data_dir=/home/Librispeech/data --output_dir=/tmp/t2t.work/librispeech_clean_small.20190823
# Error logs:
[...]
2019-08-23 03:15:28.608309: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1187]      0 1 2 3
2019-08-23 03:15:28.608342: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1200] 0:   N Y N N
2019-08-23 03:15:28.608361: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1200] 1:   Y N N N
2019-08-23 03:15:28.608391: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1200] 2:   N N N Y
2019-08-23 03:15:28.608439: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1200] 3:   N N Y N
2019-08-23 03:15:28.614141: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1326] Created TensorFlow device (/job:localhost/replica:0/task:0/device:GPU:0 with 10619 MB memory) -> physical GPU (device: 0, name: GeForce GTX 1080 Ti, pci bus id: 0000:02:00.0, compute capability: 6.1)
2019-08-23 03:15:28.615641: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1326] Created TensorFlow device (/job:localhost/replica:0/task:0/device:GPU:1 with 10619 MB memory) -> physical GPU (device: 1, name: GeForce GTX 1080 Ti, pci bus id: 0000:03:00.0, compute capability: 6.1)
2019-08-23 03:15:28.617012: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1326] Created TensorFlow device (/job:localhost/replica:0/task:0/device:GPU:2 with 10619 MB memory) -> physical GPU (device: 2, name: GeForce GTX 1080 Ti, pci bus id: 0000:83:00.0, compute capability: 6.1)
2019-08-23 03:15:28.618385: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1326] Created TensorFlow device (/job:localhost/replica:0/task:0/device:GPU:3 with 10619 MB memory) -> physical GPU (device: 3, name: GeForce GTX 1080 Ti, pci bus id: 0000:84:00.0, compute capability: 6.1)
2019-08-23 03:15:31.222250: W tensorflow/compiler/jit/mark_for_compilation_pass.cc:1412] (One-time warning): Not using XLA:CPU for cluster because envvar TF_XLA_FLAGS=--tf_xla_cpu_global_jit was not set.  If you want XLA:CPU, either set that envvar, or use experimental_jit_scope to enable XLA:CPU.  To confirm that XLA is active, pass --vmodule=xla_compilation_cache=1 (as a proper command-line flag, not via TF_XLA_FLAGS) or set the envvar XLA_FLAGS=--xla_hlo_profile.
I0823 03:15:32.888142 140151858374464 session_manager.py:500] Running local_init_op.
I0823 03:15:33.431387 140151858374464 session_manager.py:502] Done running local_init_op.
I0823 03:15:55.611809 140151858374464 basic_session_run_hooks.py:606] Saving checkpoints for 0 into /tmp/t2t.work/librispeech_clean_small.20190823/model.ckpt.
2019-08-23 03:16:33.482639: I tensorflow/stream_executor/platform/default/dso_loader.cc:42] Successfully opened dynamic library libcublas.so.10.0

AaronSeunghi avatar Aug 26 '19 00:08 AaronSeunghi

Same here; then I get a crash with these logs:

runtime/cgo: pthread_create failed: Resource temporarily unavailable
runtime/cgo: pthread_create failed: Resource temporarily unavailable
SIGABRT: abort
PC=0x7f4c5672ee97 m=14 sigcode=18446744073709551610

goroutine 0 [idle]:
runtime: unknown pc 0x7f4c5672ee97
stack: frame={sp:0x7f4c237fd800, fp:0x0} stack=[0x7f4c22ffe290,0x7f4c237fde90)
00007f4c237fd700:  0000000000000000  0000000000000000
00007f4c237fd710:  0000000000000000  0000000000000000
00007f4c237fd720:  0000000000000000  0000000000000000
00007f4c237fd730:  0000000000000000  0000000000000000
00007f4c237fd740:  0000000000000000  0000000000000000
00007f4c237fd750:  0000000000000000  0000000000000000
00007f4c237fd760:  0000000000000000  0000000000000000
00007f4c237fd770:  0000000000000000  0000000000000000
00007f4c237fd780:  0000000000000000  0000000000000000
00007f4c237fd790:  0000000000000000  0000000000000000
00007f4c237fd7a0:  0000000000000000  0000000000000000
00007f4c237fd7b0:  0000000000000000  0000000000000000
00007f4c237fd7c0:  0000000000000000  0000000000000000
00007f4c237fd7d0:  0000000000000000  0000000000000000
00007f4c237fd7e0:  0000000000000000  0000000000000000
00007f4c237fd7f0:  0000000000000000  0000000000000000
00007f4c237fd800: <0000000000000000  0000000000000000
00007f4c237fd810:  0000000000000000  0000000000000000
00007f4c237fd820:  0000000000000000  0000000000000000
00007f4c237fd830:  0000000000000000  0000000000000000
00007f4c237fd840:  0000000000000000  0000000000000000
00007f4c237fd850:  0000000000000000  0000000000000000
00007f4c237fd860:  0000000000000000  0000000000000000
00007f4c237fd870:  0000000000000000  0000000000000000
00007f4c237fd880:  fffffffe7fffffff  ffffffffffffffff
00007f4c237fd890:  ffffffffffffffff  ffffffffffffffff
00007f4c237fd8a0:  ffffffffffffffff  ffffffffffffffff
00007f4c237fd8b0:  ffffffffffffffff  ffffffffffffffff
00007f4c237fd8c0:  ffffffffffffffff  ffffffffffffffff
00007f4c237fd8d0:  ffffffffffffffff  ffffffffffffffff
00007f4c237fd8e0:  ffffffffffffffff  ffffffffffffffff
00007f4c237fd8f0:  ffffffffffffffff  ffffffffffffffff
runtime: unknown pc 0x7f4c5672ee97
stack: frame={sp:0x7f4c237fd800, fp:0x0} stack=[0x7f4c22ffe290,0x7f4c237fde90)
00007f4c237fd700:  0000000000000000  0000000000000000
00007f4c237fd710:  0000000000000000  0000000000000000
00007f4c237fd720:  0000000000000000  0000000000000000
00007f4c237fd730:  0000000000000000  0000000000000000
00007f4c237fd740:  0000000000000000  0000000000000000
00007f4c237fd750:  0000000000000000  0000000000000000
00007f4c237fd760:  0000000000000000  0000000000000000
00007f4c237fd770:  0000000000000000  0000000000000000
00007f4c237fd780:  0000000000000000  0000000000000000
00007f4c237fd790:  0000000000000000  0000000000000000
00007f4c237fd7a0:  0000000000000000  0000000000000000
00007f4c237fd7b0:  0000000000000000  0000000000000000
00007f4c237fd7c0:  0000000000000000  0000000000000000
00007f4c237fd7d0:  0000000000000000  0000000000000000
00007f4c237fd7e0:  0000000000000000  0000000000000000
00007f4c237fd7f0:  0000000000000000  0000000000000000
00007f4c237fd800: <0000000000000000  0000000000000000
00007f4c237fd810:  0000000000000000  0000000000000000
00007f4c237fd820:  0000000000000000  0000000000000000
00007f4c237fd830:  0000000000000000  0000000000000000
00007f4c237fd840:  0000000000000000  0000000000000000
00007f4c237fd850:  0000000000000000  0000000000000000
00007f4c237fd860:  0000000000000000  0000000000000000
00007f4c237fd870:  0000000000000000  0000000000000000
00007f4c237fd880:  fffffffe7fffffff  ffffffffffffffff
00007f4c237fd890:  ffffffffffffffff  ffffffffffffffff
00007f4c237fd8a0:  ffffffffffffffff  ffffffffffffffff
00007f4c237fd8b0:  ffffffffffffffff  ffffffffffffffff
00007f4c237fd8c0:  ffffffffffffffff  ffffffffffffffff
00007f4c237fd8d0:  ffffffffffffffff  ffffffffffffffff
00007f4c237fd8e0:  ffffffffffffffff  ffffffffffffffff
00007f4c237fd8f0:  ffffffffffffffff  ffffffffffffffff

goroutine 1 [IO wait, 2 minutes]:
internal/poll.runtime_pollWait(0x7f4c57084f00, 0x72, 0xc4205e7550)
	/usr/local/go/src/runtime/netpoll.go:173 +0x59
internal/poll.(*pollDesc).wait(0xc42020c118, 0x72, 0xffffffffffffff00, 0x560ca4e9aaa0, 0x560ca592daa8)
	/usr/local/go/src/internal/poll/fd_poll_runtime.go:85 +0x9d
internal/poll.(*pollDesc).waitRead(0xc42020c118, 0xc4204ed000, 0x1000, 0x1000)
	/usr/local/go/src/internal/poll/fd_poll_runtime.go:90 +0x3f
internal/poll.(*FD).Read(0xc42020c100, 0xc4204ed000, 0x1000, 0x1000, 0x0, 0x0, 0x0)
	/usr/local/go/src/internal/poll/fd_unix.go:157 +0x17f
net.(*netFD).Read(0xc42020c100, 0xc4204ed000, 0x1000, 0x1000, 0x9c5, 0x0, 0x0)
	/usr/local/go/src/net/fd_unix.go:202 +0x51
net.(*conn).Read(0xc420394878, 0xc4204ed000, 0x1000, 0x1000, 0x0, 0x0, 0x0)
	/usr/local/go/src/net/net.go:176 +0x6c
net/http.(*persistConn).Read(0xc420288fc0, 0xc4204ed000, 0x1000, 0x1000, 0xc420618050, 0xc4204ed9c3, 0x2)
	/usr/local/go/src/net/http/transport.go:1453 +0x138
bufio.(*Reader).fill(0xc420388c00)
	/usr/local/go/src/bufio/bufio.go:100 +0x120
bufio.(*Reader).ReadSlice(0xc420388c00, 0xc42000010a, 0x300000002, 0xc420000180, 0xc4205e77c0, 0x560ca31cf92b, 0xc420000180)
	/usr/local/go/src/bufio/bufio.go:341 +0x2e
net/http/internal.readChunkLine(0xc420388c00, 0x1, 0x3, 0xc420050a70, 0xc420050a00, 0xc4200aa0d8)
	/usr/local/go/src/net/http/internal/chunked.go:122 +0x36
net/http/internal.(*chunkedReader).beginChunk(0xc420618030)
	/usr/local/go/src/net/http/internal/chunked.go:48 +0x34
net/http/internal.(*chunkedReader).Read(0xc420618030, 0xc420652000, 0x8009, 0x8009, 0xc4205e78e0, 0x560ca31c56db, 0x560c00000008)
	/usr/local/go/src/net/http/internal/chunked.go:93 +0x115
net/http.(*body).readLocked(0xc42061e000, 0xc420652000, 0x8009, 0x8009, 0x0, 0x0, 0xc4205e79a0)
	/usr/local/go/src/net/http/transfer.go:778 +0x63
net/http.(*body).Read(0xc42061e000, 0xc420652000, 0x8009, 0x8009, 0x0, 0x0, 0x0)
	/usr/local/go/src/net/http/transfer.go:770 +0xdf
net/http.(*bodyEOFSignal).Read(0xc42061e040, 0xc420652000, 0x8009, 0x8009, 0x0, 0x0, 0x0)
	/usr/local/go/src/net/http/transport.go:2187 +0xde
github.com/docker/cli/vendor/github.com/docker/docker/pkg/stdcopy.StdCopy(0x560ca4e96240, 0xc4204371d0, 0x560ca4e98760, 0xc42000e020, 0x560ca4e984c0, 0xc42061e040, 0x560ca59b9c20, 0x0, 0x0)
	/go/src/github.com/docker/cli/vendor/github.com/docker/docker/pkg/stdcopy/stdcopy.go:108 +0xe2
github.com/docker/cli/cli/command/container.runLogs(0x560ca4ed7820, 0xc4203abb00, 0xc42003bef0, 0x0, 0x0)
	/go/src/github.com/docker/cli/cli/command/container/logs.go:77 +0x442
github.com/docker/cli/cli/command/container.NewLogsCommand.func1(0xc42040d680, 0xc4201fa900, 0x1, 0x2, 0x0, 0x0)
	/go/src/github.com/docker/cli/cli/command/container/logs.go:35 +0x6e
github.com/docker/cli/vendor/github.com/spf13/cobra.(*Command).execute(0xc42040d680, 0xc42003a170, 0x2, 0x2, 0xc42040d680, 0xc42003a170)
	/go/src/github.com/docker/cli/vendor/github.com/spf13/cobra/command.go:762 +0x46a
github.com/docker/cli/vendor/github.com/spf13/cobra.(*Command).ExecuteC(0xc4203b1680, 0xc42026bfb0, 0x560ca4b916c0, 0xc42026bfc0)
	/go/src/github.com/docker/cli/vendor/github.com/spf13/cobra/command.go:852 +0x30c
github.com/docker/cli/vendor/github.com/spf13/cobra.(*Command).Execute(0xc4203b1680, 0xc4203b1680, 0x560ca4e98760)
	/go/src/github.com/docker/cli/vendor/github.com/spf13/cobra/command.go:800 +0x2d
main.main()
	/go/src/github.com/docker/cli/cmd/docker/docker.go:180 +0xde

goroutine 5 [syscall, 2 minutes]:
os/signal.signal_recv(0x0)
	/usr/local/go/src/runtime/sigqueue.go:139 +0xa8
os/signal.loop()
	/usr/local/go/src/os/signal/signal_unix.go:22 +0x24
created by os/signal.init.0
	/usr/local/go/src/os/signal/signal_unix.go:28 +0x43

goroutine 40 [chan receive, 2 minutes]:
github.com/docker/cli/vendor/github.com/golang/glog.(*loggingT).flushDaemon(0x560ca599b2e0)
	/go/src/github.com/docker/cli/vendor/github.com/golang/glog/glog.go:882 +0x8d
created by github.com/docker/cli/vendor/github.com/golang/glog.init.0
	/go/src/github.com/docker/cli/vendor/github.com/golang/glog/glog.go:410 +0x205

goroutine 15 [select, 2 minutes]:
net/http.(*persistConn).readLoop(0xc420288fc0)
	/usr/local/go/src/net/http/transport.go:1717 +0x745
created by net/http.(*Transport).dialConn
	/usr/local/go/src/net/http/transport.go:1237 +0x95c

goroutine 16 [select, 2 minutes]:
net/http.(*persistConn).writeLoop(0xc420288fc0)
	/usr/local/go/src/net/http/transport.go:1822 +0x14d
created by net/http.(*Transport).dialConn
	/usr/local/go/src/net/http/transport.go:1238 +0x981

goroutine 43 [IO wait, 2 minutes]:
internal/poll.runtime_pollWait(0x7f4c57084e30, 0x72, 0xc4200889a8)
	/usr/local/go/src/runtime/netpoll.go:173 +0x59
internal/poll.(*pollDesc).wait(0xc420622198, 0x72, 0xffffffffffffff00, 0x560ca4e9aaa0, 0x560ca592daa8)
	/usr/local/go/src/internal/poll/fd_poll_runtime.go:85 +0x9d
internal/poll.(*pollDesc).waitRead(0xc420622198, 0xc420626000, 0x1000, 0x1000)
	/usr/local/go/src/internal/poll/fd_poll_runtime.go:90 +0x3f
internal/poll.(*FD).Read(0xc420622180, 0xc420626000, 0x1000, 0x1000, 0x0, 0x0, 0x0)
	/usr/local/go/src/internal/poll/fd_unix.go:157 +0x17f
net.(*netFD).Read(0xc420622180, 0xc420626000, 0x1000, 0x1000, 0x560ca31efc10, 0xc420000180, 0x4)
	/usr/local/go/src/net/fd_unix.go:202 +0x51
net.(*conn).Read(0xc42000e048, 0xc420626000, 0x1000, 0x1000, 0x0, 0x0, 0x0)
	/usr/local/go/src/net/net.go:176 +0x6c
net/http.(*persistConn).Read(0xc42028ad80, 0xc420626000, 0x1000, 0x1000, 0xc420088b98, 0x560ca319fde5, 0xc4203ca360)
	/usr/local/go/src/net/http/transport.go:1453 +0x138
bufio.(*Reader).fill(0xc420616300)
	/usr/local/go/src/bufio/bufio.go:100 +0x120
bufio.(*Reader).Peek(0xc420616300, 0x1, 0x0, 0x0, 0x0, 0xc4203ca2a0, 0x0)
	/usr/local/go/src/bufio/bufio.go:132 +0x3c
net/http.(*persistConn).readLoop(0xc42028ad80)
	/usr/local/go/src/net/http/transport.go:1601 +0x187
created by net/http.(*Transport).dialConn
	/usr/local/go/src/net/http/transport.go:1237 +0x95c

goroutine 44 [select, 2 minutes]:
net/http.(*persistConn).writeLoop(0xc42028ad80)
	/usr/local/go/src/net/http/transport.go:1822 +0x14d
created by net/http.(*Transport).dialConn
	/usr/local/go/src/net/http/transport.go:1238 +0x981

rax    0x0
rbx    0x7f4c56adc840
rcx    0x7f4c5672ee97
rdx    0x0
rdi    0x2
rsi    0x7f4c237fd800
rbp    0x560ca464f220
rsp    0x7f4c237fd800
r8     0x0
r9     0x7f4c237fd800
r10    0x8
r11    0x246
r12    0x560ca629a1b0
r13    0xf1
r14    0x11
r15    0x0
rip    0x7f4c5672ee97
rflags 0x246
cs     0x33
fs     0x0
gs     0x0

clementbmn avatar Aug 26 '19 09:08 clementbmn

I think the best idea is to report on the TF and google colab lists as this does not look like an error specific to T2T.

lukaszkaiser avatar Aug 27 '19 00:08 lukaszkaiser

Continued here: tensorflow/tensorflow#32017

cantwbr avatar Aug 28 '19 12:08 cantwbr

@rachellim at TF was able to reproduce and resolve the hang (parallel_interleave_dataset_op.cc doesn't handle iterator creation errors correctly when sloppy=True).

With her fix in place (or with sloppy=False), the training now halts with Conv2D errors instead.

Other interesting clues (reported by @huang-haijie): setting audio_add_delta_deltas from True to False, OR setting audio_preproc_in_bottom from False to True, prevents the Conv2D error (these can be passed as hparams overrides; see the sketch below).

It still seems like it may be a TF issue, as the same T2T code works fine with TF 1.13.2 but fails with the Conv2D errors on TF 1.14.0. Any suggestions for next steps would be appreciated.
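
A rough example of passing those overrides on the command line, using the --hparams flag already shown elsewhere in this thread (the other flags are copied from the reproduction commands above and are only illustrative):

# Hypothetical run with the audio hparams workaround applied:
t2t-trainer \
  --problem=librispeech_clean_small \
  --model=transformer \
  --hparams_set=transformer_librispeech \
  --hparams="audio_add_delta_deltas=False" \
  --data_dir=/data/tensor2tensor/data/librispeech_clean_small \
  --output_dir=/data/tensor2tensor/exp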

mschonwe avatar Sep 19 '19 17:09 mschonwe

Continued here: tensorflow/tensorflow#32691

mschonwe avatar Sep 21 '19 00:09 mschonwe

> (Quoting @AaronSeunghi's report from Aug 26 '19, which is reproduced verbatim earlier in this thread, so it is omitted here.)

Was this issue ever resolved? I'm running in Google Colab with a GPU on a small dataset, ~3 GB (basic speech commands).

Arvindia avatar Feb 22 '20 11:02 Arvindia

The same happens with tf.nn.conv3d used inside a map(...) on a tf.data.Dataset. Any updates on how to solve this?

ramonemiliani93 avatar Aug 08 '20 03:08 ramonemiliani93

@ramonemiliani93, what version of TensorFlow are you running? I was not able to reproduce this issue with the following snippet:

import numpy as np
import tensorflow as tf

def make_tensor(sizes):
  # Fill a tensor of the given shape with 1.0, 2.0, 3.0, ...
  return np.asarray([f * 1.0 for f in range(1, np.prod(sizes) + 1)]).reshape(sizes)

filter = make_tensor([1, 1, 1, 3, 3])  # [depth, height, width, in_channels, out_channels]
x = make_tensor([10, 2, 3, 1, 3])      # [batch, depth, height, width, in_channels]
dataset = tf.data.Dataset.from_tensors((x, filter))
dataset = dataset.map(
    lambda input, filter: tf.nn.conv3d(input, filter, strides=[1, 1, 1, 1, 1], padding="VALID"))
print(list(dataset))

So it doesn't seem to be an issue with using tf.nn.conv3d inside map. Can you provide a minimal repro?

rachellim avatar Aug 10 '20 21:08 rachellim

Based on https://github.com/tensorflow/tensorflow/issues/38100 and https://github.com/f90/FactorGAN/issues/1, I suspect this may be a problem with your CUDA installation.
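
For anyone who wants to sanity-check that, a few generic commands for inspecting the local CUDA/cuDNN setup (standard driver and loader tooling, nothing specific to this issue; adjust if your install lives in a non-default location):

nvidia-smi                      # driver version and GPUs visible to the system
nvcc --version                  # CUDA toolkit version on the PATH
ldconfig -p | grep libcublas    # which cuBLAS the dynamic loader resolves
ldconfig -p | grep libcudnn     # which cuDNN the dynamic loader resolves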

rachellim avatar Aug 10 '20 21:08 rachellim

+1. Same issue. It used to work and then died without warning on EC2. Later, when I reload the pretrained model, it hangs with no output.

harishkashyap avatar Sep 15 '20 17:09 harishkashyap

@harishkashyap - what version of TensorFlow are you using? If you use an older version, does it still work? (Trying to diagnose whether it's an issue with your CUDA installation or a regression in TF.)

rachellim avatar Sep 15 '20 19:09 rachellim

EC2 PyTorch AMI, TensorFlow 2.3.

harishkashyap avatar Sep 15 '20 19:09 harishkashyap

No idea. I just used an Amazon Linux 2 AMI preinstalled with PyTorch. It was working fine and now fails to load the pre-trained model.

harishkashyap avatar Sep 15 '20 20:09 harishkashyap

@sanjoy, can you reassign this to someone on the GPU team to investigate?

rachellim avatar Sep 16 '20 17:09 rachellim

Same issue.

skaldek avatar Sep 24 '20 20:09 skaldek

I'm trying to run a TF object detection model and I'm getting the same issue: stuck after 'Successfully opened dynamic library libcuda.so.1'.

Please, can someone help?

AzinPoshtyar avatar Nov 03 '20 00:11 AzinPoshtyar

I had this problem using the Anaconda cudatoolkit. I ended up using nvidia-docker for my CUDA/cuDNN installation instead, and now it works.
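
A minimal sketch of that route, assuming Docker 19.03+ with the NVIDIA container toolkit; the image tag and mounted path below are only illustrative:

# Run inside a container that ships its own CUDA/cuDNN:
docker run --gpus all -it --rm \
  -v /data/tensor2tensor:/data/tensor2tensor \
  tensorflow/tensorflow:1.14.0-gpu-py3 bash
# then, e.g., pip install tensor2tensor inside the container before running t2t-trainer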

lminer avatar Nov 03 '20 00:11 lminer

Same issue on Ubuntu 20.04 with an RTX 3090. TF was installed using Anaconda.

flint-xf-fan avatar Apr 29 '21 08:04 flint-xf-fan

Same issue with Ubuntu 20.04.

Python version:

Python 3.9.2 | packaged by conda-forge | (default, Feb 21 2021, 05:02:46)

Tensorflow version:

❯ conda list | grep tensorflow
tensorflow                2.4.1           gpu_py39h8236f22_0  
tensorflow-base           2.4.1           gpu_py39h29c2da4_0  
tensorflow-estimator      2.4.1              pyheb71bc4_0  
tensorflow-gpu            2.4.1                h30adc30_0 
❯ python
Python 3.9.2 | packaged by conda-forge | (default, Feb 21 2021, 05:02:46) 
[GCC 9.3.0] on linux
Type "help", "copyright", "credits" or "license" for more information.

>>> import tensorflow as tf
2021-04-30 14:32:10.400682: I tensorflow/stream_executor/platform/default/dso_loader.cc:49] Successfully opened dynamic library libcudart.so.10.1

>>> tf.add(1, 2)
2021-04-30 14:32:26.991352: I tensorflow/compiler/jit/xla_cpu_device.cc:41] Not creating XLA devices, tf_xla_enable_xla_devices not set
2021-04-30 14:32:26.993581: I tensorflow/stream_executor/platform/default/dso_loader.cc:49] Successfully opened dynamic library libcuda.so.1
2021-04-30 14:32:27.026516: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:941] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2021-04-30 14:32:27.027085: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1720] Found device 0 with properties: 
pciBusID: 0000:01:00.0 name: GeForce GTX 960M computeCapability: 5.0
coreClock: 1.176GHz coreCount: 5 deviceMemorySize: 3.95GiB deviceMemoryBandwidth: 74.65GiB/s
2021-04-30 14:32:27.027104: I tensorflow/stream_executor/platform/default/dso_loader.cc:49] Successfully opened dynamic library libcudart.so.10.1
2021-04-30 14:32:27.028731: I tensorflow/stream_executor/platform/default/dso_loader.cc:49] Successfully opened dynamic library libcublas.so.10
2021-04-30 14:32:27.028771: I tensorflow/stream_executor/platform/default/dso_loader.cc:49] Successfully opened dynamic library libcublasLt.so.10
2021-04-30 14:32:27.030183: I tensorflow/stream_executor/platform/default/dso_loader.cc:49] Successfully opened dynamic library libcufft.so.10
2021-04-30 14:32:27.030438: I tensorflow/stream_executor/platform/default/dso_loader.cc:49] Successfully opened dynamic library libcurand.so.10
2021-04-30 14:32:27.032093: I tensorflow/stream_executor/platform/default/dso_loader.cc:49] Successfully opened dynamic library libcusolver.so.10
2021-04-30 14:32:27.033044: I tensorflow/stream_executor/platform/default/dso_loader.cc:49] Successfully opened dynamic library libcusparse.so.10
2021-04-30 14:32:27.036642: I tensorflow/stream_executor/platform/default/dso_loader.cc:49] Successfully opened dynamic library libcudnn.so.7
2021-04-30 14:32:27.036780: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:941] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2021-04-30 14:32:27.037699: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:941] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2021-04-30 14:32:27.038214: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1862] Adding visible gpu devices: 0
2021-04-30 14:32:27.038509: I tensorflow/core/platform/cpu_feature_guard.cc:142] This TensorFlow binary is optimized with oneAPI Deep Neural Network Library (oneDNN) to use the following CPU instructions in performance-critical operations:  SSE4.1 SSE4.2 AVX AVX2 FMA
To enable them in other operations, rebuild TensorFlow with the appropriate compiler flags.
2021-04-30 14:32:27.038916: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:941] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2021-04-30 14:32:27.039462: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1720] Found device 0 with properties: 
pciBusID: 0000:01:00.0 name: GeForce GTX 960M computeCapability: 5.0
coreClock: 1.176GHz coreCount: 5 deviceMemorySize: 3.95GiB deviceMemoryBandwidth: 74.65GiB/s
2021-04-30 14:32:27.039482: I tensorflow/stream_executor/platform/default/dso_loader.cc:49] Successfully opened dynamic library libcudart.so.10.1
2021-04-30 14:32:27.039507: I tensorflow/stream_executor/platform/default/dso_loader.cc:49] Successfully opened dynamic library libcublas.so.10
2021-04-30 14:32:27.039522: I tensorflow/stream_executor/platform/default/dso_loader.cc:49] Successfully opened dynamic library libcublasLt.so.10
2021-04-30 14:32:27.039536: I tensorflow/stream_executor/platform/default/dso_loader.cc:49] Successfully opened dynamic library libcufft.so.10
2021-04-30 14:32:27.039550: I tensorflow/stream_executor/platform/default/dso_loader.cc:49] Successfully opened dynamic library libcurand.so.10
2021-04-30 14:32:27.039563: I tensorflow/stream_executor/platform/default/dso_loader.cc:49] Successfully opened dynamic library libcusolver.so.10
2021-04-30 14:32:27.039577: I tensorflow/stream_executor/platform/default/dso_loader.cc:49] Successfully opened dynamic library libcusparse.so.10
2021-04-30 14:32:27.039590: I tensorflow/stream_executor/platform/default/dso_loader.cc:49] Successfully opened dynamic library libcudnn.so.7
2021-04-30 14:32:27.039645: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:941] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2021-04-30 14:32:27.040215: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:941] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2021-04-30 14:32:27.040651: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1862] Adding visible gpu devices: 0
2021-04-30 14:32:27.040677: I tensorflow/stream_executor/platform/default/dso_loader.cc:49] Successfully opened dynamic library libcudart.so.10.1

reza-ebrahimi avatar Apr 30 '21 11:04 reza-ebrahimi

Same issue on Ubuntu 20.04 with an RTX 3090. TF was installed using Anaconda.

Same for me

ORippler avatar May 18 '21 10:05 ORippler

Same issue with CUDA 10.0, cuDNN 7.4, TensorFlow 1.14.0. The object detection job gets stuck at: I tensorflow/stream_executor/platform/default/dso_loader.cc:42] Successfully opened dynamic library libcublas.so.10

and sometimes at: I tensorflow/stream_executor/platform/default/dso_loader.cc:42] Successfully opened dynamic library libcudnn.so.7

Did anyone figure it out? REALLY NEED HELP!

alanzyt311 avatar Jun 04 '21 08:06 alanzyt311

For such cases, it'd be useful to get a stacktrace of where TF is stuck.

You can obtain this using gdb: start gdb, attach to the hung TF process, and then get a backtrace.
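
For example (the PID 12345 below is a placeholder; substitute the PID of the hung trainer process):

ps aux | grep t2t-trainer        # find the PID of the hung process
sudo gdb -p 12345                # attach gdb to it
(gdb) thread apply all bt        # dump a backtrace of every thread
(gdb) detach
(gdb) quit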

sanjoy avatar Jun 04 '21 15:06 sanjoy

Hi @sanjoy, thanks for your advice.

I tried to find the process ID of the running Python program, attached with gdb attach PID, and then used bt to get a backtrace. However, it returns 'No stack'.

Then I tried ps -ef | grep tensorflow-gpu | grep -v grep, and it returns nothing. Does that mean the problem has nothing to do with TensorFlow?

I've also tried PyTorch on the same machine, and it behaves similarly: it also takes a long time to load some CUDA libraries.

Below are the details of my situation:

GPU: GeForce RTX 3060; Driver Version: 460.73.01; CUDA Driver Version: 11.2

TensorFlow: tensorflow-gpu 1.14.0; CUDA Runtime Version: 10.0; cuDNN: 7.4.1 (the CUDA Runtime and cuDNN versions match the official TensorFlow compatibility guide)

I've run the following TensorFlow checks and they all pass: tf.test.is_built_with_cuda() and tf.test.is_gpu_available() (a one-liner form is shown below for reference).
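
For reference, the same checks as a single command (both functions are the TF 1.x APIs named above):

python -c "import tensorflow as tf; print(tf.test.is_built_with_cuda(), tf.test.is_gpu_available())"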

My situation is that the program gets stuck for several minutes at '2021-06-05 12:16:54.099778: I tensorflow/stream_executor/platform/default/dso_loader.cc:42] Successfully opened dynamic library libcublas.so.10', and sometimes for even longer at another loading step, '2021-06-05 12:21:22.212818: I tensorflow/stream_executor/platform/default/dso_loader.cc:42] Successfully opened dynamic library libcudnn.so.7'. See the attached log.txt for details.

After waiting for around 30 minutes, the program continues running and WORKS WELL. So the MAJOR PROBLEM is that it takes a long time to load the CUDA-related libraries (I guess), and I don't know how to locate and resolve the cause.

alanzyt311 avatar Jun 05 '21 04:06 alanzyt311

@alanzyt311 I missed that you're running TF 1.14. 1.14 is very old and does not have native support for your GPU (which I believe is Ampere-based), so TensorFlow blocks at startup while it JIT-compiles PTX to SASS for your GPU, which can take 30+ minutes.

Can you please try running with TF 2.5?
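
A generic mitigation for the PTX-to-SASS delay (a standard CUDA driver setting, not specific to TF or this issue): the driver caches the compiled SASS between runs, and enlarging that cache helps ensure the long compile only happens once. For example:

# Enlarge the CUDA driver's JIT compilation cache (value in bytes; 4 GiB here).
export CUDA_CACHE_MAXSIZE=4294967296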

sanjoy avatar Jun 05 '21 05:06 sanjoy

Thanks. I've just tried tensorflow-gpu 2.0 and it still doesn't work (same problem as above). Now I'm going to try TF 2.5.

alanzyt311 avatar Jun 05 '21 07:06 alanzyt311

Same problem: I have to wait more than 15 minutes before training starts, and then it throws loss=NaN.

I have to quit training and check my files before starting another run, and the same thing occurs again. I've been stuck here for two days: 1) it takes around 10-15 minutes to start training, and 2) it throws loss=NaN.

Any help?

I also decreased the learning rate and batch size; nothing works.

I am using Docker with the following details:

nvcc: NVIDIA (R) Cuda compiler driver
Copyright (c) 2005-2019 NVIDIA Corporation
Built on Sun_Jul_28_19:07:16_PDT_2019
Cuda compilation tools, release 10.1, V10.1.243
nvidia-smi (excerpt): GPU 0, GeForce RTX 3060, 00000000:0B:00.0, 0% fan, 60C, P2, 41W / 180W, 11391MiB / 12031MiB used

This is my output while waiting for training to run:

2021-06-10 07:48:07.953350: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1108]      0 
2021-06-10 07:48:07.953360: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1121] 0:   N 
2021-06-10 07:48:07.953478: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:981] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2021-06-10 07:48:07.953996: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:981] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2021-06-10 07:48:07.954461: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1247] Created TensorFlow device (/job:localhost/replica:0/task:0/device:GPU:0 with 10923 MB memory) -> physical GPU (device: 0, name: GeForce RTX 3060, pci bus id: 0000:0b:00.0, compute capability: 8.6)
INFO:tensorflow:Using MirroredStrategy with devices ('/job:localhost/replica:0/task:0/device:GPU:0',)
I0610 07:48:07.957259 139983451281216 mirrored_strategy.py:500] Using MirroredStrategy with devices ('/job:localhost/replica:0/task:0/device:GPU:0',)
INFO:tensorflow:Maybe overwriting train_steps: None
I0610 07:48:07.961450 139983451281216 config_util.py:552] Maybe overwriting train_steps: None
INFO:tensorflow:Maybe overwriting use_bfloat16: False
I0610 07:48:07.961535 139983451281216 config_util.py:552] Maybe overwriting use_bfloat16: False

Laudarisd avatar Jun 10 '21 07:06 Laudarisd