
Training randomly freezes when training AlexNet from scratch.

Open zym1010 opened this issue 7 years ago • 18 comments

Sometimes the training process simply gets stuck at testing.

Epoch: [0][5000/5005]   Time 0.100 (0.335)      Data 0.000 (0.244)      Loss 5.9800 (6.5614)    Prec@1 1.953 (0.735)    Prec@5 7.812 (2.896)
Test: [0/196]   Time 7.905 (7.905)      Loss 4.1344 (4.1344)    Prec@1 16.016 (16.016)  Prec@5 51.562 (51.562)

Or, more frequently, the line Test: [0/196] does not appear at all and the whole process gets stuck at the line Epoch: [0][5000/5005].

It has been stuck like this for several hours, and top shows that no process is using any CPU.

To train the network, I ran CUDA_VISIBLE_DEVICES=1 PYTHONUNBUFFERED=1 python main.py -a alexnet --print-freq 20 --lr 0.01 --workers 20 --batch-size 256 /ssd/cv_datasets/ILSVRC2015/Data/CLS-LOC 2>&1 | tee alexnet_train.log

This happens on both a CentOS 6 machine and an Ubuntu 14.04 machine.
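Since the hung process sits idle with no CPU usage, one way to see where it is blocked without killing it is to have Python dump every thread's stack when it receives a signal. The following is only a minimal sketch using the standard library's faulthandler module; it is not part of main.py, and the helper name install_hang_diagnostics and its parameters are hypothetical. It assumes a POSIX system, since SIGUSR1 is not available on Windows.

```python
import faulthandler
import signal
import sys


def install_hang_diagnostics(timeout_s=None, stream=sys.stderr):
    """Hypothetical helper (not in main.py): make a hung process inspectable.

    After this call, sending SIGUSR1 to the process writes the Python stack
    of every thread to `stream`, and the process keeps running afterwards.
    `stream` must be a real file with a file descriptor (stderr qualifies).
    """
    faulthandler.register(signal.SIGUSR1, file=stream, all_threads=True)
    if timeout_s is not None:
        # Optionally also dump the stacks every `timeout_s` seconds,
        # which helps when nobody is watching the job.
        faulthandler.dump_traceback_later(timeout_s, repeat=True, file=stream)


if __name__ == "__main__":
    install_hang_diagnostics()
    # ... start training here; when it appears frozen, run from another shell:
    #     kill -USR1 <pid of this process>
```

With stderr redirected as in the command above (2>&1 | tee alexnet_train.log), the dump would end up in the same log file as the training output.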

zym1010 avatar Apr 20 '17 23:04 zym1010

@apaszke @soumith This is the output after I pressed Ctrl+C (on an Ubuntu 14.04 machine with a Titan Black and 64GB RAM). Is it in any way related to pytorch/pytorch#1120?

Epoch: [0][5000/5005]   Time 0.339 (0.340)      Data 0.000 (0.001)      Loss 5.6525 (6.5535)    Prec@1 3.125 (0.760)    Prec@5 12.891 (2.980)
^CProcess Process-40:
Process Process-38:
Process Process-39:
Process Process-35:
Process Process-34:
Process Process-24:
Process Process-26:
Process Process-36:
Process Process-37:
Process Process-33:
Process Process-22:
Traceback (most recent call last):
  File "main.py", line 289, in <module>
Process Process-27:
Process Process-29:
Process Process-30:
    main()
  File "main.py", line 134, in main
Process Process-25:
Process Process-28:
    prec1 = validate(val_loader, model, criterion)
  File "main.py", line 207, in validate
    for i, (input, target) in enumerate(val_loader):
  File "/home/yimengzh_everyday/miniconda2/envs/pytorch_openblas/lib/python3.5/site-packages/torch/utils/data/dataloader.py", line 168, in __next__
Process Process-32:
Process Process-31:
Process Process-23:
Traceback (most recent call last):
  File "/home/yimengzh_everyday/miniconda2/envs/pytorch_openblas/lib/python3.5/multiprocessing/process.py", line 249, in _bootstrap
    self.run()
  File "/home/yimengzh_everyday/miniconda2/envs/pytorch_openblas/lib/python3.5/multiprocessing/process.py", line 93, in run
    self._target(*self._args, **self._kwargs)
  File "/home/yimengzh_everyday/miniconda2/envs/pytorch_openblas/lib/python3.5/site-packages/torch/utils/data/dataloader.py", line 28, in _worker_loop
    r = index_queue.get()
  File "/home/yimengzh_everyday/miniconda2/envs/pytorch_openblas/lib/python3.5/multiprocessing/queues.py", line 342, in get
    with self._rlock:
  File "/home/yimengzh_everyday/miniconda2/envs/pytorch_openblas/lib/python3.5/multiprocessing/synchronize.py", line 96, in __enter__
    return self._semlock.__enter__()
Traceback (most recent call last):
Traceback (most recent call last):
KeyboardInterrupt
Traceback (most recent call last):
Traceback (most recent call last):
Traceback (most recent call last):
  File "/home/yimengzh_everyday/miniconda2/envs/pytorch_openblas/lib/python3.5/multiprocessing/process.py", line 249, in _bootstrap
    self.run()
  File "/home/yimengzh_everyday/miniconda2/envs/pytorch_openblas/lib/python3.5/multiprocessing/process.py", line 249, in _bootstrap
    self.run()
  File "/home/yimengzh_everyday/miniconda2/envs/pytorch_openblas/lib/python3.5/multiprocessing/process.py", line 249, in _bootstrap
    self.run()
  File "/home/yimengzh_everyday/miniconda2/envs/pytorch_openblas/lib/python3.5/multiprocessing/process.py", line 93, in run
    self._target(*self._args, **self._kwargs)
  File "/home/yimengzh_everyday/miniconda2/envs/pytorch_openblas/lib/python3.5/multiprocessing/process.py", line 93, in run
    self._target(*self._args, **self._kwargs)
  File "/home/yimengzh_everyday/miniconda2/envs/pytorch_openblas/lib/python3.5/multiprocessing/process.py", line 93, in run
    self._target(*self._args, **self._kwargs)
  File "/home/yimengzh_everyday/miniconda2/envs/pytorch_openblas/lib/python3.5/multiprocessing/process.py", line 249, in _bootstrap
    self.run()
  File "/home/yimengzh_everyday/miniconda2/envs/pytorch_openblas/lib/python3.5/multiprocessing/process.py", line 249, in _bootstrap
    self.run()
  File "/home/yimengzh_everyday/miniconda2/envs/pytorch_openblas/lib/python3.5/site-packages/torch/utils/data/dataloader.py", line 28, in _worker_loop
    r = index_queue.get()
  File "/home/yimengzh_everyday/miniconda2/envs/pytorch_openblas/lib/python3.5/site-packages/torch/utils/data/dataloader.py", line 28, in _worker_loop
    r = index_queue.get()
  File "/home/yimengzh_everyday/miniconda2/envs/pytorch_openblas/lib/python3.5/site-packages/torch/utils/data/dataloader.py", line 28, in _worker_loop
    r = index_queue.get()
  File "/home/yimengzh_everyday/miniconda2/envs/pytorch_openblas/lib/python3.5/multiprocessing/process.py", line 93, in run
    self._target(*self._args, **self._kwargs)
  File "/home/yimengzh_everyday/miniconda2/envs/pytorch_openblas/lib/python3.5/multiprocessing/process.py", line 93, in run
    self._target(*self._args, **self._kwargs)
  File "/home/yimengzh_everyday/miniconda2/envs/pytorch_openblas/lib/python3.5/multiprocessing/queues.py", line 342, in get
    with self._rlock:
  File "/home/yimengzh_everyday/miniconda2/envs/pytorch_openblas/lib/python3.5/multiprocessing/queues.py", line 342, in get
    with self._rlock:
  File "/home/yimengzh_everyday/miniconda2/envs/pytorch_openblas/lib/python3.5/multiprocessing/queues.py", line 342, in get
    with self._rlock:
  File "/home/yimengzh_everyday/miniconda2/envs/pytorch_openblas/lib/python3.5/site-packages/torch/utils/data/dataloader.py", line 28, in _worker_loop
    r = index_queue.get()
  File "/home/yimengzh_everyday/miniconda2/envs/pytorch_openblas/lib/python3.5/multiprocessing/synchronize.py", line 96, in __enter__
    return self._semlock.__enter__()
  File "/home/yimengzh_everyday/miniconda2/envs/pytorch_openblas/lib/python3.5/site-packages/torch/utils/data/dataloader.py", line 28, in _worker_loop
    r = index_queue.get()
  File "/home/yimengzh_everyday/miniconda2/envs/pytorch_openblas/lib/python3.5/multiprocessing/synchronize.py", line 96, in __enter__
    return self._semlock.__enter__()
  File "/home/yimengzh_everyday/miniconda2/envs/pytorch_openblas/lib/python3.5/multiprocessing/synchronize.py", line 96, in __enter__
    return self._semlock.__enter__()
  File "/home/yimengzh_everyday/miniconda2/envs/pytorch_openblas/lib/python3.5/multiprocessing/queues.py", line 342, in get
    with self._rlock:
  File "/home/yimengzh_everyday/miniconda2/envs/pytorch_openblas/lib/python3.5/multiprocessing/queues.py", line 342, in get
    with self._rlock:
  File "/home/yimengzh_everyday/miniconda2/envs/pytorch_openblas/lib/python3.5/multiprocessing/synchronize.py", line 96, in __enter__
    return self._semlock.__enter__()
  File "/home/yimengzh_everyday/miniconda2/envs/pytorch_openblas/lib/python3.5/multiprocessing/synchronize.py", line 96, in __enter__
    return self._semlock.__enter__()
KeyboardInterrupt
KeyboardInterrupt
    idx, batch = self.data_queue.get()
  File "/home/yimengzh_everyday/miniconda2/envs/pytorch_openblas/lib/python3.5/queue.py", line 164, in get
KeyboardInterrupt
Traceback (most recent call last):
  File "/home/yimengzh_everyday/miniconda2/envs/pytorch_openblas/lib/python3.5/multiprocessing/process.py", line 249, in _bootstrap
    self.run()
  File "/home/yimengzh_everyday/miniconda2/envs/pytorch_openblas/lib/python3.5/multiprocessing/process.py", line 93, in run
    self._target(*self._args, **self._kwargs)
  File "/home/yimengzh_everyday/miniconda2/envs/pytorch_openblas/lib/python3.5/site-packages/torch/utils/data/dataloader.py", line 28, in _worker_loop
    r = index_queue.get()
  File "/home/yimengzh_everyday/miniconda2/envs/pytorch_openblas/lib/python3.5/multiprocessing/queues.py", line 342, in get
    with self._rlock:
  File "/home/yimengzh_everyday/miniconda2/envs/pytorch_openblas/lib/python3.5/multiprocessing/synchronize.py", line 96, in __enter__
    return self._semlock.__enter__()
    self.not_empty.wait()
KeyboardInterrupt
  File "/home/yimengzh_everyday/miniconda2/envs/pytorch_openblas/lib/python3.5/threading.py", line 293, in wait
Traceback (most recent call last):
Traceback (most recent call last):
  File "/home/yimengzh_everyday/miniconda2/envs/pytorch_openblas/lib/python3.5/multiprocessing/process.py", line 249, in _bootstrap
    self.run()
  File "/home/yimengzh_everyday/miniconda2/envs/pytorch_openblas/lib/python3.5/multiprocessing/process.py", line 249, in _bootstrap
    self.run()
  File "/home/yimengzh_everyday/miniconda2/envs/pytorch_openblas/lib/python3.5/multiprocessing/process.py", line 93, in run
    self._target(*self._args, **self._kwargs)
  File "/home/yimengzh_everyday/miniconda2/envs/pytorch_openblas/lib/python3.5/multiprocessing/process.py", line 93, in run
    self._target(*self._args, **self._kwargs)
Traceback (most recent call last):
  File "/home/yimengzh_everyday/miniconda2/envs/pytorch_openblas/lib/python3.5/site-packages/torch/utils/data/dataloader.py", line 28, in _worker_loop
    r = index_queue.get()
  File "/home/yimengzh_everyday/miniconda2/envs/pytorch_openblas/lib/python3.5/site-packages/torch/utils/data/dataloader.py", line 28, in _worker_loop
    r = index_queue.get()
  File "/home/yimengzh_everyday/miniconda2/envs/pytorch_openblas/lib/python3.5/multiprocessing/queues.py", line 342, in get
    with self._rlock:
  File "/home/yimengzh_everyday/miniconda2/envs/pytorch_openblas/lib/python3.5/multiprocessing/queues.py", line 342, in get
    with self._rlock:
  File "/home/yimengzh_everyday/miniconda2/envs/pytorch_openblas/lib/python3.5/multiprocessing/synchronize.py", line 96, in __enter__
    return self._semlock.__enter__()
  File "/home/yimengzh_everyday/miniconda2/envs/pytorch_openblas/lib/python3.5/multiprocessing/synchronize.py", line 96, in __enter__
    return self._semlock.__enter__()
  File "/home/yimengzh_everyday/miniconda2/envs/pytorch_openblas/lib/python3.5/multiprocessing/process.py", line 249, in _bootstrap
    self.run()
  File "/home/yimengzh_everyday/miniconda2/envs/pytorch_openblas/lib/python3.5/multiprocessing/process.py", line 93, in run
    self._target(*self._args, **self._kwargs)
  File "/home/yimengzh_everyday/miniconda2/envs/pytorch_openblas/lib/python3.5/site-packages/torch/utils/data/dataloader.py", line 28, in _worker_loop
    r = index_queue.get()
  File "/home/yimengzh_everyday/miniconda2/envs/pytorch_openblas/lib/python3.5/multiprocessing/queues.py", line 342, in get
    with self._rlock:
  File "/home/yimengzh_everyday/miniconda2/envs/pytorch_openblas/lib/python3.5/multiprocessing/synchronize.py", line 96, in __enter__
    return self._semlock.__enter__()
KeyboardInterrupt
KeyboardInterrupt
KeyboardInterrupt
Traceback (most recent call last):
  File "/home/yimengzh_everyday/miniconda2/envs/pytorch_openblas/lib/python3.5/multiprocessing/process.py", line 249, in _bootstrap
    self.run()
  File "/home/yimengzh_everyday/miniconda2/envs/pytorch_openblas/lib/python3.5/multiprocessing/process.py", line 93, in run
    self._target(*self._args, **self._kwargs)
Traceback (most recent call last):
  File "/home/yimengzh_everyday/miniconda2/envs/pytorch_openblas/lib/python3.5/site-packages/torch/utils/data/dataloader.py", line 28, in _worker_loop
    r = index_queue.get()
  File "/home/yimengzh_everyday/miniconda2/envs/pytorch_openblas/lib/python3.5/multiprocessing/queues.py", line 342, in get
    with self._rlock:
  File "/home/yimengzh_everyday/miniconda2/envs/pytorch_openblas/lib/python3.5/multiprocessing/synchronize.py", line 96, in __enter__
    return self._semlock.__enter__()
  File "/home/yimengzh_everyday/miniconda2/envs/pytorch_openblas/lib/python3.5/multiprocessing/process.py", line 249, in _bootstrap
    self.run()
KeyboardInterrupt
  File "/home/yimengzh_everyday/miniconda2/envs/pytorch_openblas/lib/python3.5/multiprocessing/process.py", line 93, in run
    self._target(*self._args, **self._kwargs)
  File "/home/yimengzh_everyday/miniconda2/envs/pytorch_openblas/lib/python3.5/site-packages/torch/utils/data/dataloader.py", line 28, in _worker_loop
    r = index_queue.get()
  File "/home/yimengzh_everyday/miniconda2/envs/pytorch_openblas/lib/python3.5/multiprocessing/queues.py", line 342, in get
    with self._rlock:
  File "/home/yimengzh_everyday/miniconda2/envs/pytorch_openblas/lib/python3.5/multiprocessing/synchronize.py", line 96, in __enter__
    return self._semlock.__enter__()
KeyboardInterrupt
Traceback (most recent call last):
  File "/home/yimengzh_everyday/miniconda2/envs/pytorch_openblas/lib/python3.5/multiprocessing/process.py", line 249, in _bootstrap
    self.run()
  File "/home/yimengzh_everyday/miniconda2/envs/pytorch_openblas/lib/python3.5/multiprocessing/process.py", line 93, in run
    self._target(*self._args, **self._kwargs)
Traceback (most recent call last):
Traceback (most recent call last):
  File "/home/yimengzh_everyday/miniconda2/envs/pytorch_openblas/lib/python3.5/site-packages/torch/utils/data/dataloader.py", line 28, in _worker_loop
    r = index_queue.get()
  File "/home/yimengzh_everyday/miniconda2/envs/pytorch_openblas/lib/python3.5/multiprocessing/queues.py", line 342, in get
    with self._rlock:
  File "/home/yimengzh_everyday/miniconda2/envs/pytorch_openblas/lib/python3.5/multiprocessing/synchronize.py", line 96, in __enter__
    return self._semlock.__enter__()
  File "/home/yimengzh_everyday/miniconda2/envs/pytorch_openblas/lib/python3.5/multiprocessing/process.py", line 249, in _bootstrap
    self.run()
  File "/home/yimengzh_everyday/miniconda2/envs/pytorch_openblas/lib/python3.5/multiprocessing/process.py", line 249, in _bootstrap
    self.run()
KeyboardInterrupt
    waiter.acquire()
  File "/home/yimengzh_everyday/miniconda2/envs/pytorch_openblas/lib/python3.5/multiprocessing/process.py", line 93, in run
    self._target(*self._args, **self._kwargs)
  File "/home/yimengzh_everyday/miniconda2/envs/pytorch_openblas/lib/python3.5/multiprocessing/process.py", line 93, in run
    self._target(*self._args, **self._kwargs)
  File "/home/yimengzh_everyday/miniconda2/envs/pytorch_openblas/lib/python3.5/site-packages/torch/utils/data/dataloader.py", line 28, in _worker_loop
    r = index_queue.get()
  File "/home/yimengzh_everyday/miniconda2/envs/pytorch_openblas/lib/python3.5/site-packages/torch/utils/data/dataloader.py", line 28, in _worker_loop
    r = index_queue.get()
  File "/home/yimengzh_everyday/miniconda2/envs/pytorch_openblas/lib/python3.5/multiprocessing/queues.py", line 342, in get
    with self._rlock:
KeyboardInterrupt
  File "/home/yimengzh_everyday/miniconda2/envs/pytorch_openblas/lib/python3.5/multiprocessing/queues.py", line 342, in get
    with self._rlock:
  File "/home/yimengzh_everyday/miniconda2/envs/pytorch_openblas/lib/python3.5/multiprocessing/synchronize.py", line 96, in __enter__
    return self._semlock.__enter__()
  File "/home/yimengzh_everyday/miniconda2/envs/pytorch_openblas/lib/python3.5/multiprocessing/synchronize.py", line 96, in __enter__
    return self._semlock.__enter__()
KeyboardInterrupt
KeyboardInterrupt
Traceback (most recent call last):
  File "/home/yimengzh_everyday/miniconda2/envs/pytorch_openblas/lib/python3.5/multiprocessing/process.py", line 249, in _bootstrap
    self.run()
  File "/home/yimengzh_everyday/miniconda2/envs/pytorch_openblas/lib/python3.5/multiprocessing/process.py", line 93, in run
    self._target(*self._args, **self._kwargs)
  File "/home/yimengzh_everyday/miniconda2/envs/pytorch_openblas/lib/python3.5/site-packages/torch/utils/data/dataloader.py", line 28, in _worker_loop
    r = index_queue.get()
  File "/home/yimengzh_everyday/miniconda2/envs/pytorch_openblas/lib/python3.5/multiprocessing/queues.py", line 342, in get
    with self._rlock:
  File "/home/yimengzh_everyday/miniconda2/envs/pytorch_openblas/lib/python3.5/multiprocessing/synchronize.py", line 96, in __enter__
    return self._semlock.__enter__()
Traceback (most recent call last):
KeyboardInterrupt
  File "/home/yimengzh_everyday/miniconda2/envs/pytorch_openblas/lib/python3.5/multiprocessing/process.py", line 249, in _bootstrap
    self.run()
  File "/home/yimengzh_everyday/miniconda2/envs/pytorch_openblas/lib/python3.5/multiprocessing/process.py", line 93, in run
    self._target(*self._args, **self._kwargs)
  File "/home/yimengzh_everyday/miniconda2/envs/pytorch_openblas/lib/python3.5/site-packages/torch/utils/data/dataloader.py", line 28, in _worker_loop
    r = index_queue.get()
  File "/home/yimengzh_everyday/miniconda2/envs/pytorch_openblas/lib/python3.5/multiprocessing/queues.py", line 343, in get
    res = self._reader.recv_bytes()
  File "/home/yimengzh_everyday/miniconda2/envs/pytorch_openblas/lib/python3.5/multiprocessing/connection.py", line 216, in recv_bytes
    buf = self._recv_bytes(maxlength)
  File "/home/yimengzh_everyday/miniconda2/envs/pytorch_openblas/lib/python3.5/multiprocessing/connection.py", line 407, in _recv_bytes
    buf = self._recv(4)
  File "/home/yimengzh_everyday/miniconda2/envs/pytorch_openblas/lib/python3.5/multiprocessing/connection.py", line 379, in _recv
    chunk = read(handle, remaining)
KeyboardInterrupt
Traceback (most recent call last):
  File "/home/yimengzh_everyday/miniconda2/envs/pytorch_openblas/lib/python3.5/multiprocessing/process.py", line 249, in _bootstrap
    self.run()
  File "/home/yimengzh_everyday/miniconda2/envs/pytorch_openblas/lib/python3.5/multiprocessing/process.py", line 93, in run
    self._target(*self._args, **self._kwargs)
  File "/home/yimengzh_everyday/miniconda2/envs/pytorch_openblas/lib/python3.5/site-packages/torch/utils/data/dataloader.py", line 28, in _worker_loop
    r = index_queue.get()
  File "/home/yimengzh_everyday/miniconda2/envs/pytorch_openblas/lib/python3.5/multiprocessing/queues.py", line 342, in get
    with self._rlock:
  File "/home/yimengzh_everyday/miniconda2/envs/pytorch_openblas/lib/python3.5/multiprocessing/synchronize.py", line 96, in __enter__
    return self._semlock.__enter__()
KeyboardInterrupt

zym1010 avatar Apr 22 '17 20:04 zym1010

Similar issues happen on a CentOS 6 machine.


Test: [0/196]   Time 5.107 (5.107)      Loss 5.4296 (5.4296)    Prec@1 5.469 (5.469)    Prec@5 20.703 (20.703)
^CTraceback (most recent call last):
  File "main.py", line 292, in <module>
Process Process-35:
Process Process-38:
Process Process-40:
Process Process-33:
Process Process-39:
Process Process-34:
Process Process-21:
Process Process-36:
Process Process-22:
Process Process-27:
Process Process-30:
Process Process-29:
Process Process-31:
Process Process-26:
Process Process-37:
Process Process-28:
Process Process-32:
    main()
  File "main.py", line 137, in main
    prec1 = validate(val_loader, model, criterion)
  File "main.py", line 210, in validate
    for i, (input, target) in enumerate(val_loader):
  File "/home/yimengzh/miniconda2/envs/pytorch/lib/python3.5/site-packages/torch/utils/data/dataloader.py", line 168, in __next__
    idx, batch = self.data_queue.get()
  File "/home/yimengzh/miniconda2/envs/pytorch/lib/python3.5/queue.py", line 164, in get
    self.not_empty.wait()
  File "/home/yimengzh/miniconda2/envs/pytorch/lib/python3.5/threading.py", line 293, in wait
Traceback (most recent call last):
Traceback (most recent call last):
Traceback (most recent call last):
  File "/home/yimengzh/miniconda2/envs/pytorch/lib/python3.5/multiprocessing/process.py", line 249, in _bootstrap
    self.run()
  File "/home/yimengzh/miniconda2/envs/pytorch/lib/python3.5/multiprocessing/process.py", line 93, in run
    self._target(*self._args, **self._kwargs)
Traceback (most recent call last):
  File "/home/yimengzh/miniconda2/envs/pytorch/lib/python3.5/site-packages/torch/utils/data/dataloader.py", line 28, in _worker_loop
    r = index_queue.get()
  File "/home/yimengzh/miniconda2/envs/pytorch/lib/python3.5/multiprocessing/queues.py", line 342, in get
    with self._rlock:
  File "/home/yimengzh/miniconda2/envs/pytorch/lib/python3.5/multiprocessing/process.py", line 249, in _bootstrap
    self.run()
  File "/home/yimengzh/miniconda2/envs/pytorch/lib/python3.5/multiprocessing/process.py", line 249, in _bootstrap
    self.run()
  File "/home/yimengzh/miniconda2/envs/pytorch/lib/python3.5/multiprocessing/synchronize.py", line 96, in __enter__
    return self._semlock.__enter__()
  File "/home/yimengzh/miniconda2/envs/pytorch/lib/python3.5/multiprocessing/process.py", line 93, in run
    self._target(*self._args, **self._kwargs)
Traceback (most recent call last):
  File "/home/yimengzh/miniconda2/envs/pytorch/lib/python3.5/site-packages/torch/utils/data/dataloader.py", line 28, in _worker_loop
    r = index_queue.get()
  File "/home/yimengzh/miniconda2/envs/pytorch/lib/python3.5/multiprocessing/process.py", line 93, in run
    self._target(*self._args, **self._kwargs)
  File "/home/yimengzh/miniconda2/envs/pytorch/lib/python3.5/multiprocessing/queues.py", line 342, in get
    with self._rlock:
Traceback (most recent call last):
Traceback (most recent call last):
Traceback (most recent call last):
Traceback (most recent call last):
  File "/home/yimengzh/miniconda2/envs/pytorch/lib/python3.5/multiprocessing/process.py", line 249, in _bootstrap
    self.run()
  File "/home/yimengzh/miniconda2/envs/pytorch/lib/python3.5/multiprocessing/synchronize.py", line 96, in __enter__
    return self._semlock.__enter__()
KeyboardInterrupt
  File "/home/yimengzh/miniconda2/envs/pytorch/lib/python3.5/site-packages/torch/utils/data/dataloader.py", line 28, in _worker_loop
    r = index_queue.get()
Traceback (most recent call last):
  File "/home/yimengzh/miniconda2/envs/pytorch/lib/python3.5/multiprocessing/process.py", line 249, in _bootstrap
    self.run()
Traceback (most recent call last):
  File "/home/yimengzh/miniconda2/envs/pytorch/lib/python3.5/multiprocessing/process.py", line 93, in run
    self._target(*self._args, **self._kwargs)
  File "/home/yimengzh/miniconda2/envs/pytorch/lib/python3.5/multiprocessing/queues.py", line 342, in get
    with self._rlock:
  File "/home/yimengzh/miniconda2/envs/pytorch/lib/python3.5/multiprocessing/process.py", line 249, in _bootstrap
    self.run()
Traceback (most recent call last):
KeyboardInterrupt
  File "/home/yimengzh/miniconda2/envs/pytorch/lib/python3.5/multiprocessing/process.py", line 93, in run
    self._target(*self._args, **self._kwargs)
  File "/home/yimengzh/miniconda2/envs/pytorch/lib/python3.5/multiprocessing/process.py", line 249, in _bootstrap
    self.run()
  File "/home/yimengzh/miniconda2/envs/pytorch/lib/python3.5/multiprocessing/process.py", line 249, in _bootstrap
    self.run()
  File "/home/yimengzh/miniconda2/envs/pytorch/lib/python3.5/site-packages/torch/utils/data/dataloader.py", line 28, in _worker_loop
    r = index_queue.get()
  File "/home/yimengzh/miniconda2/envs/pytorch/lib/python3.5/multiprocessing/process.py", line 93, in run
    self._target(*self._args, **self._kwargs)
  File "/home/yimengzh/miniconda2/envs/pytorch/lib/python3.5/multiprocessing/synchronize.py", line 96, in __enter__
    return self._semlock.__enter__()
  File "/home/yimengzh/miniconda2/envs/pytorch/lib/python3.5/multiprocessing/process.py", line 249, in _bootstrap
    self.run()
  File "/home/yimengzh/miniconda2/envs/pytorch/lib/python3.5/multiprocessing/process.py", line 249, in _bootstrap
    self.run()
  File "/home/yimengzh/miniconda2/envs/pytorch/lib/python3.5/multiprocessing/process.py", line 249, in _bootstrap
    self.run()
  File "/home/yimengzh/miniconda2/envs/pytorch/lib/python3.5/site-packages/torch/utils/data/dataloader.py", line 28, in _worker_loop
    r = index_queue.get()
  File "/home/yimengzh/miniconda2/envs/pytorch/lib/python3.5/multiprocessing/process.py", line 93, in run
    self._target(*self._args, **self._kwargs)
Traceback (most recent call last):
  File "/home/yimengzh/miniconda2/envs/pytorch/lib/python3.5/multiprocessing/process.py", line 93, in run
    self._target(*self._args, **self._kwargs)
  File "/home/yimengzh/miniconda2/envs/pytorch/lib/python3.5/multiprocessing/queues.py", line 342, in get
    with self._rlock:
  File "/home/yimengzh/miniconda2/envs/pytorch/lib/python3.5/site-packages/torch/utils/data/dataloader.py", line 28, in _worker_loop
    r = index_queue.get()
  File "/home/yimengzh/miniconda2/envs/pytorch/lib/python3.5/site-packages/torch/utils/data/dataloader.py", line 28, in _worker_loop
    r = index_queue.get()
Traceback (most recent call last):
  File "/home/yimengzh/miniconda2/envs/pytorch/lib/python3.5/multiprocessing/process.py", line 93, in run
    self._target(*self._args, **self._kwargs)
  File "/home/yimengzh/miniconda2/envs/pytorch/lib/python3.5/multiprocessing/process.py", line 93, in run
    self._target(*self._args, **self._kwargs)
  File "/home/yimengzh/miniconda2/envs/pytorch/lib/python3.5/multiprocessing/process.py", line 249, in _bootstrap
    self.run()
Traceback (most recent call last):
  File "/home/yimengzh/miniconda2/envs/pytorch/lib/python3.5/site-packages/torch/utils/data/dataloader.py", line 28, in _worker_loop
    r = index_queue.get()
  File "/home/yimengzh/miniconda2/envs/pytorch/lib/python3.5/multiprocessing/process.py", line 93, in run
    self._target(*self._args, **self._kwargs)
  File "/home/yimengzh/miniconda2/envs/pytorch/lib/python3.5/multiprocessing/queues.py", line 342, in get
    with self._rlock:
  File "/home/yimengzh/miniconda2/envs/pytorch/lib/python3.5/multiprocessing/queues.py", line 342, in get
    with self._rlock:
  File "/home/yimengzh/miniconda2/envs/pytorch/lib/python3.5/multiprocessing/synchronize.py", line 96, in __enter__
    return self._semlock.__enter__()
  File "/home/yimengzh/miniconda2/envs/pytorch/lib/python3.5/multiprocessing/process.py", line 93, in run
    self._target(*self._args, **self._kwargs)
KeyboardInterrupt
  File "/home/yimengzh/miniconda2/envs/pytorch/lib/python3.5/site-packages/torch/utils/data/dataloader.py", line 28, in _worker_loop
    r = index_queue.get()
  File "/home/yimengzh/miniconda2/envs/pytorch/lib/python3.5/site-packages/torch/utils/data/dataloader.py", line 28, in _worker_loop
    r = index_queue.get()
  File "/home/yimengzh/miniconda2/envs/pytorch/lib/python3.5/multiprocessing/synchronize.py", line 96, in __enter__
    return self._semlock.__enter__()
  File "/home/yimengzh/miniconda2/envs/pytorch/lib/python3.5/site-packages/torch/utils/data/dataloader.py", line 28, in _worker_loop
    r = index_queue.get()
  File "/home/yimengzh/miniconda2/envs/pytorch/lib/python3.5/multiprocessing/queues.py", line 342, in get
    with self._rlock:
KeyboardInterrupt
  File "/home/yimengzh/miniconda2/envs/pytorch/lib/python3.5/multiprocessing/synchronize.py", line 96, in __enter__
    return self._semlock.__enter__()
  File "/home/yimengzh/miniconda2/envs/pytorch/lib/python3.5/multiprocessing/synchronize.py", line 96, in __enter__
    return self._semlock.__enter__()
  File "/home/yimengzh/miniconda2/envs/pytorch/lib/python3.5/multiprocessing/queues.py", line 342, in get
    with self._rlock:
  File "/home/yimengzh/miniconda2/envs/pytorch/lib/python3.5/multiprocessing/queues.py", line 342, in get
    with self._rlock:
  File "/home/yimengzh/miniconda2/envs/pytorch/lib/python3.5/multiprocessing/queues.py", line 342, in get
    with self._rlock:
  File "/home/yimengzh/miniconda2/envs/pytorch/lib/python3.5/multiprocessing/process.py", line 249, in _bootstrap
    self.run()
  File "/home/yimengzh/miniconda2/envs/pytorch/lib/python3.5/multiprocessing/queues.py", line 342, in get
    with self._rlock:
  File "/home/yimengzh/miniconda2/envs/pytorch/lib/python3.5/multiprocessing/synchronize.py", line 96, in __enter__
    return self._semlock.__enter__()
  File "/home/yimengzh/miniconda2/envs/pytorch/lib/python3.5/multiprocessing/synchronize.py", line 96, in __enter__
    return self._semlock.__enter__()
KeyboardInterrupt
  File "/home/yimengzh/miniconda2/envs/pytorch/lib/python3.5/multiprocessing/process.py", line 93, in run
    self._target(*self._args, **self._kwargs)
  File "/home/yimengzh/miniconda2/envs/pytorch/lib/python3.5/site-packages/torch/utils/data/dataloader.py", line 28, in _worker_loop
    r = index_queue.get()
KeyboardInterrupt
KeyboardInterrupt
  File "/home/yimengzh/miniconda2/envs/pytorch/lib/python3.5/multiprocessing/synchronize.py", line 96, in __enter__
    return self._semlock.__enter__()
  File "/home/yimengzh/miniconda2/envs/pytorch/lib/python3.5/site-packages/torch/utils/data/dataloader.py", line 28, in _worker_loop
    r = index_queue.get()
  File "/home/yimengzh/miniconda2/envs/pytorch/lib/python3.5/multiprocessing/process.py", line 249, in _bootstrap
    self.run()
  File "/home/yimengzh/miniconda2/envs/pytorch/lib/python3.5/multiprocessing/process.py", line 249, in _bootstrap
    self.run()
Traceback (most recent call last):
    waiter.acquire()
  File "/home/yimengzh/miniconda2/envs/pytorch/lib/python3.5/multiprocessing/synchronize.py", line 96, in __enter__
    return self._semlock.__enter__()
  File "/home/yimengzh/miniconda2/envs/pytorch/lib/python3.5/multiprocessing/queues.py", line 342, in get
    with self._rlock:
  File "/home/yimengzh/miniconda2/envs/pytorch/lib/python3.5/multiprocessing/process.py", line 93, in run
    self._target(*self._args, **self._kwargs)
  File "/home/yimengzh/miniconda2/envs/pytorch/lib/python3.5/multiprocessing/process.py", line 93, in run
    self._target(*self._args, **self._kwargs)
KeyboardInterrupt
  File "/home/yimengzh/miniconda2/envs/pytorch/lib/python3.5/site-packages/torch/utils/data/dataloader.py", line 28, in _worker_loop
    r = index_queue.get()
  File "/home/yimengzh/miniconda2/envs/pytorch/lib/python3.5/multiprocessing/synchronize.py", line 96, in __enter__
    return self._semlock.__enter__()
  File "/home/yimengzh/miniconda2/envs/pytorch/lib/python3.5/site-packages/torch/utils/data/dataloader.py", line 28, in _worker_loop
    r = index_queue.get()
KeyboardInterrupt
  File "/home/yimengzh/miniconda2/envs/pytorch/lib/python3.5/multiprocessing/queues.py", line 342, in get
    with self._rlock:
  File "/home/yimengzh/miniconda2/envs/pytorch/lib/python3.5/multiprocessing/queues.py", line 342, in get
    with self._rlock:
  File "/home/yimengzh/miniconda2/envs/pytorch/lib/python3.5/multiprocessing/synchronize.py", line 96, in __enter__
    return self._semlock.__enter__()
  File "/home/yimengzh/miniconda2/envs/pytorch/lib/python3.5/multiprocessing/synchronize.py", line 96, in __enter__
    return self._semlock.__enter__()
KeyboardInterrupt
KeyboardInterrupt
KeyboardInterrupt
KeyboardInterrupt

zym1010 avatar Apr 22 '17 21:04 zym1010

Another run on CentOS 6 gave the following.

Epoch: [0][5000/5005]   Time 0.157 (0.645)      Data 0.000 (0.483)      Loss 5.8995 (6.6278)    Prec@1 3.906 (0.611)    Prec@5 8.984 (2.423)
C^CTraceback (most recent call last):
  File "main.py", line 292, in <module>
Process Process-26:
Process Process-31:
Process Process-25:
Process Process-33:
Process Process-28:
Process Process-38:
Process Process-36:
Process Process-30:
Process Process-34:
Process Process-24:
Process Process-37:
Process Process-40:
Process Process-32:
Process Process-35:
Process Process-29:
Process Process-27:
Process Process-39:
    main()
  File "main.py", line 137, in main
    prec1 = validate(val_loader, model, criterion)
  File "main.py", line 210, in validate
    for i, (input, target) in enumerate(val_loader):
  File "/home/yimengzh/miniconda2/envs/pytorch/lib/python3.5/site-packages/torch/utils/data/dataloader.py", line 168, in __next__
    idx, batch = self.data_queue.get()
  File "/home/yimengzh/miniconda2/envs/pytorch/lib/python3.5/queue.py", line 164, in get
    self.not_empty.wait()
  File "/home/yimengzh/miniconda2/envs/pytorch/lib/python3.5/threading.py", line 293, in wait
    waiter.acquire()
KeyboardInterrupt
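
In the traceback above, the main process is blocked in `self.data_queue.get()`, which is called without a timeout, so a stalled loader shows up only as a silent hang. Below is a minimal sketch of the general idea of bounding that wait. It is not the DataLoader implementation: `get_with_timeout` and the stand-in `data_queue` are hypothetical names used only for illustration.

```python
import queue


def get_with_timeout(data_queue, timeout_s):
    """Fetch one item, raising instead of blocking forever if none arrives."""
    try:
        return data_queue.get(timeout=timeout_s)
    except queue.Empty:
        raise RuntimeError(
            "no batch arrived within {} s; workers may be stalled".format(timeout_s)
        )


# Stand-in for the loader's result queue (hypothetical; not the real DataLoader).
data_queue = queue.Queue()
data_queue.put((0, "batch-0"))

# An item is already queued, so this returns immediately.
first = get_with_timeout(data_queue, timeout_s=0.1)

# The queue is now empty, so the bounded wait raises instead of hanging.
try:
    get_with_timeout(data_queue, timeout_s=0.1)
    timed_out = False
except RuntimeError:
    timed_out = True
```

With a bounded wait, a stall like this one would surface as an error after a fixed delay instead of sitting idle for hours.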

zym1010 · Apr 22 '17 22:04

Another run on Ubuntu.

Epoch: [0][4980/5005]   Time 0.340 (0.338)      Data 0.000 (0.002)      Loss 6.0021 (6.6848)    Prec@1 2.734 (0.486)    Prec@5 7.031 (2.024)
Epoch: [0][5000/5005]   Time 0.335 (0.338)      Data 0.000 (0.002)      Loss 5.9103 (6.6820)    Prec@1 2.734 (0.493)    Prec@5 9.375 (2.046)
^B1^CProcess Process-44:
Process Process-43:
Process Process-42:
Process Process-41:
Process Process-39:
Process Process-37:
Process Process-28:
Process Process-34:
Process Process-40:
Process Process-32:
Process Process-35:
Process Process-38:
Process Process-33:
Process Process-31:
Traceback (most recent call last):
Process Process-24:
  File "main.py", line 289, in <module>
Process Process-30:
Process Process-36:
    main()
  File "main.py", line 134, in main
    prec1 = validate(val_loader, model, criterion)
  File "main.py", line 207, in validate
Process Process-25:
    for i, (input, target) in enumerate(val_loader):
  File "/home/yimengzh_everyday/miniconda2/envs/pytorch_openblas/lib/python3.5/site-packages/torch/utils/data/dataloader.py", line 168, in __next__
Process Process-29:
Process Process-26:
Process Process-27:
    idx, batch = self.data_queue.get()
  File "/home/yimengzh_everyday/miniconda2/envs/pytorch_openblas/lib/python3.5/queue.py", line 164, in get
    self.not_empty.wait()
  File "/home/yimengzh_everyday/miniconda2/envs/pytorch_openblas/lib/python3.5/threading.py", line 293, in wait
Traceback (most recent call last):
Traceback (most recent call last):
Traceback (most recent call last):
Traceback (most recent call last):
Traceback (most recent call last):
  File "/home/yimengzh_everyday/miniconda2/envs/pytorch_openblas/lib/python3.5/multiprocessing/process.py", line 249, in _bootstrap
    self.run()
  File "/home/yimengzh_everyday/miniconda2/envs/pytorch_openblas/lib/python3.5/multiprocessing/process.py", line 249, in _bootstrap
    self.run()
  File "/home/yimengzh_everyday/miniconda2/envs/pytorch_openblas/lib/python3.5/multiprocessing/process.py", line 249, in _bootstrap
    self.run()
  File "/home/yimengzh_everyday/miniconda2/envs/pytorch_openblas/lib/python3.5/multiprocessing/process.py", line 93, in run
    self._target(*self._args, **self._kwargs)
  File "/home/yimengzh_everyday/miniconda2/envs/pytorch_openblas/lib/python3.5/multiprocessing/process.py", line 93, in run
    self._target(*self._args, **self._kwargs)
  File "/home/yimengzh_everyday/miniconda2/envs/pytorch_openblas/lib/python3.5/site-packages/torch/utils/data/dataloader.py", line 28, in _worker_loop
    r = index_queue.get()
  File "/home/yimengzh_everyday/miniconda2/envs/pytorch_openblas/lib/python3.5/multiprocessing/process.py", line 93, in run
    self._target(*self._args, **self._kwargs)
  File "/home/yimengzh_everyday/miniconda2/envs/pytorch_openblas/lib/python3.5/multiprocessing/process.py", line 249, in _bootstrap
    self.run()
  File "/home/yimengzh_everyday/miniconda2/envs/pytorch_openblas/lib/python3.5/multiprocessing/queues.py", line 342, in get
    with self._rlock:
  File "/home/yimengzh_everyday/miniconda2/envs/pytorch_openblas/lib/python3.5/site-packages/torch/utils/data/dataloader.py", line 28, in _worker_loop
    r = index_queue.get()
  File "/home/yimengzh_everyday/miniconda2/envs/pytorch_openblas/lib/python3.5/multiprocessing/process.py", line 249, in _bootstrap
    self.run()
Traceback (most recent call last):
Traceback (most recent call last):
  File "/home/yimengzh_everyday/miniconda2/envs/pytorch_openblas/lib/python3.5/multiprocessing/process.py", line 93, in run
    self._target(*self._args, **self._kwargs)
  File "/home/yimengzh_everyday/miniconda2/envs/pytorch_openblas/lib/python3.5/site-packages/torch/utils/data/dataloader.py", line 28, in _worker_loop
    r = index_queue.get()
  File "/home/yimengzh_everyday/miniconda2/envs/pytorch_openblas/lib/python3.5/multiprocessing/synchronize.py", line 96, in __enter__
    return self._semlock.__enter__()
  File "/home/yimengzh_everyday/miniconda2/envs/pytorch_openblas/lib/python3.5/multiprocessing/queues.py", line 342, in get
    with self._rlock:
  File "/home/yimengzh_everyday/miniconda2/envs/pytorch_openblas/lib/python3.5/multiprocessing/process.py", line 93, in run
    self._target(*self._args, **self._kwargs)
  File "/home/yimengzh_everyday/miniconda2/envs/pytorch_openblas/lib/python3.5/site-packages/torch/utils/data/dataloader.py", line 28, in _worker_loop
    r = index_queue.get()
  File "/home/yimengzh_everyday/miniconda2/envs/pytorch_openblas/lib/python3.5/multiprocessing/queues.py", line 342, in get
    with self._rlock:
  File "/home/yimengzh_everyday/miniconda2/envs/pytorch_openblas/lib/python3.5/multiprocessing/synchronize.py", line 96, in __enter__
    return self._semlock.__enter__()
  File "/home/yimengzh_everyday/miniconda2/envs/pytorch_openblas/lib/python3.5/multiprocessing/queues.py", line 342, in get
    with self._rlock:
  File "/home/yimengzh_everyday/miniconda2/envs/pytorch_openblas/lib/python3.5/site-packages/torch/utils/data/dataloader.py", line 28, in _worker_loop
    r = index_queue.get()
  File "/home/yimengzh_everyday/miniconda2/envs/pytorch_openblas/lib/python3.5/multiprocessing/synchronize.py", line 96, in __enter__
    return self._semlock.__enter__()
  File "/home/yimengzh_everyday/miniconda2/envs/pytorch_openblas/lib/python3.5/multiprocessing/synchronize.py", line 96, in __enter__
    return self._semlock.__enter__()
KeyboardInterrupt
  File "/home/yimengzh_everyday/miniconda2/envs/pytorch_openblas/lib/python3.5/multiprocessing/queues.py", line 342, in get
    with self._rlock:
  File "/home/yimengzh_everyday/miniconda2/envs/pytorch_openblas/lib/python3.5/multiprocessing/process.py", line 249, in _bootstrap
    self.run()
  File "/home/yimengzh_everyday/miniconda2/envs/pytorch_openblas/lib/python3.5/multiprocessing/process.py", line 249, in _bootstrap
    self.run()
  File "/home/yimengzh_everyday/miniconda2/envs/pytorch_openblas/lib/python3.5/multiprocessing/synchronize.py", line 96, in __enter__
    return self._semlock.__enter__()
  File "/home/yimengzh_everyday/miniconda2/envs/pytorch_openblas/lib/python3.5/multiprocessing/process.py", line 93, in run
    self._target(*self._args, **self._kwargs)
  File "/home/yimengzh_everyday/miniconda2/envs/pytorch_openblas/lib/python3.5/multiprocessing/process.py", line 93, in run
    self._target(*self._args, **self._kwargs)
  File "/home/yimengzh_everyday/miniconda2/envs/pytorch_openblas/lib/python3.5/site-packages/torch/utils/data/dataloader.py", line 28, in _worker_loop
    r = index_queue.get()
  File "/home/yimengzh_everyday/miniconda2/envs/pytorch_openblas/lib/python3.5/site-packages/torch/utils/data/dataloader.py", line 28, in _worker_loop
    r = index_queue.get()
  File "/home/yimengzh_everyday/miniconda2/envs/pytorch_openblas/lib/python3.5/multiprocessing/queues.py", line 342, in get
    with self._rlock:
  File "/home/yimengzh_everyday/miniconda2/envs/pytorch_openblas/lib/python3.5/multiprocessing/queues.py", line 342, in get
    with self._rlock:
  File "/home/yimengzh_everyday/miniconda2/envs/pytorch_openblas/lib/python3.5/multiprocessing/synchronize.py", line 96, in __enter__
    return self._semlock.__enter__()
  File "/home/yimengzh_everyday/miniconda2/envs/pytorch_openblas/lib/python3.5/multiprocessing/synchronize.py", line 96, in __enter__
    return self._semlock.__enter__()
KeyboardInterrupt
KeyboardInterrupt
KeyboardInterrupt
KeyboardInterrupt
KeyboardInterrupt
Traceback (most recent call last):
  File "/home/yimengzh_everyday/miniconda2/envs/pytorch_openblas/lib/python3.5/multiprocessing/process.py", line 249, in _bootstrap
    self.run()
  File "/home/yimengzh_everyday/miniconda2/envs/pytorch_openblas/lib/python3.5/multiprocessing/process.py", line 93, in run
    self._target(*self._args, **self._kwargs)
  File "/home/yimengzh_everyday/miniconda2/envs/pytorch_openblas/lib/python3.5/site-packages/torch/utils/data/dataloader.py", line 28, in _worker_loop
    r = index_queue.get()
  File "/home/yimengzh_everyday/miniconda2/envs/pytorch_openblas/lib/python3.5/multiprocessing/queues.py", line 342, in get
    with self._rlock:
  File "/home/yimengzh_everyday/miniconda2/envs/pytorch_openblas/lib/python3.5/multiprocessing/synchronize.py", line 96, in __enter__
    return self._semlock.__enter__()
KeyboardInterrupt
Traceback (most recent call last):
  File "/home/yimengzh_everyday/miniconda2/envs/pytorch_openblas/lib/python3.5/multiprocessing/process.py", line 249, in _bootstrap
    self.run()
  File "/home/yimengzh_everyday/miniconda2/envs/pytorch_openblas/lib/python3.5/multiprocessing/process.py", line 93, in run
    self._target(*self._args, **self._kwargs)
  File "/home/yimengzh_everyday/miniconda2/envs/pytorch_openblas/lib/python3.5/site-packages/torch/utils/data/dataloader.py", line 28, in _worker_loop
    r = index_queue.get()
  File "/home/yimengzh_everyday/miniconda2/envs/pytorch_openblas/lib/python3.5/multiprocessing/queues.py", line 342, in get
    with self._rlock:
  File "/home/yimengzh_everyday/miniconda2/envs/pytorch_openblas/lib/python3.5/multiprocessing/synchronize.py", line 96, in __enter__
    return self._semlock.__enter__()
KeyboardInterrupt
    waiter.acquire()
KeyboardInterrupt
Traceback (most recent call last):
  File "/home/yimengzh_everyday/miniconda2/envs/pytorch_openblas/lib/python3.5/multiprocessing/process.py", line 249, in _bootstrap
    self.run()
  File "/home/yimengzh_everyday/miniconda2/envs/pytorch_openblas/lib/python3.5/multiprocessing/process.py", line 93, in run
    self._target(*self._args, **self._kwargs)
  File "/home/yimengzh_everyday/miniconda2/envs/pytorch_openblas/lib/python3.5/site-packages/torch/utils/data/dataloader.py", line 28, in _worker_loop
    r = index_queue.get()
  File "/home/yimengzh_everyday/miniconda2/envs/pytorch_openblas/lib/python3.5/multiprocessing/queues.py", line 343, in get
    res = self._reader.recv_bytes()
  File "/home/yimengzh_everyday/miniconda2/envs/pytorch_openblas/lib/python3.5/multiprocessing/connection.py", line 216, in recv_bytes
    buf = self._recv_bytes(maxlength)
  File "/home/yimengzh_everyday/miniconda2/envs/pytorch_openblas/lib/python3.5/multiprocessing/connection.py", line 407, in _recv_bytes
    buf = self._recv(4)
  File "/home/yimengzh_everyday/miniconda2/envs/pytorch_openblas/lib/python3.5/multiprocessing/connection.py", line 379, in _recv
    chunk = read(handle, remaining)
KeyboardInterrupt

zym1010 · Apr 22 '17 23:04

Another run on Ubuntu.

Epoch: [0][5000/5005]   Time 0.340 (0.339)      Data 0.000 (0.001)      Loss 5.9935 (6.6413)    Prec@1 1.953 (0.572)    Prec@5 10.156 (2.335)
^C^CProcess Process-44:
Process Process-42:
Process Process-38:
Traceback (most recent call last):
  File "main.py", line 289, in <module>
Process Process-41:
Process Process-32:
Traceback (most recent call last):
Process Process-39:
  File "/home/yimengzh_everyday/miniconda2/envs/pytorch_openblas/lib/python3.5/multiprocessing/process.py", line 249, in _bootstrap
    self.run()
  File "/home/yimengzh_everyday/miniconda2/envs/pytorch_openblas/lib/python3.5/multiprocessing/process.py", line 93, in run
    self._target(*self._args, **self._kwargs)
  File "/home/yimengzh_everyday/miniconda2/envs/pytorch_openblas/lib/python3.5/site-packages/torch/utils/data/dataloader.py", line 28, in _worker_loop
    r = index_queue.get()
    main()
  File "main.py", line 134, in main
  File "/home/yimengzh_everyday/miniconda2/envs/pytorch_openblas/lib/python3.5/multiprocessing/queues.py", line 342, in get
    with self._rlock:
Process Process-33:
Process Process-24:
    prec1 = validate(val_loader, model, criterion)
  File "main.py", line 207, in validate
Traceback (most recent call last):
Process Process-27:
Process Process-40:
Process Process-28:
  File "/home/yimengzh_everyday/miniconda2/envs/pytorch_openblas/lib/python3.5/multiprocessing/process.py", line 249, in _bootstrap
    self.run()
  File "/home/yimengzh_everyday/miniconda2/envs/pytorch_openblas/lib/python3.5/multiprocessing/process.py", line 93, in run
    self._target(*self._args, **self._kwargs)
  File "/home/yimengzh_everyday/miniconda2/envs/pytorch_openblas/lib/python3.5/site-packages/torch/utils/data/dataloader.py", line 28, in _worker_loop
    r = index_queue.get()
  File "/home/yimengzh_everyday/miniconda2/envs/pytorch_openblas/lib/python3.5/multiprocessing/queues.py", line 342, in get
    with self._rlock:
  File "/home/yimengzh_everyday/miniconda2/envs/pytorch_openblas/lib/python3.5/multiprocessing/synchronize.py", line 96, in __enter__
    return self._semlock.__enter__()
    for i, (input, target) in enumerate(val_loader):
  File "/home/yimengzh_everyday/miniconda2/envs/pytorch_openblas/lib/python3.5/site-packages/torch/utils/data/dataloader.py", line 168, in __next__
KeyboardInterrupt
    idx, batch = self.data_queue.get()
  File "/home/yimengzh_everyday/miniconda2/envs/pytorch_openblas/lib/python3.5/queue.py", line 164, in get
Process Process-35:
  File "/home/yimengzh_everyday/miniconda2/envs/pytorch_openblas/lib/python3.5/multiprocessing/synchronize.py", line 96, in __enter__
    return self._semlock.__enter__()
Process Process-26:
Process Process-25:
KeyboardInterrupt
Process Process-34:
Process Process-29:
    self.not_empty.wait()
  File "/home/yimengzh_everyday/miniconda2/envs/pytorch_openblas/lib/python3.5/threading.py", line 293, in wait
Process Process-36:
Traceback (most recent call last):
  File "/home/yimengzh_everyday/miniconda2/envs/pytorch_openblas/lib/python3.5/multiprocessing/process.py", line 249, in _bootstrap
    self.run()
  File "/home/yimengzh_everyday/miniconda2/envs/pytorch_openblas/lib/python3.5/multiprocessing/process.py", line 93, in run
    self._target(*self._args, **self._kwargs)
    waiter.acquire()
  File "/home/yimengzh_everyday/miniconda2/envs/pytorch_openblas/lib/python3.5/site-packages/torch/utils/data/dataloader.py", line 28, in _worker_loop
    r = index_queue.get()
  File "/home/yimengzh_everyday/miniconda2/envs/pytorch_openblas/lib/python3.5/multiprocessing/queues.py", line 342, in get
    with self._rlock:
Traceback (most recent call last):
KeyboardInterrupt
  File "/home/yimengzh_everyday/miniconda2/envs/pytorch_openblas/lib/python3.5/multiprocessing/synchronize.py", line 96, in __enter__
    return self._semlock.__enter__()
KeyboardInterrupt
  File "/home/yimengzh_everyday/miniconda2/envs/pytorch_openblas/lib/python3.5/multiprocessing/process.py", line 249, in _bootstrap
    self.run()
  File "/home/yimengzh_everyday/miniconda2/envs/pytorch_openblas/lib/python3.5/multiprocessing/process.py", line 93, in run
    self._target(*self._args, **self._kwargs)
  File "/home/yimengzh_everyday/miniconda2/envs/pytorch_openblas/lib/python3.5/site-packages/torch/utils/data/dataloader.py", line 28, in _worker_loop
    r = index_queue.get()
  File "/home/yimengzh_everyday/miniconda2/envs/pytorch_openblas/lib/python3.5/multiprocessing/queues.py", line 342, in get
    with self._rlock:
  File "/home/yimengzh_everyday/miniconda2/envs/pytorch_openblas/lib/python3.5/multiprocessing/synchronize.py", line 96, in __enter__
    return self._semlock.__enter__()
KeyboardInterrupt
Process Process-43:
Process Process-31:
Traceback (most recent call last):
Process Process-37:
  File "/home/yimengzh_everyday/miniconda2/envs/pytorch_openblas/lib/python3.5/multiprocessing/process.py", line 249, in _bootstrap
    self.run()
  File "/home/yimengzh_everyday/miniconda2/envs/pytorch_openblas/lib/python3.5/multiprocessing/process.py", line 93, in run
    self._target(*self._args, **self._kwargs)
  File "/home/yimengzh_everyday/miniconda2/envs/pytorch_openblas/lib/python3.5/site-packages/torch/utils/data/dataloader.py", line 28, in _worker_loop
    r = index_queue.get()
  File "/home/yimengzh_everyday/miniconda2/envs/pytorch_openblas/lib/python3.5/multiprocessing/queues.py", line 342, in get
    with self._rlock:
  File "/home/yimengzh_everyday/miniconda2/envs/pytorch_openblas/lib/python3.5/multiprocessing/synchronize.py", line 96, in __enter__
    return self._semlock.__enter__()
Traceback (most recent call last):
Process Process-30:
  File "/home/yimengzh_everyday/miniconda2/envs/pytorch_openblas/lib/python3.5/multiprocessing/process.py", line 249, in _bootstrap
    self.run()
  File "/home/yimengzh_everyday/miniconda2/envs/pytorch_openblas/lib/python3.5/multiprocessing/process.py", line 93, in run
    self._target(*self._args, **self._kwargs)
KeyboardInterrupt
Traceback (most recent call last):
  File "/home/yimengzh_everyday/miniconda2/envs/pytorch_openblas/lib/python3.5/site-packages/torch/utils/data/dataloader.py", line 28, in _worker_loop
    r = index_queue.get()
  File "/home/yimengzh_everyday/miniconda2/envs/pytorch_openblas/lib/python3.5/multiprocessing/queues.py", line 342, in get
    with self._rlock:
  File "/home/yimengzh_everyday/miniconda2/envs/pytorch_openblas/lib/python3.5/multiprocessing/synchronize.py", line 96, in __enter__
    return self._semlock.__enter__()
  File "/home/yimengzh_everyday/miniconda2/envs/pytorch_openblas/lib/python3.5/multiprocessing/process.py", line 249, in _bootstrap
    self.run()
  File "/home/yimengzh_everyday/miniconda2/envs/pytorch_openblas/lib/python3.5/multiprocessing/process.py", line 93, in run
    self._target(*self._args, **self._kwargs)
  File "/home/yimengzh_everyday/miniconda2/envs/pytorch_openblas/lib/python3.5/site-packages/torch/utils/data/dataloader.py", line 28, in _worker_loop
    r = index_queue.get()
  File "/home/yimengzh_everyday/miniconda2/envs/pytorch_openblas/lib/python3.5/multiprocessing/queues.py", line 342, in get
    with self._rlock:
KeyboardInterrupt
  File "/home/yimengzh_everyday/miniconda2/envs/pytorch_openblas/lib/python3.5/multiprocessing/synchronize.py", line 96, in __enter__
    return self._semlock.__enter__()
KeyboardInterrupt
Traceback (most recent call last):
  File "/home/yimengzh_everyday/miniconda2/envs/pytorch_openblas/lib/python3.5/multiprocessing/process.py", line 249, in _bootstrap
    self.run()
  File "/home/yimengzh_everyday/miniconda2/envs/pytorch_openblas/lib/python3.5/multiprocessing/process.py", line 93, in run
    self._target(*self._args, **self._kwargs)
  File "/home/yimengzh_everyday/miniconda2/envs/pytorch_openblas/lib/python3.5/site-packages/torch/utils/data/dataloader.py", line 28, in _worker_loop
    r = index_queue.get()
  File "/home/yimengzh_everyday/miniconda2/envs/pytorch_openblas/lib/python3.5/multiprocessing/queues.py", line 342, in get
    with self._rlock:
  File "/home/yimengzh_everyday/miniconda2/envs/pytorch_openblas/lib/python3.5/multiprocessing/synchronize.py", line 96, in __enter__
    return self._semlock.__enter__()
KeyboardInterrupt
Traceback (most recent call last):
  File "/home/yimengzh_everyday/miniconda2/envs/pytorch_openblas/lib/python3.5/multiprocessing/process.py", line 249, in _bootstrap
    self.run()
  File "/home/yimengzh_everyday/miniconda2/envs/pytorch_openblas/lib/python3.5/multiprocessing/process.py", line 93, in run
    self._target(*self._args, **self._kwargs)
  File "/home/yimengzh_everyday/miniconda2/envs/pytorch_openblas/lib/python3.5/site-packages/torch/utils/data/dataloader.py", line 28, in _worker_loop
    r = index_queue.get()
  File "/home/yimengzh_everyday/miniconda2/envs/pytorch_openblas/lib/python3.5/multiprocessing/queues.py", line 342, in get
    with self._rlock:
  File "/home/yimengzh_everyday/miniconda2/envs/pytorch_openblas/lib/python3.5/multiprocessing/synchronize.py", line 96, in __enter__
    return self._semlock.__enter__()
KeyboardInterrupt
Traceback (most recent call last):
Traceback (most recent call last):
  File "/home/yimengzh_everyday/miniconda2/envs/pytorch_openblas/lib/python3.5/multiprocessing/process.py", line 249, in _bootstrap
    self.run()
  File "/home/yimengzh_everyday/miniconda2/envs/pytorch_openblas/lib/python3.5/multiprocessing/process.py", line 249, in _bootstrap
    self.run()
  File "/home/yimengzh_everyday/miniconda2/envs/pytorch_openblas/lib/python3.5/multiprocessing/process.py", line 93, in run
    self._target(*self._args, **self._kwargs)
  File "/home/yimengzh_everyday/miniconda2/envs/pytorch_openblas/lib/python3.5/site-packages/torch/utils/data/dataloader.py", line 28, in _worker_loop
    r = index_queue.get()
  File "/home/yimengzh_everyday/miniconda2/envs/pytorch_openblas/lib/python3.5/multiprocessing/queues.py", line 342, in get
    with self._rlock:
  File "/home/yimengzh_everyday/miniconda2/envs/pytorch_openblas/lib/python3.5/multiprocessing/process.py", line 93, in run
    self._target(*self._args, **self._kwargs)
  File "/home/yimengzh_everyday/miniconda2/envs/pytorch_openblas/lib/python3.5/multiprocessing/synchronize.py", line 96, in __enter__
    return self._semlock.__enter__()
  File "/home/yimengzh_everyday/miniconda2/envs/pytorch_openblas/lib/python3.5/site-packages/torch/utils/data/dataloader.py", line 28, in _worker_loop
    r = index_queue.get()
  File "/home/yimengzh_everyday/miniconda2/envs/pytorch_openblas/lib/python3.5/multiprocessing/queues.py", line 342, in get
    with self._rlock:
KeyboardInterrupt
  File "/home/yimengzh_everyday/miniconda2/envs/pytorch_openblas/lib/python3.5/multiprocessing/synchronize.py", line 96, in __enter__
    return self._semlock.__enter__()
KeyboardInterrupt
Traceback (most recent call last):
  File "/home/yimengzh_everyday/miniconda2/envs/pytorch_openblas/lib/python3.5/multiprocessing/process.py", line 249, in _bootstrap
    self.run()
  File "/home/yimengzh_everyday/miniconda2/envs/pytorch_openblas/lib/python3.5/multiprocessing/process.py", line 93, in run
    self._target(*self._args, **self._kwargs)
  File "/home/yimengzh_everyday/miniconda2/envs/pytorch_openblas/lib/python3.5/site-packages/torch/utils/data/dataloader.py", line 28, in _worker_loop
    r = index_queue.get()
  File "/home/yimengzh_everyday/miniconda2/envs/pytorch_openblas/lib/python3.5/multiprocessing/queues.py", line 342, in get
    with self._rlock:
  File "/home/yimengzh_everyday/miniconda2/envs/pytorch_openblas/lib/python3.5/multiprocessing/synchronize.py", line 96, in __enter__
    return self._semlock.__enter__()
KeyboardInterrupt
Traceback (most recent call last):
  File "/home/yimengzh_everyday/miniconda2/envs/pytorch_openblas/lib/python3.5/multiprocessing/process.py", line 249, in _bootstrap
    self.run()
  File "/home/yimengzh_everyday/miniconda2/envs/pytorch_openblas/lib/python3.5/multiprocessing/process.py", line 93, in run
    self._target(*self._args, **self._kwargs)
  File "/home/yimengzh_everyday/miniconda2/envs/pytorch_openblas/lib/python3.5/site-packages/torch/utils/data/dataloader.py", line 28, in _worker_loop
    r = index_queue.get()
  File "/home/yimengzh_everyday/miniconda2/envs/pytorch_openblas/lib/python3.5/multiprocessing/queues.py", line 342, in get
    with self._rlock:
  File "/home/yimengzh_everyday/miniconda2/envs/pytorch_openblas/lib/python3.5/multiprocessing/synchronize.py", line 96, in __enter__
    return self._semlock.__enter__()
KeyboardInterrupt
Traceback (most recent call last):
  File "/home/yimengzh_everyday/miniconda2/envs/pytorch_openblas/lib/python3.5/multiprocessing/process.py", line 249, in _bootstrap
    self.run()
  File "/home/yimengzh_everyday/miniconda2/envs/pytorch_openblas/lib/python3.5/multiprocessing/process.py", line 93, in run
    self._target(*self._args, **self._kwargs)
  File "/home/yimengzh_everyday/miniconda2/envs/pytorch_openblas/lib/python3.5/site-packages/torch/utils/data/dataloader.py", line 28, in _worker_loop
    r = index_queue.get()
  File "/home/yimengzh_everyday/miniconda2/envs/pytorch_openblas/lib/python3.5/multiprocessing/queues.py", line 342, in get
    with self._rlock:
  File "/home/yimengzh_everyday/miniconda2/envs/pytorch_openblas/lib/python3.5/multiprocessing/synchronize.py", line 96, in __enter__
    return self._semlock.__enter__()
KeyboardInterrupt
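
Because Ctrl+C interrupts the main process and every worker at once, the tracebacks above are interleaved and hard to attribute to a process. Below is a hedged sketch of an alternative way to inspect a hung process, using the standard-library `faulthandler` module (available in Python 3.3+). It was not used in the runs above. `enable_stack_dump_on_signal` is a hypothetical helper name, and the signal-based part assumes a Unix-like system where `SIGUSR1` exists.

```python
import faulthandler
import signal
import sys
import tempfile


def enable_stack_dump_on_signal():
    """Hypothetical helper: after this call, `kill -USR1 <pid>` makes the
    process write every thread's stack to stderr and then keep running."""
    # Assumption: Unix-like system; SIGUSR1 is not defined on Windows.
    faulthandler.register(signal.SIGUSR1, file=sys.stderr, all_threads=True)


# The same kind of dump can be triggered directly. Here it is written to a
# temporary file so that the text can be read back and inspected.
with tempfile.TemporaryFile(mode="w+") as dump_file:
    faulthandler.dump_traceback(file=dump_file, all_threads=True)
    dump_file.seek(0)
    dump_text = dump_file.read()
```

Calling the helper once near the top of the training script would allow a stuck process to be inspected by PID, one process at a time, without killing the job.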

zym1010 · Apr 23 '17 00:04

Another run on CentOS.

Epoch: [0][4980/5005]   Time 0.159 (1.079)      Data 0.000 (0.902)      Loss 5.9452 (6.6232)    Prec@1 4.297 (0.639)    Prec@5 8.984 (2.517)
Epoch: [0][5000/5005]   Time 0.157 (1.076)      Data 0.000 (0.899)      Loss 5.9369 (6.6202)    Prec@1 1.172 (0.646)    Prec@5 8.984 (2.546)
^CTraceback (most recent call last):
  File "main.py", line 292, in <module>
Process Process-31:
Process Process-42:
Process Process-33:
Process Process-25:
Process Process-26:
Process Process-44:
Process Process-41:
Process Process-43:
Process Process-39:
Process Process-29:
Process Process-32:
Process Process-27:
Process Process-34:
Process Process-35:
Process Process-30:
Process Process-36:
Process Process-37:
Process Process-40:
Process Process-28:
Process Process-38:
    main()
  File "main.py", line 137, in main
    prec1 = validate(val_loader, model, criterion)
  File "main.py", line 210, in validate
    for i, (input, target) in enumerate(val_loader):
  File "/home/yimengzh/miniconda2/envs/pytorch/lib/python3.5/site-packages/torch/utils/data/dataloader.py", line 168, in __next__
    idx, batch = self.data_queue.get()
  File "/home/yimengzh/miniconda2/envs/pytorch/lib/python3.5/queue.py", line 164, in get
    self.not_empty.wait()
  File "/home/yimengzh/miniconda2/envs/pytorch/lib/python3.5/threading.py", line 293, in wait
Traceback (most recent call last):
  File "/home/yimengzh/miniconda2/envs/pytorch/lib/python3.5/multiprocessing/process.py", line 249, in _bootstrap
    self.run()
  File "/home/yimengzh/miniconda2/envs/pytorch/lib/python3.5/multiprocessing/process.py", line 93, in run
    self._target(*self._args, **self._kwargs)
  File "/home/yimengzh/miniconda2/envs/pytorch/lib/python3.5/site-packages/torch/utils/data/dataloader.py", line 28, in _worker_loop
    r = index_queue.get()
  File "/home/yimengzh/miniconda2/envs/pytorch/lib/python3.5/multiprocessing/queues.py", line 342, in get
    with self._rlock:
  File "/home/yimengzh/miniconda2/envs/pytorch/lib/python3.5/multiprocessing/synchronize.py", line 96, in __enter__
    return self._semlock.__enter__()
Traceback (most recent call last):
Traceback (most recent call last):
KeyboardInterrupt
  File "/home/yimengzh/miniconda2/envs/pytorch/lib/python3.5/multiprocessing/process.py", line 249, in _bootstrap
    self.run()
  File "/home/yimengzh/miniconda2/envs/pytorch/lib/python3.5/multiprocessing/process.py", line 249, in _bootstrap
    self.run()
  File "/home/yimengzh/miniconda2/envs/pytorch/lib/python3.5/multiprocessing/process.py", line 93, in run
    self._target(*self._args, **self._kwargs)
  File "/home/yimengzh/miniconda2/envs/pytorch/lib/python3.5/multiprocessing/process.py", line 93, in run
    self._target(*self._args, **self._kwargs)
  File "/home/yimengzh/miniconda2/envs/pytorch/lib/python3.5/site-packages/torch/utils/data/dataloader.py", line 28, in _worker_loop
    r = index_queue.get()
  File "/home/yimengzh/miniconda2/envs/pytorch/lib/python3.5/site-packages/torch/utils/data/dataloader.py", line 28, in _worker_loop
    r = index_queue.get()
  File "/home/yimengzh/miniconda2/envs/pytorch/lib/python3.5/multiprocessing/queues.py", line 342, in get
    with self._rlock:
  File "/home/yimengzh/miniconda2/envs/pytorch/lib/python3.5/multiprocessing/queues.py", line 342, in get
    with self._rlock:
  File "/home/yimengzh/miniconda2/envs/pytorch/lib/python3.5/multiprocessing/synchronize.py", line 96, in __enter__
    return self._semlock.__enter__()
  File "/home/yimengzh/miniconda2/envs/pytorch/lib/python3.5/multiprocessing/synchronize.py", line 96, in __enter__
    return self._semlock.__enter__()
KeyboardInterrupt
KeyboardInterrupt
Traceback (most recent call last):
Traceback (most recent call last):
Traceback (most recent call last):
  File "/home/yimengzh/miniconda2/envs/pytorch/lib/python3.5/multiprocessing/process.py", line 249, in _bootstrap
    self.run()
  File "/home/yimengzh/miniconda2/envs/pytorch/lib/python3.5/multiprocessing/process.py", line 93, in run
    self._target(*self._args, **self._kwargs)
Traceback (most recent call last):
  File "/home/yimengzh/miniconda2/envs/pytorch/lib/python3.5/site-packages/torch/utils/data/dataloader.py", line 28, in _worker_loop
    r = index_queue.get()
  File "/home/yimengzh/miniconda2/envs/pytorch/lib/python3.5/multiprocessing/process.py", line 249, in _bootstrap
    self.run()
  File "/home/yimengzh/miniconda2/envs/pytorch/lib/python3.5/multiprocessing/process.py", line 249, in _bootstrap
    self.run()
  File "/home/yimengzh/miniconda2/envs/pytorch/lib/python3.5/multiprocessing/queues.py", line 342, in get
    with self._rlock:
Traceback (most recent call last):
  File "/home/yimengzh/miniconda2/envs/pytorch/lib/python3.5/multiprocessing/process.py", line 93, in run
    self._target(*self._args, **self._kwargs)
  File "/home/yimengzh/miniconda2/envs/pytorch/lib/python3.5/multiprocessing/process.py", line 93, in run
    self._target(*self._args, **self._kwargs)
Traceback (most recent call last):
Traceback (most recent call last):
  File "/home/yimengzh/miniconda2/envs/pytorch/lib/python3.5/multiprocessing/synchronize.py", line 96, in __enter__
    return self._semlock.__enter__()
Traceback (most recent call last):
Traceback (most recent call last):
  File "/home/yimengzh/miniconda2/envs/pytorch/lib/python3.5/site-packages/torch/utils/data/dataloader.py", line 28, in _worker_loop
    r = index_queue.get()
  File "/home/yimengzh/miniconda2/envs/pytorch/lib/python3.5/site-packages/torch/utils/data/dataloader.py", line 28, in _worker_loop
    r = index_queue.get()
  File "/home/yimengzh/miniconda2/envs/pytorch/lib/python3.5/multiprocessing/queues.py", line 342, in get
    with self._rlock:
  File "/home/yimengzh/miniconda2/envs/pytorch/lib/python3.5/multiprocessing/queues.py", line 342, in get
    with self._rlock:
  File "/home/yimengzh/miniconda2/envs/pytorch/lib/python3.5/multiprocessing/process.py", line 249, in _bootstrap
    self.run()
KeyboardInterrupt
  File "/home/yimengzh/miniconda2/envs/pytorch/lib/python3.5/multiprocessing/synchronize.py", line 96, in __enter__
    return self._semlock.__enter__()
  File "/home/yimengzh/miniconda2/envs/pytorch/lib/python3.5/multiprocessing/synchronize.py", line 96, in __enter__
    return self._semlock.__enter__()
  File "/home/yimengzh/miniconda2/envs/pytorch/lib/python3.5/multiprocessing/process.py", line 93, in run
    self._target(*self._args, **self._kwargs)
  File "/home/yimengzh/miniconda2/envs/pytorch/lib/python3.5/multiprocessing/process.py", line 249, in _bootstrap
    self.run()
  File "/home/yimengzh/miniconda2/envs/pytorch/lib/python3.5/multiprocessing/process.py", line 249, in _bootstrap
    self.run()
  File "/home/yimengzh/miniconda2/envs/pytorch/lib/python3.5/multiprocessing/process.py", line 249, in _bootstrap
    self.run()
  File "/home/yimengzh/miniconda2/envs/pytorch/lib/python3.5/multiprocessing/process.py", line 249, in _bootstrap
    self.run()
  File "/home/yimengzh/miniconda2/envs/pytorch/lib/python3.5/multiprocessing/process.py", line 249, in _bootstrap
    self.run()
  File "/home/yimengzh/miniconda2/envs/pytorch/lib/python3.5/site-packages/torch/utils/data/dataloader.py", line 28, in _worker_loop
    r = index_queue.get()
  File "/home/yimengzh/miniconda2/envs/pytorch/lib/python3.5/multiprocessing/process.py", line 93, in run
    self._target(*self._args, **self._kwargs)
  File "/home/yimengzh/miniconda2/envs/pytorch/lib/python3.5/site-packages/torch/utils/data/dataloader.py", line 28, in _worker_loop
    r = index_queue.get()
  File "/home/yimengzh/miniconda2/envs/pytorch/lib/python3.5/multiprocessing/queues.py", line 342, in get
    with self._rlock:
  File "/home/yimengzh/miniconda2/envs/pytorch/lib/python3.5/multiprocessing/synchronize.py", line 96, in __enter__
    return self._semlock.__enter__()
  File "/home/yimengzh/miniconda2/envs/pytorch/lib/python3.5/multiprocessing/process.py", line 93, in run
    self._target(*self._args, **self._kwargs)
  File "/home/yimengzh/miniconda2/envs/pytorch/lib/python3.5/multiprocessing/process.py", line 93, in run
    self._target(*self._args, **self._kwargs)
KeyboardInterrupt
  File "/home/yimengzh/miniconda2/envs/pytorch/lib/python3.5/multiprocessing/process.py", line 93, in run
    self._target(*self._args, **self._kwargs)
  File "/home/yimengzh/miniconda2/envs/pytorch/lib/python3.5/site-packages/torch/utils/data/dataloader.py", line 28, in _worker_loop
    r = index_queue.get()
  File "/home/yimengzh/miniconda2/envs/pytorch/lib/python3.5/site-packages/torch/utils/data/dataloader.py", line 28, in _worker_loop
    r = index_queue.get()
  File "/home/yimengzh/miniconda2/envs/pytorch/lib/python3.5/multiprocessing/process.py", line 93, in run
    self._target(*self._args, **self._kwargs)
KeyboardInterrupt
  File "/home/yimengzh/miniconda2/envs/pytorch/lib/python3.5/multiprocessing/queues.py", line 342, in get
    with self._rlock:
  File "/home/yimengzh/miniconda2/envs/pytorch/lib/python3.5/multiprocessing/queues.py", line 342, in get
    with self._rlock:
  File "/home/yimengzh/miniconda2/envs/pytorch/lib/python3.5/multiprocessing/queues.py", line 342, in get
    with self._rlock:
  File "/home/yimengzh/miniconda2/envs/pytorch/lib/python3.5/multiprocessing/synchronize.py", line 96, in __enter__
    return self._semlock.__enter__()
  File "/home/yimengzh/miniconda2/envs/pytorch/lib/python3.5/multiprocessing/synchronize.py", line 96, in __enter__
    return self._semlock.__enter__()
  File "/home/yimengzh/miniconda2/envs/pytorch/lib/python3.5/site-packages/torch/utils/data/dataloader.py", line 28, in _worker_loop
    r = index_queue.get()
  File "/home/yimengzh/miniconda2/envs/pytorch/lib/python3.5/site-packages/torch/utils/data/dataloader.py", line 28, in _worker_loop
    r = index_queue.get()
  File "/home/yimengzh/miniconda2/envs/pytorch/lib/python3.5/multiprocessing/queues.py", line 342, in get
    with self._rlock:
  File "/home/yimengzh/miniconda2/envs/pytorch/lib/python3.5/multiprocessing/queues.py", line 342, in get
    with self._rlock:
  File "/home/yimengzh/miniconda2/envs/pytorch/lib/python3.5/multiprocessing/synchronize.py", line 96, in __enter__
    return self._semlock.__enter__()
  File "/home/yimengzh/miniconda2/envs/pytorch/lib/python3.5/multiprocessing/synchronize.py", line 96, in __enter__
    return self._semlock.__enter__()
  File "/home/yimengzh/miniconda2/envs/pytorch/lib/python3.5/multiprocessing/synchronize.py", line 96, in __enter__
    return self._semlock.__enter__()
KeyboardInterrupt
KeyboardInterrupt
KeyboardInterrupt
KeyboardInterrupt
KeyboardInterrupt
KeyboardInterrupt
Traceback (most recent call last):
  File "/home/yimengzh/miniconda2/envs/pytorch/lib/python3.5/multiprocessing/process.py", line 249, in _bootstrap
    self.run()
  File "/home/yimengzh/miniconda2/envs/pytorch/lib/python3.5/multiprocessing/process.py", line 93, in run
    self._target(*self._args, **self._kwargs)
  File "/home/yimengzh/miniconda2/envs/pytorch/lib/python3.5/site-packages/torch/utils/data/dataloader.py", line 28, in _worker_loop
    r = index_queue.get()
  File "/home/yimengzh/miniconda2/envs/pytorch/lib/python3.5/multiprocessing/queues.py", line 342, in get
    with self._rlock:
  File "/home/yimengzh/miniconda2/envs/pytorch/lib/python3.5/multiprocessing/synchronize.py", line 96, in __enter__
    return self._semlock.__enter__()
KeyboardInterrupt
Traceback (most recent call last):
Traceback (most recent call last):
Traceback (most recent call last):
Traceback (most recent call last):
  File "/home/yimengzh/miniconda2/envs/pytorch/lib/python3.5/multiprocessing/process.py", line 249, in _bootstrap
    self.run()
  File "/home/yimengzh/miniconda2/envs/pytorch/lib/python3.5/multiprocessing/process.py", line 249, in _bootstrap
    self.run()
  File "/home/yimengzh/miniconda2/envs/pytorch/lib/python3.5/multiprocessing/process.py", line 93, in run
    self._target(*self._args, **self._kwargs)
  File "/home/yimengzh/miniconda2/envs/pytorch/lib/python3.5/multiprocessing/process.py", line 249, in _bootstrap
    self.run()
  File "/home/yimengzh/miniconda2/envs/pytorch/lib/python3.5/site-packages/torch/utils/data/dataloader.py", line 28, in _worker_loop
    r = index_queue.get()
  File "/home/yimengzh/miniconda2/envs/pytorch/lib/python3.5/multiprocessing/process.py", line 249, in _bootstrap
    self.run()
  File "/home/yimengzh/miniconda2/envs/pytorch/lib/python3.5/multiprocessing/process.py", line 93, in run
    self._target(*self._args, **self._kwargs)
  File "/home/yimengzh/miniconda2/envs/pytorch/lib/python3.5/multiprocessing/process.py", line 93, in run
    self._target(*self._args, **self._kwargs)
  File "/home/yimengzh/miniconda2/envs/pytorch/lib/python3.5/multiprocessing/queues.py", line 342, in get
    with self._rlock:
  File "/home/yimengzh/miniconda2/envs/pytorch/lib/python3.5/multiprocessing/process.py", line 93, in run
    self._target(*self._args, **self._kwargs)
  File "/home/yimengzh/miniconda2/envs/pytorch/lib/python3.5/multiprocessing/synchronize.py", line 96, in __enter__
    return self._semlock.__enter__()
  File "/home/yimengzh/miniconda2/envs/pytorch/lib/python3.5/site-packages/torch/utils/data/dataloader.py", line 28, in _worker_loop
    r = index_queue.get()
  File "/home/yimengzh/miniconda2/envs/pytorch/lib/python3.5/site-packages/torch/utils/data/dataloader.py", line 28, in _worker_loop
    r = index_queue.get()
  File "/home/yimengzh/miniconda2/envs/pytorch/lib/python3.5/site-packages/torch/utils/data/dataloader.py", line 28, in _worker_loop
    r = index_queue.get()
  File "/home/yimengzh/miniconda2/envs/pytorch/lib/python3.5/multiprocessing/queues.py", line 342, in get
    with self._rlock:
  File "/home/yimengzh/miniconda2/envs/pytorch/lib/python3.5/multiprocessing/queues.py", line 342, in get
    with self._rlock:
  File "/home/yimengzh/miniconda2/envs/pytorch/lib/python3.5/multiprocessing/synchronize.py", line 96, in __enter__
    return self._semlock.__enter__()
  File "/home/yimengzh/miniconda2/envs/pytorch/lib/python3.5/multiprocessing/queues.py", line 342, in get
    with self._rlock:
KeyboardInterrupt
  File "/home/yimengzh/miniconda2/envs/pytorch/lib/python3.5/multiprocessing/synchronize.py", line 96, in __enter__
    return self._semlock.__enter__()
  File "/home/yimengzh/miniconda2/envs/pytorch/lib/python3.5/multiprocessing/synchronize.py", line 96, in __enter__
    return self._semlock.__enter__()
KeyboardInterrupt
KeyboardInterrupt
KeyboardInterrupt
Traceback (most recent call last):
  File "/home/yimengzh/miniconda2/envs/pytorch/lib/python3.5/multiprocessing/process.py", line 249, in _bootstrap
    self.run()
  File "/home/yimengzh/miniconda2/envs/pytorch/lib/python3.5/multiprocessing/process.py", line 93, in run
    self._target(*self._args, **self._kwargs)
  File "/home/yimengzh/miniconda2/envs/pytorch/lib/python3.5/site-packages/torch/utils/data/dataloader.py", line 28, in _worker_loop
    r = index_queue.get()
Traceback (most recent call last):
Traceback (most recent call last):
  File "/home/yimengzh/miniconda2/envs/pytorch/lib/python3.5/multiprocessing/process.py", line 249, in _bootstrap
    self.run()
  File "/home/yimengzh/miniconda2/envs/pytorch/lib/python3.5/multiprocessing/process.py", line 93, in run
    self._target(*self._args, **self._kwargs)
  File "/home/yimengzh/miniconda2/envs/pytorch/lib/python3.5/site-packages/torch/utils/data/dataloader.py", line 28, in _worker_loop
    r = index_queue.get()
  File "/home/yimengzh/miniconda2/envs/pytorch/lib/python3.5/multiprocessing/queues.py", line 342, in get
    with self._rlock:
  File "/home/yimengzh/miniconda2/envs/pytorch/lib/python3.5/multiprocessing/process.py", line 249, in _bootstrap
    self.run()
  File "/home/yimengzh/miniconda2/envs/pytorch/lib/python3.5/multiprocessing/synchronize.py", line 96, in __enter__
    return self._semlock.__enter__()
  File "/home/yimengzh/miniconda2/envs/pytorch/lib/python3.5/multiprocessing/process.py", line 93, in run
    self._target(*self._args, **self._kwargs)
  File "/home/yimengzh/miniconda2/envs/pytorch/lib/python3.5/site-packages/torch/utils/data/dataloader.py", line 28, in _worker_loop
    r = index_queue.get()
KeyboardInterrupt
  File "/home/yimengzh/miniconda2/envs/pytorch/lib/python3.5/multiprocessing/queues.py", line 343, in get
    res = self._reader.recv_bytes()
  File "/home/yimengzh/miniconda2/envs/pytorch/lib/python3.5/multiprocessing/connection.py", line 216, in recv_bytes
    buf = self._recv_bytes(maxlength)
  File "/home/yimengzh/miniconda2/envs/pytorch/lib/python3.5/multiprocessing/connection.py", line 407, in _recv_bytes
    buf = self._recv(4)
  File "/home/yimengzh/miniconda2/envs/pytorch/lib/python3.5/multiprocessing/connection.py", line 379, in _recv
    chunk = read(handle, remaining)
KeyboardInterrupt
    waiter.acquire()
KeyboardInterrupt

zym1010 avatar Apr 23 '17 04:04 zym1010

Another run from CentOS. (Previously, all the CentOS runs were done using a Maxwell Titan X; this one was done using Pascal.)

Epoch: [0][4960/5005]   Time 0.111 (0.368)      Data 0.001 (0.274)      Loss 6.0431 (6.6413)    Prec@1 2.344 (0.549)       Prec@5 8.203 (2.242)
Epoch: [0][4980/5005]   Time 0.100 (0.367)      Data 0.000 (0.274)      Loss 5.9902 (6.6386)    Prec@1 1.562 (0.556)       Prec@5 5.469 (2.266)
Epoch: [0][5000/5005]   Time 0.100 (0.368)      Data 0.000 (0.275)      Loss 6.1395 (6.6359)    Prec@1 2.734 (0.563)       Prec@5 6.641 (2.289)
Test: [0/196]   Time 7.893 (7.893)      Loss 5.1316 (5.1316)    Prec@1 7.031 (7.031)    Prec@5 30.078 (30.078)
^CTraceback (most recent call last):
  File "main.py", line 292, in <module>
Process Process-37:
Process Process-23:
Process Process-24:
Process Process-39:
Process Process-25:
Process Process-36:
Process Process-40:
Process Process-41:
Process Process-42:
Process Process-43:
Process Process-35:
Process Process-44:
Process Process-38:
    main()
  File "main.py", line 137, in main
    prec1 = validate(val_loader, model, criterion)
  File "main.py", line 210, in validate
    for i, (input, target) in enumerate(val_loader):
  File "/home/yimengzh/miniconda2/envs/pytorch/lib/python3.5/site-packages/torch/utils/data/dataloader.py", line 168, in __next__
    idx, batch = self.data_queue.get()
  File "/home/yimengzh/miniconda2/envs/pytorch/lib/python3.5/queue.py", line 164, in get
    self.not_empty.wait()
  File "/home/yimengzh/miniconda2/envs/pytorch/lib/python3.5/threading.py", line 293, in wait
Traceback (most recent call last):
Traceback (most recent call last):
  File "/home/yimengzh/miniconda2/envs/pytorch/lib/python3.5/multiprocessing/process.py", line 249, in _bootstrap
    self.run()
Traceback (most recent call last):
  File "/home/yimengzh/miniconda2/envs/pytorch/lib/python3.5/multiprocessing/process.py", line 93, in run
    self._target(*self._args, **self._kwargs)
  File "/home/yimengzh/miniconda2/envs/pytorch/lib/python3.5/site-packages/torch/utils/data/dataloader.py", line 28, in _worker_loop
    r = index_queue.get()
  File "/home/yimengzh/miniconda2/envs/pytorch/lib/python3.5/multiprocessing/queues.py", line 342, in get
    with self._rlock:
Traceback (most recent call last):
  File "/home/yimengzh/miniconda2/envs/pytorch/lib/python3.5/multiprocessing/synchronize.py", line 96, in __enter__
    return self._semlock.__enter__()
  File "/home/yimengzh/miniconda2/envs/pytorch/lib/python3.5/multiprocessing/process.py", line 249, in _bootstrap
    self.run()
Traceback (most recent call last):
  File "/home/yimengzh/miniconda2/envs/pytorch/lib/python3.5/multiprocessing/process.py", line 249, in _bootstrap
    self.run()
  File "/home/yimengzh/miniconda2/envs/pytorch/lib/python3.5/multiprocessing/process.py", line 93, in run
    self._target(*self._args, **self._kwargs)
Traceback (most recent call last):
  File "/home/yimengzh/miniconda2/envs/pytorch/lib/python3.5/multiprocessing/process.py", line 93, in run
    self._target(*self._args, **self._kwargs)
  File "/home/yimengzh/miniconda2/envs/pytorch/lib/python3.5/site-packages/torch/utils/data/dataloader.py", line 28, in _worker_loop
    r = index_queue.get()
  File "/home/yimengzh/miniconda2/envs/pytorch/lib/python3.5/site-packages/torch/utils/data/dataloader.py", line 28, in _worker_loop
    r = index_queue.get()
  File "/home/yimengzh/miniconda2/envs/pytorch/lib/python3.5/multiprocessing/queues.py", line 342, in get
    with self._rlock:
KeyboardInterrupt
  File "/home/yimengzh/miniconda2/envs/pytorch/lib/python3.5/multiprocessing/queues.py", line 342, in get
    with self._rlock:
  File "/home/yimengzh/miniconda2/envs/pytorch/lib/python3.5/multiprocessing/synchronize.py", line 96, in __enter__
    return self._semlock.__enter__()
  File "/home/yimengzh/miniconda2/envs/pytorch/lib/python3.5/multiprocessing/process.py", line 249, in _bootstrap
    self.run()
  File "/home/yimengzh/miniconda2/envs/pytorch/lib/python3.5/multiprocessing/process.py", line 249, in _bootstrap
    self.run()
  File "/home/yimengzh/miniconda2/envs/pytorch/lib/python3.5/multiprocessing/synchronize.py", line 96, in __enter__
    return self._semlock.__enter__()
  File "/home/yimengzh/miniconda2/envs/pytorch/lib/python3.5/multiprocessing/process.py", line 249, in _bootstrap
    self.run()
Traceback (most recent call last):
Traceback (most recent call last):
Traceback (most recent call last):
  File "/home/yimengzh/miniconda2/envs/pytorch/lib/python3.5/multiprocessing/process.py", line 93, in run
    self._target(*self._args, **self._kwargs)
  File "/home/yimengzh/miniconda2/envs/pytorch/lib/python3.5/multiprocessing/process.py", line 93, in run
    self._target(*self._args, **self._kwargs)
  File "/home/yimengzh/miniconda2/envs/pytorch/lib/python3.5/multiprocessing/process.py", line 93, in run
    self._target(*self._args, **self._kwargs)
KeyboardInterrupt
  File "/home/yimengzh/miniconda2/envs/pytorch/lib/python3.5/site-packages/torch/utils/data/dataloader.py", line 28, in _worker_loop
    r = index_queue.get()
  File "/home/yimengzh/miniconda2/envs/pytorch/lib/python3.5/site-packages/torch/utils/data/dataloader.py", line 28, in _worker_loop
    r = index_queue.get()
  File "/home/yimengzh/miniconda2/envs/pytorch/lib/python3.5/site-packages/torch/utils/data/dataloader.py", line 28, in _worker_loop
    r = index_queue.get()
KeyboardInterrupt
  File "/home/yimengzh/miniconda2/envs/pytorch/lib/python3.5/multiprocessing/queues.py", line 342, in get
    with self._rlock:
  File "/home/yimengzh/miniconda2/envs/pytorch/lib/python3.5/multiprocessing/queues.py", line 342, in get
    with self._rlock:
  File "/home/yimengzh/miniconda2/envs/pytorch/lib/python3.5/multiprocessing/process.py", line 249, in _bootstrap
    self.run()
Traceback (most recent call last):
  File "/home/yimengzh/miniconda2/envs/pytorch/lib/python3.5/multiprocessing/queues.py", line 342, in get
    with self._rlock:
  File "/home/yimengzh/miniconda2/envs/pytorch/lib/python3.5/multiprocessing/synchronize.py", line 96, in __enter__
    return self._semlock.__enter__()
Traceback (most recent call last):
  File "/home/yimengzh/miniconda2/envs/pytorch/lib/python3.5/multiprocessing/synchronize.py", line 96, in __enter__
    return self._semlock.__enter__()
  File "/home/yimengzh/miniconda2/envs/pytorch/lib/python3.5/multiprocessing/process.py", line 93, in run
    self._target(*self._args, **self._kwargs)
  File "/home/yimengzh/miniconda2/envs/pytorch/lib/python3.5/multiprocessing/process.py", line 249, in _bootstrap
    self.run()
  File "/home/yimengzh/miniconda2/envs/pytorch/lib/python3.5/multiprocessing/process.py", line 249, in _bootstrap
    self.run()
  File "/home/yimengzh/miniconda2/envs/pytorch/lib/python3.5/multiprocessing/synchronize.py", line 96, in __enter__
    return self._semlock.__enter__()
  File "/home/yimengzh/miniconda2/envs/pytorch/lib/python3.5/site-packages/torch/utils/data/dataloader.py", line 28, in _worker_loop
    r = index_queue.get()
  File "/home/yimengzh/miniconda2/envs/pytorch/lib/python3.5/multiprocessing/process.py", line 93, in run
    self._target(*self._args, **self._kwargs)
KeyboardInterrupt
  File "/home/yimengzh/miniconda2/envs/pytorch/lib/python3.5/multiprocessing/process.py", line 93, in run
    self._target(*self._args, **self._kwargs)
  File "/home/yimengzh/miniconda2/envs/pytorch/lib/python3.5/multiprocessing/queues.py", line 342, in get
    with self._rlock:
  File "/home/yimengzh/miniconda2/envs/pytorch/lib/python3.5/site-packages/torch/utils/data/dataloader.py", line 28, in _worker_loop
    r = index_queue.get()
  File "/home/yimengzh/miniconda2/envs/pytorch/lib/python3.5/site-packages/torch/utils/data/dataloader.py", line 28, in _worker_loop
    r = index_queue.get()
  File "/home/yimengzh/miniconda2/envs/pytorch/lib/python3.5/multiprocessing/process.py", line 249, in _bootstrap
    self.run()
  File "/home/yimengzh/miniconda2/envs/pytorch/lib/python3.5/multiprocessing/process.py", line 249, in _bootstrap
    self.run()
  File "/home/yimengzh/miniconda2/envs/pytorch/lib/python3.5/multiprocessing/queues.py", line 342, in get
    with self._rlock:
KeyboardInterrupt
  File "/home/yimengzh/miniconda2/envs/pytorch/lib/python3.5/multiprocessing/synchronize.py", line 96, in __enter__
    return self._semlock.__enter__()
  File "/home/yimengzh/miniconda2/envs/pytorch/lib/python3.5/multiprocessing/queues.py", line 342, in get
    with self._rlock:
  File "/home/yimengzh/miniconda2/envs/pytorch/lib/python3.5/multiprocessing/synchronize.py", line 96, in __enter__
    return self._semlock.__enter__()
KeyboardInterrupt
  File "/home/yimengzh/miniconda2/envs/pytorch/lib/python3.5/multiprocessing/synchronize.py", line 96, in __enter__
    return self._semlock.__enter__()
  File "/home/yimengzh/miniconda2/envs/pytorch/lib/python3.5/multiprocessing/process.py", line 93, in run
    self._target(*self._args, **self._kwargs)
  File "/home/yimengzh/miniconda2/envs/pytorch/lib/python3.5/multiprocessing/process.py", line 93, in run
    self._target(*self._args, **self._kwargs)
KeyboardInterrupt
  File "/home/yimengzh/miniconda2/envs/pytorch/lib/python3.5/site-packages/torch/utils/data/dataloader.py", line 28, in _worker_loop
    r = index_queue.get()
  File "/home/yimengzh/miniconda2/envs/pytorch/lib/python3.5/site-packages/torch/utils/data/dataloader.py", line 28, in _worker_loop
    r = index_queue.get()
KeyboardInterrupt
  File "/home/yimengzh/miniconda2/envs/pytorch/lib/python3.5/multiprocessing/queues.py", line 342, in get
    with self._rlock:
KeyboardInterrupt
  File "/home/yimengzh/miniconda2/envs/pytorch/lib/python3.5/multiprocessing/queues.py", line 342, in get
    with self._rlock:
  File "/home/yimengzh/miniconda2/envs/pytorch/lib/python3.5/multiprocessing/synchronize.py", line 96, in __enter__
    return self._semlock.__enter__()
  File "/home/yimengzh/miniconda2/envs/pytorch/lib/python3.5/multiprocessing/synchronize.py", line 96, in __enter__
    return self._semlock.__enter__()
KeyboardInterrupt
KeyboardInterrupt
Traceback (most recent call last):
  File "/home/yimengzh/miniconda2/envs/pytorch/lib/python3.5/multiprocessing/process.py", line 249, in _bootstrap
    self.run()
  File "/home/yimengzh/miniconda2/envs/pytorch/lib/python3.5/multiprocessing/process.py", line 93, in run
    self._target(*self._args, **self._kwargs)
  File "/home/yimengzh/miniconda2/envs/pytorch/lib/python3.5/site-packages/torch/utils/data/dataloader.py", line 28, in _worker_loop
    r = index_queue.get()
  File "/home/yimengzh/miniconda2/envs/pytorch/lib/python3.5/multiprocessing/queues.py", line 342, in get
    with self._rlock:
  File "/home/yimengzh/miniconda2/envs/pytorch/lib/python3.5/multiprocessing/synchronize.py", line 96, in __enter__
    return self._semlock.__enter__()
KeyboardInterrupt
Traceback (most recent call last):
  File "/home/yimengzh/miniconda2/envs/pytorch/lib/python3.5/multiprocessing/process.py", line 249, in _bootstrap
    self.run()
  File "/home/yimengzh/miniconda2/envs/pytorch/lib/python3.5/multiprocessing/process.py", line 93, in run
    self._target(*self._args, **self._kwargs)
  File "/home/yimengzh/miniconda2/envs/pytorch/lib/python3.5/site-packages/torch/utils/data/dataloader.py", line 28, in _worker_loop
    r = index_queue.get()
  File "/home/yimengzh/miniconda2/envs/pytorch/lib/python3.5/multiprocessing/queues.py", line 343, in get
    res = self._reader.recv_bytes()
  File "/home/yimengzh/miniconda2/envs/pytorch/lib/python3.5/multiprocessing/connection.py", line 216, in recv_bytes
    buf = self._recv_bytes(maxlength)
  File "/home/yimengzh/miniconda2/envs/pytorch/lib/python3.5/multiprocessing/connection.py", line 407, in _recv_bytes
    buf = self._recv(4)
  File "/home/yimengzh/miniconda2/envs/pytorch/lib/python3.5/multiprocessing/connection.py", line 379, in _recv
    chunk = read(handle, remaining)
KeyboardInterrupt
    waiter.acquire()
KeyboardInterrupt
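
Both interrupted runs above show the same picture: the main process is waiting inside the DataLoader's `__next__` while the workers wait in `index_queue.get()`. One way to capture that picture without pressing Ctrl+C might be Python's standard `faulthandler` module. The sketch below is only an illustration and was not used in this thread; the function name `demo_stack_dump` and the `time.sleep` stand-in for the blocked loader are assumptions.

```python
import faulthandler
import tempfile
import time


def demo_stack_dump(timeout_s=1):
    """Dump all thread stacks to a file if the code is still running after timeout_s."""
    # faulthandler writes straight to the file descriptor, so it needs a real file.
    with tempfile.TemporaryFile() as log:
        faulthandler.dump_traceback_later(timeout_s, file=log)
        try:
            # Hypothetical stand-in for a blocked `for ... in enumerate(val_loader)`.
            time.sleep(timeout_s + 0.5)
        finally:
            # Cancel the watchdog once the guarded section has been left.
            faulthandler.cancel_dump_traceback_later()
        log.seek(0)
        return log.read().decode("utf-8", errors="replace")


if __name__ == "__main__":
    print(demo_stack_dump())
```

In a real training script the same call could be armed with `file=sys.stderr` just before the validation loop. It only covers the process in which it is armed, so the worker processes would need their own call.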

zym1010 avatar Apr 23 '17 16:04 zym1010

@zym1010

Have you found a solution to this other than setting pin_memory=False? That doesn't work for me.

filick avatar Oct 17 '17 02:10 filick

@filick nope.

zym1010 avatar Oct 17 '17 02:10 zym1010

Sad.

filick avatar Oct 17 '17 02:10 filick

We have a similar problem with training locking up on a CentOS system with 4 Pascal Titan Xs in an Ubuntu docker container. We can exec into the docker container, but can't kill the process.

We have not seen this on systems using Ubuntu 16.

gregjohnso avatar Nov 15 '17 03:11 gregjohnso

This hit me this week. On an Ubuntu 16 machine everything works fine, but in a Docker container it freezes randomly. It has also completed successfully once.

umariqb avatar Dec 06 '17 23:12 umariqb

The same issue sometimes occurred on my Ubuntu 16.04 machine when training other networks: the training process just got stuck at Epoch: [0].

jamiechoi1995 avatar Dec 07 '17 05:12 jamiechoi1995

@iqbalu Any solution to this? I am getting this problem when the input data size is huge and num_workers>0. It happens when using Docker on Ubuntu systems.
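
Since several reports here combine Docker with num_workers>0, one thing that may be worth ruling out is the container's shared-memory size: the PyTorch README notes that data loader workers exchange data through shared memory and that a container's default segment may be too small, suggesting `--ipc=host` or `--shm-size`. Nobody in this thread has confirmed that as the cause, so the following is only a quick check, and the function name `shared_memory_report` is made up for illustration.

```python
import os
import shutil


def shared_memory_report(path="/dev/shm"):
    """Return total/free size of the shared-memory mount in MiB, or None if absent."""
    if not os.path.isdir(path):
        return None  # e.g. not Linux, or the mount is missing
    usage = shutil.disk_usage(path)
    mib = 1024 * 1024
    return {"total_mib": usage.total / mib, "free_mib": usage.free / mib}


if __name__ == "__main__":
    print(shared_memory_report())
```

If the reported total is close to Docker's 64 MB default, restarting the container with a larger `--shm-size` is a cheap experiment before digging further.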

rohun-tripathi avatar Jan 03 '18 20:01 rohun-tripathi

@rohun-tripathi No, I am still struggling to find the exact problem. For me, it also gets stuck when using Docker but works fine on my local machine. Additionally, I found that it works fine with nvidia-docker2 but gets stuck with nvidia-docker1, so this could also be related to nvidia-docker. Which version of nvidia-docker are you using?

umariqb avatar Jan 03 '18 21:01 umariqb

Having a similar issue: training gets stuck at epoch 0 when running in a Docker container on a P2 Amazon Linux AMI with CUDA 8.

aysark avatar Jan 16 '18 00:01 aysark

@iqbalu I don't think I am using nvidia-docker at all. My system does have nvidia-docker1 installed, though.

rohun-tripathi avatar Jan 29 '18 00:01 rohun-tripathi

I have a similar problem. The code runs well on 2 GPUs, but when I run it on 4 GPUs, it freezes at the beginning. After I upgraded PyTorch from version 0.3.1 to 0.4.1, it ran for a few iterations but then stalled again, with the process sleeping. I downgraded PyTorch to 0.3.1 and compared my code with the last version that ran successfully on 4 GPUs. The cause was that I used an intermediate model (mediate_out = modelA(input), out = modelB(mediate_out)); after merging the two models into one, it works.

hjy1312 avatar Dec 30 '18 12:12 hjy1312