tftables

Code hangs on training

Open egaebel opened this issue 6 years ago • 7 comments

I'm reading multiple datasets from a single file and after a certain number of iterations the code hangs indefinitely (I let it go overnight just to be absolutely certain). I have to ctrl+C out of it and I get the following exception. Looks like a hang in multitables somewhere? Maybe from the queue not being populated quickly enough?

Traceback (most recent call last):
  File "/usr/lib/python3.5/multiprocessing/process.py", line 249, in _bootstrap
    self.run()
  File "/usr/lib/python3.5/multiprocessing/process.py", line 93, in run
    self._target(*self._args, **self._kwargs)
  File "/workspace-mount/Programs/tachotron2-implementations/barronalex-tachotron/lib/python3.5/site-packages/multitables.py", line 389, in _Streamer__read_process
    with sync.do(cbuf.put_direct(), i, (i+read_size) % len(ary)) as put_ary:
  File "/workspace-mount/Programs/tachotron2-implementations/barronalex-tachotron/lib/python3.5/site-packages/multitables.py", line 136, in __enter__
    with self.sync.barrier_in.wait(*self.index):
  File "/workspace-mount/Programs/tachotron2-implementations/barronalex-tachotron/lib/python3.5/site-packages/multitables.py", line 87, in __enter__
    self.sync.cvar.wait()
  File "/usr/lib/python3.5/multiprocessing/synchronize.py", line 262, in wait
    return self._wait_semaphore.acquire(True, timeout)

egaebel, Aug 14 '18 05:08

I think it occurs when the end of the dataset is reached. I have cyclic=True set on get_batch though...
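
For reference, the read call in the traceback computes its stop index as (i+read_size) % len(ary), so a read that starts near the end of the array wraps back to the beginning. Below is a minimal sketch of that index arithmetic only. The function name and parameters are hypothetical, it assumes read_size is no larger than the array, and it is not multitables' actual code.

```python
def cyclic_slices(start, read_size, length):
    """Return the half-open (lo, hi) slices covered by one cyclic read.

    Hypothetical helper: assumes 0 <= start < length and read_size <= length.
    """
    stop = (start + read_size) % length
    if start + read_size <= length:
        # The read fits before the end of the array: one contiguous slice.
        return [(start, start + read_size)]
    # The read runs past the end: take the tail, then wrap to the head.
    return [(start, length), (0, stop)]


print(cyclic_slices(2, 4, 10))  # [(2, 6)]
print(cyclic_slices(8, 4, 10))  # [(8, 10), (0, 2)]
```

The second call is the case of interest here: the stop index (2) is smaller than the start index (8), which is the situation that only arises once the end of the dataset is reached.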

egaebel, Aug 14 '18 06:08

It seems to be stuck waiting to write to the internal queue, so there may be an issue with ordered access. Can you try setting ordered=False and see if it still hangs? (This will result in corrupted batches if you're splicing multiple datasets to create each training example, but it might help narrow things down.)
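
To illustrate why unordered reads can corrupt spliced batches, here is a minimal pure-Python sketch. It does not use tftables or multitables; the helper names, the block size, and the delivery orders are all hypothetical and only model two datasets whose rows are paired by index.

```python
def deliver(rows, block_size, block_order):
    """Split rows into blocks and return them in the given block order."""
    blocks = [rows[i:i + block_size] for i in range(0, len(rows), block_size)]
    return [row for b in block_order for row in blocks[b]]


features = list(range(8))            # row i of dataset A
labels = [10 * f for f in features]  # row i of dataset B, paired with A by index


def mismatched_pairs(order_a, order_b, block_size=2):
    """Count (feature, label) pairs that no longer belong together."""
    xs = deliver(features, block_size, order_a)
    ys = deliver(labels, block_size, order_b)
    return sum(1 for x, y in zip(xs, ys) if y != 10 * x)


in_order = [0, 1, 2, 3]
print(mismatched_pairs(in_order, in_order))      # 0: both readers keep disk order
print(mismatched_pairs(in_order, [1, 0, 2, 3]))  # 4: one reader swapped two blocks
```

When both datasets are delivered in disk order, every pair lines up. As soon as one reader delivers its blocks in a different order, rows from different positions get spliced together.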

ghcollin, Aug 14 '18 22:08

Thanks for your quick response!

This indeed makes the hanging go away, but I am doing some splicing with datasets.

egaebel, Aug 15 '18 02:08

If you run the multitables unit tests (https://github.com/ghcollin/multitables/blob/master/multitables_test.py), do they complete properly? Also, how many rows do your dataset(s) have?

ghcollin, Aug 16 '18 18:08

Hey sorry for the long silence, I ended up changing my dataset to all be in one table and everything is working fine now.

However, the multitables unit test hangs as well. My dataset has 68933 rows.

egaebel, Sep 12 '18 13:09

So actually, with my one table approach when I set the reader's ordered=False it freezes, but when ordered=True it does not freeze. Very odd...

Any tips on how I can get the multitables unit tests to run? Seems like that is probably the same thing...
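
One generic way to narrow down which call hangs is to run each candidate in a watchdog thread with a timeout, so the script reports the hang instead of blocking forever. This is a minimal standard-library sketch, not part of multitables or its test suite; the function name and the timeout values are hypothetical.

```python
import threading


def finishes_within(func, timeout_s):
    """Run func in a daemon thread; return True if it ends within timeout_s."""
    worker = threading.Thread(target=func, daemon=True)
    worker.start()
    worker.join(timeout_s)
    # If the thread is still alive after the timeout, treat the call as hung.
    return not worker.is_alive()


# Example: a call that returns at once, and one that blocks until released.
release = threading.Event()
quick_ok = finishes_within(lambda: None, 1.0)
blocked_ok = finishes_within(release.wait, 0.2)
release.set()  # let the blocked thread exit cleanly

print(quick_ok, blocked_ok)
```

Wrapping individual test functions this way shows which one never returns, which is usually a smaller starting point than a Ctrl+C traceback from the whole run.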

egaebel, Sep 16 '18 20:09

I'm also running into the same issue.

I'm training on multiple datasets with ordered=True as recommended; however, this results in the following error:

Traceback (most recent call last):
  File "train.py", line 225, in <module>
    tf.app.run()
  File "/usr/lib/python3.7/site-packages/tensorflow/python/platform/app.py", line 125, in run
    _sys.exit(main(argv))
  File "train.py", line 221, in main
    train()
  File "train.py", line 216, in train
    filename)
  File "/usr/lib/python3.7/contextlib.py", line 130, in __exit__
    self.gen.throw(type, value, traceback)
  File "/usr/lib/python3.7/site-packages/tftables-1.1.2-py3.7.egg/tftables.py", line 552, in begin
  File "/usr/lib/python3.7/site-packages/tftables-1.1.2-py3.7.egg/tftables.py", line 521, in stop
  File "/usr/lib/python3.7/site-packages/tensorflow/python/training/coordinator.py", line 389, in join
    six.reraise(*self._exc_info_to_raise)
  File "/usr/lib/python3.7/site-packages/six.py", line 692, in reraise
    raise value.with_traceback(tb)
  File "/usr/lib/python3.7/site-packages/tensorflow/python/training/coordinator.py", line 297, in stop_on_exception
    yield
  File "/usr/lib/python3.7/site-packages/tftables-1.1.2-py3.7.egg/tftables.py", line 473, in __read_thread
  File "/usr/lib/python3.7/site-packages/tftables-1.1.2-py3.7.egg/tftables.py", line 393, in feed
  File "/usr/lib/python3.7/site-packages/tftables-1.1.2-py3.7.egg/tftables.py", line 293, in read_batch
  File "/usr/lib/python3.7/site-packages/multitables-1.1.1-py3.7.egg/multitables.py", line 270, in __enter__
    return self.arys[self.idx]
IndexError: list index out of range

When I run the same code with ordered=False, the code runs as expected but with corrupted batches.

sullivan-sean, Apr 09 '19 21:04