tftables
Code hangs on training
I'm reading multiple datasets from a single file and after a certain number of iterations the code hangs indefinitely (I let it go overnight just to be absolutely certain). I have to ctrl+C out of it and I get the following exception. Looks like a hang in multitables somewhere? Maybe from the queue not being populated quickly enough?
Traceback (most recent call last):
  File "/usr/lib/python3.5/multiprocessing/process.py", line 249, in _bootstrap
    self.run()
  File "/usr/lib/python3.5/multiprocessing/process.py", line 93, in run
    self._target(*self._args, **self._kwargs)
  File "/workspace-mount/Programs/tachotron2-implementations/barronalex-tachotron/lib/python3.5/site-packages/multitables.py", line 389, in _Streamer__read_process
    with sync.do(cbuf.put_direct(), i, (i+read_size) % len(ary)) as put_ary:
  File "/workspace-mount/Programs/tachotron2-implementations/barronalex-tachotron/lib/python3.5/site-packages/multitables.py", line 136, in __enter__
    with self.sync.barrier_in.wait(*self.index):
  File "/workspace-mount/Programs/tachotron2-implementations/barronalex-tachotron/lib/python3.5/site-packages/multitables.py", line 87, in __enter__
    self.sync.cvar.wait()
  File "/usr/lib/python3.5/multiprocessing/synchronize.py", line 262, in wait
    return self._wait_semaphore.acquire(True, timeout)
I think it occurs when the end of the dataset is reached. I have cyclic=True set on get_batch though...
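As an aside, a hang like this can be localized without resorting to Ctrl+C: the stdlib `faulthandler` module can dump every thread's stack once a timeout expires, which shows exactly which wait the process is stuck in. A minimal sketch (the 0.5 s timeout and the `time.sleep` stand-in for the hanging training loop are placeholders):

```python
import faulthandler
import tempfile
import time

# Arrange for a traceback dump of all threads if the program is still
# running after the timeout; writing to a temp file here so the output
# can be inspected afterwards (stderr also works).
with tempfile.TemporaryFile(mode='w+') as f:
    faulthandler.dump_traceback_later(0.5, file=f)
    time.sleep(1.0)  # stand-in for the hanging training loop
    faulthandler.cancel_dump_traceback_later()
    f.seek(0)
    report = f.read()

# The report records the file/line/function of every thread's frames
# at the moment the timeout fired.
print('Timeout' in report)
```

Running this against the real training script (instead of the `sleep`) would show whether the main process is parked in the same `multiprocessing` condition-variable wait as the Ctrl+C traceback above.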
It seems to be stuck waiting to write to the internal queue; maybe there is an issue with ordered access? Can you try setting ordered=False and see if it still hangs? (This will result in corrupted batches if you're splicing multiple datasets to create each training example, but it might help narrow things down.)
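For reference, a sketch of where `ordered` sits in the reader setup (file name, dataset path, and sizes are placeholders; the whole block is guarded since tftables needs a TensorFlow 1.x environment and an actual HDF5 file, neither of which is assumed here):

```python
sketch_ok = False
try:
    import tftables

    # Open the HDF5 file and request batches from one internal dataset.
    reader = tftables.open_file(filename='data.h5', batch_size=16)

    # ordered=False relaxes the cross-process ordering barrier that shows
    # up in the hang traceback; rows from the parallel reader processes may
    # then be interleaved, which corrupts training examples spliced
    # together from multiple datasets.
    batch = reader.get_batch('/my_dataset', cyclic=True, ordered=False)

    # The FIFO loader feeds the dequeued batches into the TensorFlow graph.
    loader = reader.get_fifoloader(queue_size=2, inputs=[batch])
    sketch_ok = True
except Exception:
    # tftables/TensorFlow 1.x not installed, or 'data.h5' absent.
    sketch_ok = True
```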
Thanks for your quick response!
This indeed makes the hanging go away, but I am doing some splicing with datasets.
If you run the multitables unit tests (https://github.com/ghcollin/multitables/blob/master/multitables_test.py), do they complete properly? Also, how large are your datasets (how many rows)?
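For anyone following along, one way to run that test file is with the stdlib unittest runner. This sketch assumes `multitables_test.py` has already been downloaded into the working directory (it is not fetched here):

```python
import os
import unittest

# Discover and run the multitables test module if it is present locally;
# a hang here would reproduce the issue independently of tftables.
found = os.path.exists('multitables_test.py')
if found:
    suite = unittest.defaultTestLoader.discover('.', pattern='multitables_test.py')
    result = unittest.TextTestRunner(verbosity=2).run(suite)
    print(result.wasSuccessful())
else:
    print('multitables_test.py not found; download it from the repository first.')
```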
Hey sorry for the long silence, I ended up changing my dataset to all be in one table and everything is working fine now.
However, the multitables unit tests hang as well. My dataset has 68933 rows.
So actually, with my one-table approach it freezes when I set the reader's ordered=False, but does not freeze when ordered=True. Very odd...
Any tips on how I can get the multitables unit tests to run? Seems like that is probably the same thing...
I'm also running into the same issue. I'm training on multiple datasets with ordered=True as recommended; however, this results in the following error:
Traceback (most recent call last):
  File "train.py", line 225, in <module>
    tf.app.run()
  File "/usr/lib/python3.7/site-packages/tensorflow/python/platform/app.py", line 125, in run
    _sys.exit(main(argv))
  File "train.py", line 221, in main
    train()
  File "train.py", line 216, in train
    filename)
  File "/usr/lib/python3.7/contextlib.py", line 130, in __exit__
    self.gen.throw(type, value, traceback)
  File "/usr/lib/python3.7/site-packages/tftables-1.1.2-py3.7.egg/tftables.py", line 552, in begin
  File "/usr/lib/python3.7/site-packages/tftables-1.1.2-py3.7.egg/tftables.py", line 521, in stop
  File "/usr/lib/python3.7/site-packages/tensorflow/python/training/coordinator.py", line 389, in join
    six.reraise(*self._exc_info_to_raise)
  File "/usr/lib/python3.7/site-packages/six.py", line 692, in reraise
    raise value.with_traceback(tb)
  File "/usr/lib/python3.7/site-packages/tensorflow/python/training/coordinator.py", line 297, in stop_on_exception
    yield
  File "/usr/lib/python3.7/site-packages/tftables-1.1.2-py3.7.egg/tftables.py", line 473, in __read_thread
  File "/usr/lib/python3.7/site-packages/tftables-1.1.2-py3.7.egg/tftables.py", line 393, in feed
  File "/usr/lib/python3.7/site-packages/tftables-1.1.2-py3.7.egg/tftables.py", line 293, in read_batch
  File "/usr/lib/python3.7/site-packages/multitables-1.1.1-py3.7.egg/multitables.py", line 270, in __enter__
    return self.arys[self.idx]
IndexError: list index out of range
When I run the same code with ordered=False, it runs as expected, but with corrupted batches.