Albert Zeyer

Results: 880 comments of Albert Zeyer

> With the new `partition` method, the sampling and shuffling will be done identically for all GPUs, but then a different partition is selected per GPU. This is done using...

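The exact mechanism is cut off in the excerpt above; as a rough illustration of the idea (identical shuffle on every GPU via a shared seed, then each GPU selecting its own partition), a minimal sketch might look like the following. `get_partition_for_gpu`, `gpu_rank`, and `num_gpus` are made-up names for illustration, not RETURNN's actual API.

```python
import numpy

def get_partition_for_gpu(num_seqs, epoch, gpu_rank, num_gpus):
    """Shuffle identically on all GPUs (same seed), then pick this GPU's partition."""
    rng = numpy.random.RandomState(epoch)  # same seed everywhere -> identical shuffle
    seq_order = rng.permutation(num_seqs)
    # Strided slice per GPU: partitions are disjoint and together cover all sequences.
    return seq_order[gpu_rank::num_gpus]

# E.g. with 4 GPUs, each rank gets ~num_seqs/4 sequences, with no overlap.
partitions = [get_partition_for_gpu(10, epoch=1, gpu_rank=r, num_gpus=4) for r in range(4)]
```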
> if you have a huge dataset and you want to shuffle it, you will always need to store an `O(total_num_seq)` list of integers. Ok, with #568, you get rid...

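For a sense of scale (my own numbers, not from the linked discussion): just holding the list of sequence indices as a NumPy array costs 8 bytes per sequence on a 64-bit platform, so 100M sequences already mean about 0.8 GB before any shuffling logic runs.

```python
import numpy

total_num_seq = 100_000_000  # e.g. a large MT corpus
seq_index_list = numpy.arange(total_num_seq)  # int64 by default on 64-bit platforms
print(seq_index_list.nbytes / 1e9, "GB")  # -> 0.8 GB, i.e. O(total_num_seq) memory
```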
> > Actually, if you use HDFDataset and disable its cache, it does not have the problem, because `load_seqs` does nothing then.
>
> Oh, I didn't think of that....

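For context, disabling the cache of HDFDataset would be done via the `cache_byte_size` dataset option in the RETURNN config (option name from memory, please double-check); roughly:

```python
# Hypothetical RETURNN config snippet; file names are placeholders.
train = {
    "class": "HDFDataset",
    "files": ["train.hdf"],
    "cache_byte_size": 0,  # disable the cache, so load_seqs effectively becomes a no-op
}
```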
Btw, I just saw [this Twitter post on large data shuffling behavior](https://twitter.com/borisdayma/status/1447939363296489473), which might be relevant for you.

> Let's say 100M sequences. This is very high-resource, but definitely realistic for MT data nowadays.
>
> I used this script to simulate what is currently done in `get_seq_order_for_epoch()`:...

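The actual script is truncated in the excerpt above; presumably the simulation amounts to timing (and sizing) a full random permutation over 100M indices, something like this sketch:

```python
import time
import numpy

total_num_seq = 100_000_000  # "100M sequences"

start = time.time()
rng = numpy.random.RandomState(42)
seq_order = rng.permutation(total_num_seq)  # a full shuffle over all sequence indices
print("permutation of %iM indices: %.1f sec, %.1f GB" % (
    total_num_seq // 1_000_000, time.time() - start, seq_order.nbytes / 1e9))
```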
> I will also try your fix for `shard`.

Please also make this a separate PR.

I just realized that the hang itself is in a `sess.run`, which is not related at all to Horovod. However:

* Searching for similar TensorFlow-related problems (hangs in `DoConvolve`,...

To add, `dmesg` shows these messages, which might be related:

```
[Thu Aug 27 20:06:05 2020] pcieport 0000:80:03.0: AER: Corrected error received: id=8018
[Thu Aug 27 20:06:05 2020] pcieport 0000:80:03.0:...
```

@Spotlight0xff reported that this might be related to `OMP_NUM_THREADS`. Specifically, he observed the hangs with `OMP_NUM_THREADS=6` but not anymore with `OMP_NUM_THREADS=1`. RETURNN might also use this value for `intra_op_parallelism_threads`/`inter_op_parallelism_threads` (there...

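For reference, the TF-level options in question can be set like this (a plain TF1-style sketch, not the exact RETURNN code):

```python
import os
import tensorflow as tf

num_threads = int(os.environ.get("OMP_NUM_THREADS", "1"))

config = tf.compat.v1.ConfigProto(
    intra_op_parallelism_threads=num_threads,  # threads used inside a single op
    inter_op_parallelism_threads=num_threads,  # threads used across independent ops
)
session = tf.compat.v1.Session(config=config)
```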
@Spotlight0xff Is that still used for `intra_op_parallelism_threads`/`inter_op_parallelism_threads` as well, or is it now independent (because it correctly uses your SGE num_proc setting)? In the first case, it would mean this is somehow related...