Albert Zeyer

Results 1028 comments of Albert Zeyer

Yes, I have. Are you saying I'm wrong, or just double checking? The code for group-normalization is (with added comments, and prefix, for case G=1): ``` B, T, F =...

> So in the original comment, do you mean `moments` on `[T,F]` with `G=1` vs. moments on `[F]` with layer normalization? Yes, exactly. That's not the same. The behavior is very...
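To make the distinction concrete, here is a minimal sketch of the two sets of moments, written with plain Python lists so it runs without any framework. It assumes a `[B, T, F]` input, population variance, and no epsilon or `gamma`/`beta`; the function names are hypothetical and are not taken from the code under discussion.

```python
from statistics import fmean, pvariance


def layer_norm_moments(x):
    """Moments over F only: one (mean, var) pair per (batch, time) position."""
    return [[(fmean(frame), pvariance(frame)) for frame in seq] for seq in x]


def group_norm_g1_moments(x):
    """Moments over T and F jointly (a single group): one pair per batch entry."""
    moments = []
    for seq in x:
        flat = [value for frame in seq for value in frame]
        moments.append((fmean(flat), pvariance(flat)))
    return moments


# Toy input with shape [B=1, T=2, F=2].
x = [[[0.0, 2.0], [10.0, 12.0]]]

ln = layer_norm_moments(x)     # [[(1.0, 1.0), (11.0, 1.0)]]
gn = group_norm_g1_moments(x)  # [(6.0, 26.0)]
print(ln, gn)
```

In this toy input each frame has the same spread but a different offset: the per-frame moments remove the offset, while the joint moments over `[T, F]` mix both frames into one mean and one variance.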

I think the GroupNorm implementation is fine. At least it seems to be what is commonly implemented. I think the LayerNorm implementation is also fine. This seems to be the...

Yes, right. If you use LN and configure it to normalize over all axes (except batch), then `gamma`/`beta` is wrong. Which is not even possible for my example, as T...

> resulting in a CPU overcommit (because the number of assigned CPUs matches the number of data processes), adversely affecting performance. How do you know this is really negatively affecting...

> Look at the load values. The job is assigned 48 cores by SLURM, but seems to produce a ~190 15min load average. But why is this bad? What I...
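For reference, the numbers quoted above can be put in relation with a few lines of Python. This is only a sketch: `available_cores` and `load_per_core` are hypothetical helper names, and the values 190 and 48 are the ones from the quote, not new measurements.

```python
import os


def available_cores():
    """Cores this process may run on (respects CPU affinity where the OS exposes it)."""
    if hasattr(os, "sched_getaffinity"):  # Linux only
        return len(os.sched_getaffinity(0))
    return os.cpu_count() or 1


def load_per_core(load_avg, num_cores):
    """Average number of runnable (or waiting) tasks per assigned core."""
    return load_avg / num_cores


# Values from the quoted report: ~190 load average on 48 assigned cores.
ratio = load_per_core(190.0, 48)
print(f"{ratio:.2f} tasks per assigned core")
```

Note that on Linux the load average also counts tasks in uninterruptible sleep (typically waiting on I/O), so a ratio above 1 does not by itself show that the CPUs are the bottleneck.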

I explained this already: When considering hyperthreading and/or locality of data, I can imagine that multiple threads per worker can still be beneficial (independent of how much other workers...

Sometimes also like this: ``` ... ep 28 train, step 56, ctc_4 2.616, ctc_8 2.268, ctc 2.221, num_seqs 8, max_size:time 278344, max_size:out-spatial 67, mem_usage:cuda:0 6.3GB, 0.658 sec/step ep 28 train,...

I just ran the [Colab from above](https://colab.research.google.com/gist/tilakrayal/6996de1c84dbdf1431043370d1c9ea08/52200.ipynb) with recent TF 2.17.0, and the same error still occurs.

I'm reading the code of `_checkpoint_without_reentrant_generator`. It looks like this uses a couple of techniques which are very relevant for what we need: There is logic for `preserve_rng_state`. It gets...