Results: 13 comments of Emily

True, but it should not be asymptotically slower.

It's somewhat expectation-breaking: the containers library typically has optimal asymptotics, and it is intuitive to expect IntMap to be a strict performance improvement over Map Int, but making that...

A good alternative to fix the invertibility issue would be to use the LU decomposition (which is included in the code, in model.py, but not used by default), with the...
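As a rough sketch of what the LU-based parameterization looks like (in NumPy/SciPy here for brevity, not the repo's TensorFlow `model.py`; `lu_init` is a hypothetical helper, not a function from the codebase):

```python
import numpy as np
import scipy.linalg

def lu_init(c, seed=0):
    # Hypothetical helper: initialize an invertible 1x1 conv weight
    # as a random orthogonal matrix, then factor it as W = P L U
    # (P a permutation, L unit lower triangular, U upper triangular).
    rng = np.random.default_rng(seed)
    w = np.linalg.qr(rng.standard_normal((c, c)))[0]
    p, l, u = scipy.linalg.lu(w)
    return w, p, l, u

w, p, l, u = lu_init(4)
# log|det W| reduces to a sum over the diagonal of U, which is cheap
# to compute, and W stays invertible as long as no diagonal entry of
# U is driven to zero during training.
logdet = np.sum(np.log(np.abs(np.diag(u))))
```

The appeal is that invertibility becomes a property of the parameterization itself rather than something the optimizer has to preserve.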

To follow up on this, I implemented the orthogonality penalty, as a simple -20*||(w'w - I)||_F^2 term in the objective function (at invertible_1x1_conv). That is, the summed elementwise squared difference...
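For concreteness, the penalty described above (the summed elementwise squared difference between WᵀW and the identity, scaled by -20) can be sketched like this in NumPy; the function name is mine, not from the repo:

```python
import numpy as np

def orthogonality_penalty(w, weight=20.0):
    # -weight * ||W^T W - I||_F^2: the summed elementwise squared
    # difference between W^T W and the identity matrix. Added to a
    # maximized objective, it pulls W toward the orthogonal group.
    k = w.shape[1]
    diff = w.T @ w - np.eye(k)
    return -weight * np.sum(diff ** 2)
```

The penalty is exactly zero for an orthogonal W and grows as W drifts away from orthogonality.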

Hm... I don't use docker so I didn't notice this broke. I think tensorflow 1.14 or later should be compatible with this though, so maybe it'll just work if it...

Is this with the 345M model? I've found it only just fits on a 1080 Ti, so anything using substantial VRAM, like a browser running in the background, can push it...

iocaposk8, that change is one way to reduce the memory usage. You are basically shortening the model's memory there, by allowing it to only remember the last 512 words instead...
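The effect of that change is essentially this (a toy sketch, not the repo's actual code path):

```python
def truncate_context(tokens, window=512):
    # Keep only the most recent `window` tokens. The model then
    # attends over at most `window` positions, trading memory use
    # for a shorter effective memory of the preceding text.
    return tokens[-window:]
```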

`past` is used for incremental evaluation of the model, similar to [Fast Wavenet Generation Algorithm](https://arxiv.org/abs/1611.09482). It caches the intermediate layer activations at previous tokens so that the next token can...
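A minimal sketch of the caching idea (plain NumPy, single head, no projections; `attend_incremental` is an illustrative name, not the repo's API):

```python
import numpy as np

def attend_incremental(q, k_cache, v_cache):
    # q is the query for the newest token only; k_cache and v_cache
    # hold keys and values for all previous tokens, so earlier
    # activations are never recomputed at each generation step.
    scores = k_cache @ q / np.sqrt(len(q))
    weights = np.exp(scores - scores.max())
    weights /= weights.sum()
    return weights @ v_cache
```

Per step, you compute k and v for just the new token, append them to the caches, and attend from the new position only, so generating each token costs O(context) instead of re-running the full sequence.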

Hmm, I haven't tried something like this, so I don't know for sure. That said, it may be as simple as editing `n_layer` in the model `hparams.json`, to a smaller...
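The edit itself is small; something like this hypothetical helper (the target value 6 is an arbitrary example, and note the released checkpoints won't load cleanly into a smaller model, so this mainly makes sense for training from scratch or after pruning the checkpoint to match):

```python
import json

def shrink_hparams(hparams_json, n_layer=6):
    # Hypothetical helper: lower n_layer in a GPT-2 hparams.json
    # string while leaving every other hyperparameter untouched.
    hparams = json.loads(hparams_json)
    hparams["n_layer"] = n_layer
    return json.dumps(hparams)
```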

I don't see why not. Well, I suppose it depends exactly what you mean by "from scratch". For learning many tasks I would probably start with the released GPT-2...