Thomas Melistas
I may be too naive, because I have never written CUDA kernels, but wouldn't just changing `float` to `half` in the original CUDA code work?
Haha, nice and quick hack, I'll give it a try. If I have time I'll do some reading on mixed precision, because it could really speed up the causal...
After a few iterations I got the following:

```
Traceback (most recent call last):
  File "train_performer.py", line 180, in <module>
    model_engine.backward(loss)
  File "/usr/local/lib/python3.6/dist-packages/deepspeed/runtime/engine.py", line 922, in backward
    self.optimizer.backward(loss)
  File "/usr/local/lib/python3.6/dist-packages/deepspeed/runtime/zero/stage1.py", line...
```
I think it fixes it too, but when generating with fp16 there is the following error:

```
Traceback (most recent call last):
  File "train_performer.py", line 198, in <module>
    out = model.generate(inp.to(device),...
```
Oh, I see, I did this one myself. Almost correct; I think you have to do the `kwargs.update(context_mask = context_mask)` outside the `if`, because if `context_mask` is provided it is...
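To illustrate what I mean, here's a minimal sketch of the pattern (names and structure assumed for illustration, not copied from the library): the mask should be forwarded through `kwargs` unconditionally, and only *built* inside the `if` when the caller didn't supply one.

```python
def generate(x, context_mask=None, **kwargs):
    # Hypothetical simplified version of a generate() method.
    if context_mask is None:
        # Default mask: attend to every position.
        context_mask = [True] * len(x)
    # The update must happen OUTSIDE the `if`; if it were inside,
    # a user-supplied context_mask would be silently dropped.
    kwargs.update(context_mask=context_mask)
    return kwargs  # stand-in for forwarding kwargs to the model
```

With this shape, both the default and the user-supplied mask reach the model.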
Just to let you know, it won't work with [amp/apex](https://github.com/NVIDIA/apex) because of https://github.com/lucidrains/performer-pytorch/issues/44#issuecomment-741034223. But fp16 works fine.
Hey, I should add that I observed a lot of NaN losses with fp16.
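The NaNs are consistent with fp16's narrow numeric range rather than a bug in the model; a quick stdlib sketch (using `struct`'s half-precision `'e'` format, so no torch needed) shows how easily fp16 loses or overflows values:

```python
import struct

def to_fp16(x):
    """Round-trip a Python float through IEEE 754 half precision."""
    return struct.unpack('<e', struct.pack('<e', x))[0]

# fp16 has an 11-bit significand: integers above 2048 stop being exact
print(to_fp16(2049.0))   # rounds to 2048.0

# tiny gradients underflow to zero (smallest subnormal is ~6e-8)
print(to_fp16(1e-8))     # 0.0

# the largest finite fp16 value is 65504; anything bigger overflows
try:
    struct.pack('<e', 1e5)
except OverflowError as err:
    print('overflow:', err)
```

Overflowing activations or gradients become inf, and inf - inf or 0 * inf then produces the NaN losses; this is why fp16 training usually relies on dynamic loss scaling and gradient clipping.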
Yep, I noticed it with deepspeed fp16 and with amp (under deepspeed) at opt_level O2 (O1 would give the error from https://github.com/lucidrains/performer-pytorch/issues/44#issuecomment-741034223)
Yep, I forgot to add `amp_enabled = True`. I will try it after the current training finishes. I used clipping in the form `"max_grad_norm": 0.5`, like you do in the enwik8...
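For reference, a minimal sketch of what my deepspeed config looks like with fp16 and clipping enabled (key names taken from DeepSpeed's config schema as I understand it; the batch size is just illustrative):

```json
{
  "train_batch_size": 8,
  "gradient_clipping": 0.5,
  "fp16": {
    "enabled": true,
    "loss_scale": 0
  }
}
```

`"loss_scale": 0` should mean dynamic loss scaling, which is what you generally want to keep fp16 gradients from under/overflowing.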
I'll let you know about deepspeed and amp O1 in a few days too