pfeatherstone


https://ofir.io/train_short_test_long.pdf, the one you reference in your README. I have to admit I haven't read it in great detail, but they suggest ALiBi is great.

Basically I need a positional embedding that length-extrapolates well, works with memories, and is compatible with Flash Attention. Do you have any suggestions?

What do you mean by curriculum learning to longer sequence lengths? Sorry if my questions are dumb.

Presumably in `apply_rotary_pos_emb()` we need to add `scale = scale[-seq_len:, :]`?
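
Roughly what I have in mind, as a sketch (this is not the actual x-transformers implementation, just the usual GPT-NeoX-style shape of the function, with the proposed slice added):

```python
import torch

def rotate_half(x):
    # standard rotary helper (sketch, not the library's exact version)
    x1, x2 = x.chunk(2, dim = -1)
    return torch.cat((-x2, x1), dim = -1)

def apply_rotary_pos_emb(t, freqs, scale = 1.):
    # t:     (..., seq_len, dim)
    # freqs: (max_len, dim)   -- may be built for a longer (cached) length
    # scale: (max_len, dim) xpos scale, or a plain 1.
    seq_len = t.shape[-2]
    freqs = freqs[-seq_len:, :]
    if torch.is_tensor(scale):
        scale = scale[-seq_len:, :]   # proposed addition: keep scale aligned with the sliced freqs
    return (t * freqs.cos() * scale) + (rotate_half(t) * freqs.sin() * scale)
```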

As an aside, why is all of the RotaryEmbedding code decorated with `@torch.cuda.amp.autocast(enabled = False)`? You can remove it with just a couple of tweaks, and it then supports `torch.bfloat16`.
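
For example, something along these lines (a sketch of the kind of tweak I mean, not the library's actual class) should keep the frequency math in float32 under bfloat16 autocast without needing the decorator:

```python
import torch
from torch import nn

class SimpleRotaryEmbedding(nn.Module):
    # Sketch only; the real RotaryEmbedding in x-transformers has more features
    # (xpos scale, interpolation, ...).
    def __init__(self, dim, base = 10000):
        super().__init__()
        inv_freq = 1. / (base ** (torch.arange(0, dim, 2, dtype = torch.float32) / dim))
        self.register_buffer('inv_freq', inv_freq, persistent = False)

    def forward(self, seq_len, device = None):
        device = device if device is not None else self.inv_freq.device
        t = torch.arange(seq_len, device = device, dtype = torch.float32)
        # plain broadcasting multiply instead of einsum: elementwise ops are not
        # on autocast's low-precision cast list, so this should stay in float32
        # even when the surrounding forward runs under bfloat16 autocast
        freqs = t[:, None] * self.inv_freq[None, :]
        return torch.cat((freqs, freqs), dim = -1)
```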

Also, I think the `scale` calculation is incorrect when using mems, since the positions are off. You have to use the same trick of starting from a negative position.
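
Sketch of the trick I mean (function name is mine, not the library's):

```python
import torch

def rotary_positions(seq_len, mem_len, device = None):
    # When `mem_len` cached tokens precede the current chunk, put the memories at
    # negative positions so the current tokens keep the same positions (and hence
    # the same xpos-style `scale`) across segments, instead of restarting at 0.
    return torch.arange(-mem_len, seq_len, device = device)

print(rotary_positions(seq_len = 3, mem_len = 4))
# tensor([-4, -3, -2, -1,  0,  1,  2])
```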

https://github.com/lucidrains/x-transformers/pull/234 I believe this fixes it.

Other candidates are ALiBi or no positional embeddings at all. For the latter, in order for it to work, do you need to train with a range of sizes so...
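
By "train with a range of sizes" I mean something like this (hypothetical sketch, names made up):

```python
import random
import torch

def sample_batch(tokens, batch_size, min_len = 128, max_len = 2048):
    # Hypothetical sketch: `tokens` is a long 1-D tensor of token ids.
    # Crop each batch to a random length so training covers a range of sizes
    # rather than a single fixed sequence length.
    seq_len = random.randint(min_len, max_len)
    starts = torch.randint(0, tokens.numel() - seq_len, (batch_size,))
    return torch.stack([tokens[int(s) : int(s) + seq_len] for s in starts])
```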

Then if I change `*x.shape,` to `x.shape[0], x.shape[1]` I get another error:

```
x_transformers.py", line 1238, in forward
    rotary_pos_emb = self.rotary_pos_emb(max_rotary_emb_length)
    return _VF.einsum(equation, operands)  # type:...
```

It would seem that during normal inference `max_rotary_emb_length` is an `int`, whereas during JIT tracing or ONNX export it's a 0-dimensional tensor. EDIT: It looks like, generally, something like `x.shape[0]` is...
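
A sketch of the kind of normalisation that would be needed (helper name is mine):

```python
import torch

def as_int(length):
    # During eager inference `max_rotary_emb_length` shows up as a plain int,
    # but under torch.jit.trace / ONNX export shape accesses like x.shape[1]
    # can come back as 0-dimensional tensors. Normalise before treating it as a
    # Python int. Note: calling .item() during tracing bakes the value into the
    # graph, so this trades away dynamic lengths.
    if torch.is_tensor(length):
        return int(length.item())
    return length
```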