
Pay Attention When Required

lunixbochs opened this issue on Jan 2, 2021 · 5 comments

First, thanks for the great repo!

Here's a recent paper from NVIDIA, "Pay Attention when Required": https://arxiv.org/pdf/2009.04534v2.pdf
It seems like a similar concept to Sandwich, but faster, simpler, and with near-identical perplexity.

Edit: Oh, I see you mention it already. Is there a parameter exposed for it?

This was further corroborated by a paper by Nvidia that reduces the number of attention layers to be 1/3rd of the feedforwards without loss in performance.

lunixbochs, Jan 2, 2021

I think this commit implements layer allocation for PAR compatible with the paper:

Though, it might make more sense to set par_depth to depth * len(default_block) if anyone is mixing this with other block types.

depth = 16
par_ratio = 5
default_block = ('a', 'f')

# the default block is one attention + one feedforward layer, so the layer budget is depth * 2
par_depth = depth * 2
assert 1 < par_ratio <= par_depth, 'par ratio out of range'
# strip trailing feedforwards from the block; they get re-added as padding below
while default_block[-1] == 'f':
    default_block = default_block[:-1]
par_attn  = par_depth // par_ratio
# 2/3 attention layer cutoff suggested by PAR paper
depth_cut = par_depth * 2 // 3
# width of each repeated block, spacing the attention layers across the first ~2/3 of the network
par_width = (depth_cut + depth_cut // par_attn) // par_attn
assert len(default_block) <= par_width, 'default block is too large for par_ratio'
# pad each block out to par_width with feedforwards, repeat it par_attn times,
# then fill the rest of the network with pure feedforward layers
par_block = default_block + ('f',) * (par_width - len(default_block))
par_head = par_block * par_attn
layer_types = par_head + ('f',) * (par_depth - len(par_head))

print('default_block:', ''.join(default_block))
print('par_attn:     ', par_attn)
print('par_width:    ', par_width)
print('par_block:    ', ''.join(par_block))
print('par_head:     ', ''.join(par_head))
print('layer_types:  ', ''.join(layer_types))
print('last_attn:    ', ''.join(layer_types).rindex('a') + 1)

output:

default_block: a
par_attn:      6
par_width:     4
par_block:     afff
par_head:      afffafffafffafffafffafff
layer_types:   afffafffafffafffafffafffffffffff
last_attn:     21

lunixbochs, Jan 2, 2021

Haha yup, I had this paper in mind, hence the custom_layers keyword you can pass in, though it's undocumented: https://github.com/lucidrains/x-transformers/blob/main/x_transformers/x_transformers.py#L364
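
For illustration, here is a minimal sketch of wiring the PAR layout computed above through that custom_layers keyword. This assumes custom_layers accepts a tuple of layer-type characters ('a' for attention, 'f' for feedforward); the vocabulary size, sequence length, and model dimensions are placeholders, not values from the thread.

import torch
from x_transformers import TransformerWrapper, Decoder

# PAR-style allocation from the snippet above: 'afff' repeated six times,
# followed by eight trailing feedforward layers
layer_types = ('a', 'f', 'f', 'f') * 6 + ('f',) * 8

model = TransformerWrapper(
    num_tokens = 20000,                # placeholder vocabulary size
    max_seq_len = 1024,                # placeholder sequence length
    attn_layers = Decoder(
        dim = 512,
        depth = 16,
        heads = 8,
        custom_layers = layer_types    # the undocumented keyword mentioned above
    )
)

x = torch.randint(0, 20000, (1, 1024))
logits = model(x)                      # (1, 1024, 20000)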

lucidrains, Jan 2, 2021

Actually I think that's naively wrong too. It should be len(default_block) per depth unit for the first 2/3 section but only 2 for the rest; otherwise you'll get len(default_block) - 2 extra ff layers for each unit in the remaining third.
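
To make that concrete, here is a hypothetical helper (not part of x-transformers) sketching one reading of the corrected budget, assuming a default_block that includes cross attention, e.g. ('a', 'c', 'f'):

# hypothetical helper: the first 2/3 of the depth keeps the full default block,
# while the remaining 1/3 only replaces the ('a', 'f') pair with feedforwards
def par_layer_budget(depth, default_block):
    head_depth = depth * 2 // 3      # depth units that keep the full block
    tail_depth = depth - head_depth  # depth units that become feedforward-only
    return head_depth * len(default_block) + tail_depth * 2

# with depth = 16 and default_block = ('a', 'c', 'f'):
# naive budget:     depth * len(default_block) = 48
# corrected budget: 10 * 3 + 6 * 2 = 42
# i.e. the naive version adds len(default_block) - 2 = 1 extra 'f' per tail unit
print(par_layer_budget(16, ('a', 'c', 'f')))  # 42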

On Jan 2, 2021, at 6:24 PM, Phil Wang wrote:

 Ohh sorry I didn't read your last message carefully. Turns out I did build PAR, I just forgot about it lol

Ok I'll make the default block change


lunixbochs, Jan 3, 2021

@lunixbochs Threw it out there for now: https://github.com/lucidrains/x-transformers/commit/af23656c81614a65fe05cfaa79be97af12d09ea8 :) Yeah, I don't think they explored this when a cross-attention layer is present.
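
For reference, a minimal usage sketch of what that commit appears to expose, assuming the allocation is driven by a par_ratio keyword on the decoder (roughly one attention layer per par_ratio layers, as in the snippet above); the other hyperparameters are placeholders:

from x_transformers import TransformerWrapper, Decoder

model = TransformerWrapper(
    num_tokens = 20000,
    max_seq_len = 1024,
    attn_layers = Decoder(
        dim = 512,
        depth = 16,
        heads = 8,
        par_ratio = 5    # PAR allocation: roughly 1 attention layer per 5 layers
    )
)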

lucidrains, Jan 3, 2021

We'll just hit on the general idea.

lucidrains, Jan 3, 2021