Kyle Gorman
Thanks for looking into this @Adamits. (Sorry @Othergreengrasses, I wasn't in front of a computer with a GPU, so I was just looking at the CLI arguments.) Thanks for finding...
> Hey y'all, asked for this error to be put up. I think the main issue is this line:
>
> `p_gen += self.W_emb(target_embeddings) + self.bias.expand(...`
>
> PyTorch...
My $.02 is that when they work, in-place operations are honkin' great, but sometimes they have a weird effect on the computation graph.
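For concreteness, here's a minimal sketch of the failure mode (my own toy repro, not the project's code): an in-place `+=` that overwrites a tensor autograd saved for backward raises a `RuntimeError`, while the out-of-place version is fine.

```python
import torch

# Sigmoid's backward needs its own output, so overwriting that output
# in place invalidates the computation graph.
x = torch.randn(3, requires_grad=True)
y = x.sigmoid()
y += 1.0  # in-place update bumps y's version counter
try:
    y.sum().backward()
except RuntimeError as err:
    print(err)  # "... has been modified by an inplace operation"

# The out-of-place version allocates a new tensor and backprops fine.
x = torch.randn(3, requires_grad=True)
y = x.sigmoid() + 1.0
y.sum().backward()
```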
I am weakly opposed. It is a big source of complexity in FairSeq and we don't have any reason to suppose it improves things on this task. (That said, fork...
`examples` is the wild west; do what you will there, within reason ;)
Dumb question, but how is this different from the type of decoder-only LM we were talking about?
I think Adam has an implementation in his fork, but hasn’t PRed it yet.
+1. Makes sense.
So this is an approximation/hack, right? I'm fine with it, and maybe we could treat it as a separate architecture to keep things simple.
> So this is an approximation/hack, right?

I don't think so. I think e.g. the attention of t w.r.t. t-1 will always be the same. So caching...
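To see why caching is exact rather than approximate, here's a toy illustration (my own sketch with made-up weights `W_q`, `W_k`, `W_v`; not the project's code): under causal attention, the outputs for a prefix are identical whether you recompute the whole sequence or stop at the prefix, so cached states never go stale.

```python
import torch

torch.manual_seed(0)
d = 8
W_q, W_k, W_v = (torch.randn(d, d) for _ in range(3))

def attend(h):
    """Single-head causal self-attention over a (time, d) tensor."""
    q, k, v = h @ W_q, h @ W_k, h @ W_v
    mask = torch.tril(torch.ones(len(h), len(h), dtype=torch.bool))
    scores = (q @ k.T / d**0.5).masked_fill(~mask, float("-inf"))
    return scores.softmax(dim=-1) @ v

x = torch.randn(5, d)       # a 5-token sequence
full = attend(x)            # recompute attention for all 5 steps
prefix = attend(x[:4])      # what a cache would have stored at step 4
print(torch.allclose(full[:4], prefix))  # True: earlier steps never change
```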