Sam Havens
@Arij-Aladel yes, in the original post Ofir says:

> The T5 model uses no positional information in cross-attention and I would recommend doing the same thing.
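For context, a minimal sketch in plain PyTorch (illustrative only, not the actual T5 or llm-foundry code) of what that means: cross-attention scores are just scaled QK^T with no positional term, whereas ALiBi-style self-attention adds a bias to the scores.

```python
import torch
import torch.nn.functional as F

def cross_attention(q, k, v):
    # No positional term: scores are just scaled QK^T (T5-style cross-attention).
    scores = q @ k.transpose(-2, -1) / (q.size(-1) ** 0.5)
    return F.softmax(scores, dim=-1) @ v

def self_attention_with_alibi(q, k, v, alibi_bias):
    # For contrast: ALiBi adds a per-head linear bias to the attention scores.
    scores = q @ k.transpose(-2, -1) / (q.size(-1) ** 0.5)
    return F.softmax(scores + alibi_bias, dim=-1) @ v
```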
Hi @metacarbon, I have [a PR here](https://github.com/mosaicml/llm-foundry/pull/101) which hopefully addresses the issue you are running into. Basically, `load_path` is for Composer checkpoints; there is a different syntax for models loaded...
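To illustrate the distinction (a sketch only, with placeholder names and paths, not the exact llm-foundry config surface): Hugging Face weights are loaded with `from_pretrained`, while a Composer checkpoint is resumed through the Trainer's `load_path`.

```python
from composer import Trainer
from composer.models import HuggingFaceModel
from transformers import AutoModelForCausalLM, AutoTokenizer

# Names/paths below are placeholders for illustration.
HF_NAME = "mosaicml/mpt-7b"
COMPOSER_CKPT = "s3://my-bucket/my-run/ep0-ba1000-rank0.pt"

# 1) Hugging Face weights load via `from_pretrained`, not `load_path`.
hf_model = AutoModelForCausalLM.from_pretrained(HF_NAME, trust_remote_code=True)
tokenizer = AutoTokenizer.from_pretrained(HF_NAME)

# 2) Composer checkpoints resume through the Trainer's `load_path`.
trainer = Trainer(
    model=HuggingFaceModel(hf_model, tokenizer=tokenizer),
    load_path=COMPOSER_CKPT,
)
```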
Before training starts, are you seeing a warning about some layer weights not being used?
I notice this PR has been open for a while; what's its status?
Not having to deal with CUDA for an RMSNorm kernel is appealing, yeah 😄 It's not high priority currently, but I wanted to keep tabs on this...
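For reference, the op itself is simple; a plain-PyTorch version (not the fused kernel being discussed) looks roughly like this:

```python
import torch
import torch.nn as nn

class RMSNorm(nn.Module):
    """Reference RMSNorm in plain PyTorch, for comparison with a fused kernel."""

    def __init__(self, dim: int, eps: float = 1e-6):
        super().__init__()
        self.eps = eps
        self.weight = nn.Parameter(torch.ones(dim))

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # Normalize by the root-mean-square over the last dimension, then scale.
        rms = torch.rsqrt(x.pow(2).mean(dim=-1, keepdim=True) + self.eps)
        return x * rms * self.weight
```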
> except that it's applied in a somewhat nonstandard way in the fwd pass of the transformer module

@tginart Can you say more about this?
Also ACKing the request for a blog post.
If you are running transformer models locally without a GPU, including MPT, you should probably check out the GGML project. There is an open PR to add support for MPT: https://github.com/ggerganov/ggml/pull/145
When I run this, I am seeing the S3 download fail at 29% with:

```sh
Downloading ift/jsonl_test: 0%| | 0.00/906k [00:00
```
@hanlint yes, my understanding is that this support would require outputting the attention matrices from the attention module, which can't happen with FlashAttention; meaning...
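To make the constraint concrete, here is a minimal sketch in plain PyTorch (not the llm-foundry attention module): a naive implementation materializes the softmax(QK^T) matrix and can return it, while a fused FlashAttention-style path such as `F.scaled_dot_product_attention` only returns the output and never exposes the weights.

```python
import torch
import torch.nn.functional as F

def naive_attention(q, k, v):
    # Materializes the full attention matrix, so it can be returned/inspected.
    scores = q @ k.transpose(-2, -1) / (q.size(-1) ** 0.5)
    attn = F.softmax(scores, dim=-1)
    return attn @ v, attn

def fused_attention(q, k, v):
    # Fused kernels never build the full matrix, so there is nothing to return.
    return F.scaled_dot_product_attention(q, k, v)

q = k = v = torch.randn(1, 8, 16, 64)
out, attn_weights = naive_attention(q, k, v)  # attn_weights: (1, 8, 16, 16)
out_fused = fused_attention(q, k, v)          # output only, no attention weights
```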