torchscale
Foundation Architecture for (M)LLMs
Thank you for your great work! I've noticed that your decoder_retention_heads is set to 3 by default, and the mask is also expanded to three dimensions to match. Have you...
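The per-head decay mask the issue refers to can be sketched as follows. This is a minimal NumPy illustration of the RetNet decay mask from the paper (the variable names are assumptions, not torchscale's own identifiers): each of the h retention heads gets its own decay rate gamma, producing a mask of shape (num_heads, seq_len, seq_len), which is why a 2D mask must be expanded to three dimensions.

```python
import numpy as np

# Hedged sketch of RetNet's per-head causal decay mask.
# Per the RetNet paper, head i uses decay rate gamma_i = 1 - 2^(-5 - i).
num_heads, seq_len = 3, 4
gamma = 1 - 2.0 ** (-5 - np.arange(num_heads))   # one decay rate per head

n = np.arange(seq_len)
exponent = n[:, None] - n[None, :]               # n - m for positions (n, m)
causal = exponent >= 0                            # only attend to the past

# Broadcasting gamma over the (seq_len, seq_len) grid yields the 3D mask.
mask = np.where(causal, gamma[:, None, None] ** exponent, 0.0)
print(mask.shape)                                 # (3, 4, 4)
```

Positions above the diagonal are zeroed (causality), and entry (h, n, m) decays as gamma_h ** (n - m), so more distant tokens contribute less.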
Bumps [pyarrow](https://github.com/apache/arrow) from 9.0.0 to 14.0.1. Commits ba53748 MINOR: [Release] Update versions for 14.0.1 529f376 MINOR: [Release] Update .deb/.rpm changelogs for 14.0.1 b84bbca MINOR: [Release] Update CHANGELOG.md for 14.0.1 f141709...
Bumps [transformers](https://github.com/huggingface/transformers) from 4.8.1 to 4.36.0. Release notes Sourced from transformers's releases. v4.36: Mixtral, Llava/BakLlava, SeamlessM4T v2, AMD ROCm, F.sdpa wide-spread support New model additions Mixtral Mixtral is the new...
This is a simple fix for the issue that PyTorch no longer provides torch._six.
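The usual shape of such a fix is a small compatibility shim. A hedged sketch, assuming the affected code only imported `inf` from `torch._six` (which was removed in PyTorch 1.13):

```python
import math

# Compatibility shim for the torch._six removal (PyTorch >= 1.13).
# Older code did: from torch._six import inf
try:
    from torch import inf  # available in PyTorch >= 1.13
except ImportError:
    inf = math.inf         # fallback if torch is absent or older

print(inf > 1e308)  # behaves like an ordinary float infinity
```

Other names that used to live in `torch._six` (e.g. `string_classes`) need their own replacements; the pattern is the same, so check which symbols the failing import actually pulled in.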
Hello, I followed the blog post https://zenn.dev/selllous/articles/retnet_tutorial shared in #52 in order to train RetNet, and it seems to work well for small models (< 3B). But I am unable...
The fix of the normalization Rnm is completely wrong. The max value added in clamp is only needed because of the wrong placement of the abs() operation. I put a more thorough explanation here: https://github.com/microsoft/torchscale/commit/fdd8838a756c7c435d7f8a1e4303e150dfac7442#commitcomment-134758047 Commented...
Bumps [scipy](https://github.com/scipy/scipy) from 1.6.3 to 1.10.0. Release notes Sourced from scipy's releases. SciPy 1.10.0 Release Notes SciPy 1.10.0 is the culmination of 6 months of hard work. It contains many...
As opposed to the other architectures in this package, RetNet doesn't have support for padding as far as I'm aware. I was thinking the best place to introduce it was...
Hi, first of all thank you for the nice work. I was reading the paper and found the weight decay mentioned in the appendix is different from the one mentioned...
In the RetNet model, embed_tokens is not given, so I can't run the code. When I use this model, what should I pass for the token_embeddings parameter? ...