torchscale
Foundation Architecture for (M)LLMs
Hello, I have followed the training configuration introduced here (https://github.com/microsoft/torchscale/issues/52) with the retnet_medium architecture. I have a few questions I would appreciate answers to. The first is about...
I've rewritten the `torchscale.architecture.config` module to use inheritance and remove the redundant code. There are now 3 classes: `Config` - holds all common options; `EncoderConfig` - inherits `Config` and...
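The refactor described might be sketched roughly as below with dataclasses; the field names and defaults here are illustrative placeholders, not the actual torchscale options:

```python
from dataclasses import dataclass

@dataclass
class Config:
    # Options common to encoder and decoder (illustrative names).
    embed_dim: int = 768
    layers: int = 12
    dropout: float = 0.1

@dataclass
class EncoderConfig(Config):
    # Encoder-specific options layered on top of the shared base.
    encoder_normalize_before: bool = True

# Overrides pass through the inherited fields unchanged.
cfg = EncoderConfig(embed_dim=1024)
```

Shared options then live in exactly one place, and each subclass only declares what differs.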
Five classes in the codebase explicitly inherit from `object`. I am guessing this was an oversight.
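For context, in Python 3 every class implicitly inherits from `object`, so spelling it out is a redundant Python 2 holdover; the two spellings produce identical classes:

```python
class WithObject(object):
    pass

class Without:
    pass

# Both method resolution orders end at object either way.
assert WithObject.__mro__[-1] is object
assert Without.__mro__[-1] is object
```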
This may be a naive question, but how can I verify for myself that I can process a huge attention window using torchscale? Ideally,...
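One reason this is worth checking empirically: with standard attention, the score matrix alone grows quadratically with the window size, which is exactly what architectures like RetNet aim to avoid. A back-of-the-envelope helper (hypothetical, not part of torchscale) makes the scaling concrete:

```python
def attn_matrix_bytes(batch: int, heads: int, seq_len: int, dtype_bytes: int = 2) -> int:
    # Memory for the full seq_len x seq_len attention score matrix
    # (per forward pass, fp16 by default) - the term that dominates
    # at long context lengths in standard multi-head attention.
    return batch * heads * seq_len * seq_len * dtype_bytes

# Doubling the window quadruples the score-matrix memory.
assert attn_matrix_bytes(1, 16, 8192) == 4 * attn_matrix_bytes(1, 16, 4096)

print(attn_matrix_bytes(1, 16, 65536) / 2**30, "GiB")  # 128.0 GiB at a 64k window
```

So a simple sanity test is to run a forward pass at increasing sequence lengths and watch whether peak memory grows quadratically (standard attention) or roughly linearly.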
This pull request adds support for the Flash Attention mechanism to the MultiheadAttention module. Flash Attention is a recently proposed alternative to the conventional multi-head attention mechanism which reduces memory...
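The core trick behind Flash Attention — computing exact attention over key/value blocks with an online softmax so the full n×n score matrix is never materialized — can be sketched as below. This is an illustrative NumPy reference of the algorithm, not the fused CUDA kernel the PR would actually use:

```python
import numpy as np

def naive_attention(q, k, v):
    # Materializes the full (n, n) score matrix: O(n^2) memory.
    s = q @ k.T / np.sqrt(q.shape[-1])
    p = np.exp(s - s.max(axis=-1, keepdims=True))
    p /= p.sum(axis=-1, keepdims=True)
    return p @ v

def blocked_attention(q, k, v, block=16):
    # Flash-Attention-style streaming: visit keys/values one block at a
    # time, carrying a running row-max (m), normalizer (l), and output
    # accumulator (acc), so only an (n, block) score tile ever exists.
    n, d = q.shape
    m = np.full(n, -np.inf)
    l = np.zeros(n)
    acc = np.zeros((n, d))
    for start in range(0, k.shape[0], block):
        kb, vb = k[start:start + block], v[start:start + block]
        s = q @ kb.T / np.sqrt(d)              # (n, block) tile only
        m_new = np.maximum(m, s.max(axis=1))
        scale = np.exp(m - m_new)              # rescale previous partials
        p = np.exp(s - m_new[:, None])
        acc = acc * scale[:, None] + p @ vb
        l = l * scale + p.sum(axis=1)
        m = m_new
    return acc / l[:, None]

rng = np.random.default_rng(0)
q, k, v = (rng.normal(size=(64, 32)) for _ in range(3))
assert np.allclose(naive_attention(q, k, v), blocked_attention(q, k, v))
```

The two functions return the same result; the blocked version just trades the quadratic score matrix for per-row running statistics, which is what makes the memory savings possible.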
Thanks for your excellent work! I have noticed that torchscale serially executes the operations mapping x to q, k, and v, in lines 84-86 of torchscale/component/multihead_attention.py. Will this...
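For illustration, the three serial projections could be fused into a single matmul against a concatenated weight and then split, which is mathematically identical. A NumPy sketch (shapes and names are illustrative, not the module's actual code):

```python
import numpy as np

rng = np.random.default_rng(0)
d = 8
x = rng.normal(size=(4, d))
wq, wk, wv = (rng.normal(size=(d, d)) for _ in range(3))

# Serial: three separate projections, one per output.
q, k, v = x @ wq, x @ wk, x @ wv

# Fused: one matmul against the concatenated weight, then split.
w_qkv = np.concatenate([wq, wk, wv], axis=1)   # (d, 3d)
qf, kf, vf = np.split(x @ w_qkv, 3, axis=1)

assert np.allclose(q, qf) and np.allclose(k, kf) and np.allclose(v, vf)
```

The fused form launches one larger kernel instead of three, which is typically where any speedup would come from.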