Added mamba.py backend
What does this PR do?
As discussed here, this PR ports the mamba.py backend to transformers. The mamba.py backend is used as a fallback for training when the official CUDA implementation is not found.
Currently, a naive implementation is used (a for loop over time). mamba.py uses a parallel scan (just like the official implementation), which processes all inputs in parallel (technically, in log2(L) steps, given enough parallel cores).
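To illustrate the idea, here is a minimal sketch of how a parallel prefix scan resolves the linear recurrence h_t = a_t * h_{t-1} + b_t in ceil(log2(L)) combine steps. This is not the actual `pscan.py` implementation (which uses a Blelloch-style scan and handles batched, multi-dimensional states); it uses the simpler Hillis–Steele scheme purely for exposition:

```python
import torch

def naive_scan(a, b):
    # Sequential recurrence h_t = a_t * h_{t-1} + b_t (with h_0 = 0),
    # one step per time index: O(L) sequential steps.
    h = torch.zeros_like(b[0])
    out = []
    for t in range(a.shape[0]):
        h = a[t] * h + b[t]
        out.append(h)
    return torch.stack(out)

def parallel_scan(a, b):
    # Hillis-Steele parallel prefix scan over the same recurrence.
    # Two recurrence segments compose as (a', b') o (a, b) = (a * a', a' * b + b'),
    # so each step folds in the segment `offset` positions earlier;
    # the scan finishes after ceil(log2(L)) such steps.
    L = a.shape[0]
    offset = 1
    while offset < L:
        # Both updates read the pre-step tensors, so order matters:
        # update b first (it needs the old a), then a.
        b = torch.cat([b[:offset], b[offset:] + a[offset:] * b[:-offset]])
        a = torch.cat([a[:offset], a[offset:] * a[:-offset]])
        offset *= 2
    return b
```

Each `while` iteration doubles the prefix length already folded into every position, which is where the log2(L) step count comes from; with enough parallel cores, every step runs in constant time.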
Here is the performance comparison of the three versions:
Please see the mamba.py repo for a detailed analysis of the performance boost of this version.
Two things to note:
- This does not modify the behavior at inference, since the parallel scan is only useful at training, where you can process all your inputs at once.
- I've kept the possibility to use the naive version via the `use_mambapy` argument in `MambaConfig` (defaults to `True`). I have done so because the PyTorch parallel scan implemented in `pscan.py` is quite memory-hungry, so users with low memory capacity may want to stick with the naive version (although it is much slower).
You can check that the two versions (naive and mamba.py) indeed give the same numerical output for both the forward and backward pass with this code snippet:
```python
import torch
from src.transformers.models.mamba import MambaForCausalLM, MambaConfig

torch.manual_seed(34567)
config = MambaConfig(vocab_size=60, hidden_size=64, num_hidden_layers=4, use_mambapy=False)
model_orig = MambaForCausalLM(config)

torch.manual_seed(34567)
config = MambaConfig(vocab_size=60, hidden_size=64, num_hidden_layers=4, use_mambapy=True)
model = MambaForCausalLM(config)

x = torch.randint(low=0, high=config.vocab_size - 1, size=(16, 120))

y_orig = model_orig(x)
J_orig = y_orig.logits.sum()
J_orig.backward()

y = model(x)
J = y.logits.sum()
J.backward()

print(f"Is the forward pass the same ? {torch.allclose(y_orig.logits, y.logits, atol=0.01)} (difference: {torch.norm(y_orig.logits - y.logits)})")

gradients_same = True
max_diff = 0
for param1, param2 in zip(model_orig.parameters(), model.parameters()):
    diff = torch.norm(param1.grad - param2.grad)
    if diff > max_diff:
        max_diff = diff
    if not torch.allclose(param1.grad, param2.grad, atol=0.01):
        gradients_same = False
        break
print(f"Is the backward pass the same ? {gradients_same} (max difference: {max_diff})")
```
This is the output I got:

```
The fast path is not available because one of `(selective_state_update, selective_scan_fn, causal_conv1d_fn, causal_conv1d_update, mamba_inner_fn)` is None. Falling back to the implementation determined by the argument config `use_mambapy` for training. To install follow https://github.com/state-spaces/mamba/#installation and https://github.com/Dao-AILab/causal-conv1d
Is the forward pass the same ? True (difference: 5.619883449980989e-05)
Is the backward pass the same ? True (max difference: 0.0024947605561465025)
```
Hope my PR is clear. Best,
Before submitting
- [ ] This PR fixes a typo or improves the docs (you can dismiss the other checks if that's the case).
- [x] Did you read the contributor guideline, Pull Request section?
- [x] Was this discussed/approved via a Github issue or the forum? Please add a link to it if that's the case.
- [ ] Did you make sure to update the documentation with your changes? Here are the documentation guidelines, and here are tips on formatting docstrings.
- [ ] Did you write any new necessary tests?
Who can review?
@ArthurZucker
Anyone in the community is free to review the PR once the tests have passed. Feel free to tag members/contributors who may be interested in your PR.