Added mamba.py backend
What does this PR do?
As discussed here, this PR ports the mamba.py backend to transformers. The mamba.py backend is used as a fallback for training when the official CUDA implementation is not found.
Currently, a naive implementation is used (a for loop over time). mamba.py uses a parallel scan (just like the official implementation), which processes all inputs in parallel (technically, in log2(L) steps, given enough parallel cores).
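To illustrate the idea, here is a minimal sketch of how a parallel prefix scan resolves the linear recurrence h_t = a_t * h_{t-1} + b_t in ceil(log2(L)) combine steps. This is not the actual `pscan.py` implementation (which uses a Blelloch-style scan and handles batched, multi-dimensional states); it uses the simpler Hillis–Steele scheme purely for exposition:

```python
import torch

def naive_scan(a, b):
    # Sequential recurrence h_t = a_t * h_{t-1} + b_t (with h_0 = 0),
    # one step per time index: O(L) sequential steps.
    h = torch.zeros_like(b[0])
    out = []
    for t in range(a.shape[0]):
        h = a[t] * h + b[t]
        out.append(h)
    return torch.stack(out)

def parallel_scan(a, b):
    # Hillis-Steele parallel prefix scan over the same recurrence.
    # Two recurrence segments compose as (a', b') o (a, b) = (a * a', a' * b + b'),
    # so each step folds in the segment `offset` positions earlier;
    # the scan finishes after ceil(log2(L)) such steps.
    L = a.shape[0]
    offset = 1
    while offset < L:
        # Both updates read the pre-step tensors, so order matters:
        # update b first (it needs the old a), then a.
        b = torch.cat([b[:offset], b[offset:] + a[offset:] * b[:-offset]])
        a = torch.cat([a[:offset], a[offset:] * a[:-offset]])
        offset *= 2
    return b
```

Each `while` iteration doubles the prefix length already folded into every position, which is where the log2(L) step count comes from; with enough parallel cores, every step runs in constant time.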
Here is the performance comparison of the three versions:
Please see the mamba.py repo for a detailed analysis of the performance boost of this version.
Two things to note:
- This does not modify the behavior at inference, since the parallel scan is only useful at training, where you can process all your inputs at once.
- I've kept the possibility to use the naive version via the `use_mambapy` argument in `MambaConfig` (defaults to `True`). I have done so because the PyTorch parallel scan implemented in `pscan.py` is quite memory-hungry, so users with low memory capacity may want to stick with the naive version (although it is much slower).
You can check that the two versions (naive and mamba.py) indeed give the same numerical output for both the forward and backward pass with this code snippet:
```python
import torch
from src.transformers.models.mamba import MambaForCausalLM, MambaConfig

torch.manual_seed(34567)
config = MambaConfig(vocab_size=60, hidden_size=64, num_hidden_layers=4, use_mambapy=False)
model_orig = MambaForCausalLM(config)

torch.manual_seed(34567)
config = MambaConfig(vocab_size=60, hidden_size=64, num_hidden_layers=4, use_mambapy=True)
model = MambaForCausalLM(config)

x = torch.randint(low=0, high=config.vocab_size - 1, size=(16, 120))

y_orig = model_orig(x)
J_orig = y_orig.logits.sum()
J_orig.backward()

y = model(x)
J = y.logits.sum()
J.backward()

print(f"Is the forward pass the same ? {torch.allclose(y_orig.logits, y.logits, atol=0.01)} (difference: {torch.norm(y_orig.logits - y.logits)})")

gradients_same = True
max_diff = 0
for param1, param2 in zip(model_orig.parameters(), model.parameters()):
    diff = torch.norm(param1.grad - param2.grad)
    if diff > max_diff:
        max_diff = diff
    if not torch.allclose(param1.grad, param2.grad, atol=0.01):
        gradients_same = False
        break
print(f"Is the backward pass the same ? {gradients_same} (max difference: {max_diff})")
```
This is the output I got:

```
The fast path is not available because one of `(selective_state_update, selective_scan_fn, causal_conv1d_fn, causal_conv1d_update, mamba_inner_fn)` is None. Falling back to the implementation determined by the argument config `use_mambapy` for training. To install follow https://github.com/state-spaces/mamba/#installation and https://github.com/Dao-AILab/causal-conv1d
Is the forward pass the same ? True (difference: 5.619883449980989e-05)
Is the backward pass the same ? True (max difference: 0.0024947605561465025)
```
Hope my PR is clear. Best,
Before submitting
- [ ] This PR fixes a typo or improves the docs (you can dismiss the other checks if that's the case).
- [x] Did you read the contributor guideline, Pull Request section?
- [x] Was this discussed/approved via a Github issue or the forum? Please add a link to it if that's the case.
- [ ] Did you make sure to update the documentation with your changes? Here are the documentation guidelines, and here are tips on formatting docstrings.
- [ ] Did you write any new necessary tests?
Who can review?
@ArthurZucker
Anyone in the community is free to review the PR once the tests have passed. Feel free to tag members/contributors who may be interested in your PR.