
Question Regarding Randomness

Open wyhsleep opened this issue 1 year ago • 24 comments

Hello,

Thanks for your interesting work, but I have a question about the code that I'd like to discuss with you. Despite fixing all the random seeds, I'm still observing randomness in the results of my runs. I also set `torch.use_deterministic_algorithms(True)` and ran the example code as follows:

    import torch
    from mamba_ssm import Mamba

    torch.use_deterministic_algorithms(True)

    batch, length, dim = 2, 64, 16
    x = torch.randn(batch, length, dim).to("cuda")
    model = Mamba(
        # This module uses roughly 3 * expand * d_model^2 parameters
        d_model=dim,  # Model dimension d_model
        d_state=16,   # SSM state expansion factor
        d_conv=4,     # Local convolution width
        expand=2,     # Block expansion factor
    ).to("cuda")
    y = model(x)
    assert y.shape == x.shape

I got the error message:

    Traceback (most recent call last):
      File "test.py", line 14, in <module>
        y = model(x)
      File "/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1501, in _call_impl
        return forward_call(*args, **kwargs)
      File "//lib/python3.10/site-packages/mamba_ssm/modules/mamba_simple.py", line 136, in forward
        self.in_proj.weight @ rearrange(hidden_states, "b l d -> d (b l)"),
    RuntimeError: Deterministic behavior was enabled with either `torch.use_deterministic_algorithms(True)` or `at::Context::setDeterministicAlgorithms(true)`, but this operation is not deterministic because it uses CuBLAS and you have CUDA >= 10.2. To enable deterministic behavior in this case, you must set an environment variable before running your PyTorch application: CUBLAS_WORKSPACE_CONFIG=:4096:8 or CUBLAS_WORKSPACE_CONFIG=:16:8. For more information, go to https://docs.nvidia.com/cuda/cublas/index.html#cublasApi_reproducibility

It seems that there is some randomness in the Mamba module. Did you encounter this before? Thank you for your help, and thank you again for your great work!

wyhsleep avatar Jan 29 '24 08:01 wyhsleep

Did you follow the suggestion in the error message?

tridao avatar Jan 29 '24 08:01 tridao
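
For reference, the `CUBLAS_WORKSPACE_CONFIG` variable from the error message has to be in the environment before the first cuBLAS call, so it can either be exported in the shell or set at the very top of the script. A minimal sketch of the posted example with that change (the seed is added only for illustration):

    import os
    os.environ["CUBLAS_WORKSPACE_CONFIG"] = ":4096:8"  # ":16:8" also works; must precede any cuBLAS work

    import torch
    from mamba_ssm import Mamba

    torch.manual_seed(0)
    torch.use_deterministic_algorithms(True)

    batch, length, dim = 2, 64, 16
    x = torch.randn(batch, length, dim).to("cuda")
    model = Mamba(d_model=dim, d_state=16, d_conv=4, expand=2).to("cuda")
    y = model(x)
    assert y.shape == x.shape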

Yes, but the randomness still remains

wyhsleep avatar Jan 29 '24 08:01 wyhsleep

I'm not sure where the randomness is from. Can you comment out lines in the Mamba implementation to isolate?

tridao avatar Jan 29 '24 08:01 tridao

Hi, I tried to analyze where the randomness comes from. I found that during training, when running the same model under the same settings, the last few digits of the loss start to differ after the first iteration. However, if I remove the Mamba module from our model, the loss becomes reproducible again. We are wondering if this is related to computational precision. We use 32-bit float precision for all calculations, the built-in cross-entropy loss from torch, and optim.Adam as the optimizer.

wyhsleep avatar Jan 31 '24 06:01 wyhsleep

The backward pass is not deterministic due to atomic adds. The forward pass should be deterministic.

tridao avatar Jan 31 '24 06:01 tridao
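
A quick way to see this split is to run the same input through the model twice and compare both the outputs and the gradients; a rough sketch, reusing the dimensions from the example above:

    import torch
    from mamba_ssm import Mamba

    torch.manual_seed(0)
    model = Mamba(d_model=16, d_state=16, d_conv=4, expand=2).to("cuda")
    x = torch.randn(2, 64, 16, device="cuda")

    # Forward pass: repeated calls on the same input should match bit for bit.
    with torch.no_grad():
        print(torch.equal(model(x), model(x)))  # expected: True

    def grads_once():
        model.zero_grad()
        model(x).sum().backward()
        return torch.cat([p.grad.flatten().clone()
                          for p in model.parameters() if p.grad is not None])

    # Backward pass: atomic adds accumulate partial gradients in no fixed order,
    # so the last few bits can differ between otherwise identical runs.
    g1, g2 = grads_once(), grads_once()
    print(torch.equal(g1, g2), (g1 - g2).abs().max().item())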

Oh, noted with thanks. So the randomness here is normal, right?

wyhsleep avatar Jan 31 '24 06:01 wyhsleep

Normal if you're training the model, not normal if you're only doing inference (forward pass only).

tridao avatar Jan 31 '24 06:01 tridao

Oh, great thanks!!!

wyhsleep avatar Jan 31 '24 06:01 wyhsleep

> Oh, great thanks!!!

There's a specific call in torch for CUDA's deterministic processing, and setting a seed for the randomness is also important. This should make backpropagation more repeatable during training, which is useful when comparing hyperparameter or training changes. Oddly, I also find that using deterministic processing for backpropagation changes the convergence behaviour of a model, increasing the rate of convergence, but that may just be my specific model.

Using RAdam instead of Adam also provides better repeatability, due to its controlled warm-up (given that Adam internally adapts per-parameter step sizes, the learning rate we set for Adam is essentially a global learning rate).

ElliottDyson avatar Feb 02 '24 22:02 ElliottDyson
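
For completeness, the determinism call and seeding mentioned above usually look roughly like this (the helper name is illustrative; the cuBLAS variable is the one from the error message earlier in the thread):

    import os, random
    import numpy as np
    import torch

    # Must be in the environment before the first cuBLAS call.
    os.environ["CUBLAS_WORKSPACE_CONFIG"] = ":4096:8"

    def seed_everything(seed=0):
        random.seed(seed)
        np.random.seed(seed)
        torch.manual_seed(seed)                   # seeds CPU and all CUDA devices
        torch.backends.cudnn.deterministic = True
        torch.backends.cudnn.benchmark = False
        torch.use_deterministic_algorithms(True)  # raise on known-nondeterministic ops

    seed_everything(42)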

> The backward pass is not deterministic due to atomic adds. The forward pass should be deterministic.

Hi, I am wondering if there is a way to fix the behaviour of atomic adds? Looking forward to your reply!

Charlie839242 avatar Mar 03 '24 12:03 Charlie839242

> The backward pass is not deterministic due to atomic adds. The forward pass should be deterministic.
>
> Hi, I am wondering if there is a way to fix the behaviour of atomic adds? Looking forward to your reply!

One would have to change the backward pass implementation to not use atomic adds. I personally don't have bandwidth for this but we welcome contributions.

tridao avatar Mar 03 '24 18:03 tridao
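
For anyone wondering why atomic adds cause this at all: the atomics don't fix the order in which partial gradients are summed, and floating-point addition is not associative, so a different ordering can change the last few bits of the result. A deliberately exaggerated illustration:

    # Floating-point addition is not associative: grouping (i.e. ordering) matters.
    a, b, c = 0.1, 1e20, -1e20
    print((a + b) + c)  # 0.0 -- the 0.1 is absorbed when added to 1e20 first
    print(a + (b + c))  # 0.1
    # In a gradient kernel the differences are tiny, but enough to make two runs
    # of the backward pass disagree in the last digits of the loss.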

> The backward pass is not deterministic due to atomic adds. The forward pass should be deterministic.
>
> Hi, I am wondering if there is a way to fix the behaviour of atomic adds? Looking forward to your reply!
>
> One would have to change the backward pass implementation to not use atomic adds. I personally don't have bandwidth for this but we welcome contributions.

Thanks for your reply :)

Charlie839242 avatar Mar 04 '24 06:03 Charlie839242

Even for the forward pass, I noticed that results are somewhat unstable in my experiments. Given two inputs x1 and x2, the result of model(torch.stack([x1, x2])) (i.e. batching) differs from torch.stack([model(x1), model(x2)]), especially if I use fp16 or bf16 (the gap is very small if I use fp32). Is this also expected behavior?

sangkeun00 avatar Mar 12 '24 02:03 sangkeun00
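
Batched and per-sample execution can take different kernel/tiling paths, so bit-exact equality between the two is not guaranteed even when each path is individually deterministic; the gap is typically on the order of the dtype's rounding error, which is why it looks much larger in fp16/bf16 than in fp32. A sketch of the comparison (dimensions reused from the earlier example, dtype left as a knob):

    import torch
    from mamba_ssm import Mamba

    dtype = torch.float32  # switch to torch.float16 / torch.bfloat16 to see a larger gap
    torch.manual_seed(0)
    model = Mamba(d_model=16, d_state=16, d_conv=4, expand=2).to("cuda").to(dtype)
    x1 = torch.randn(64, 16, device="cuda", dtype=dtype)
    x2 = torch.randn(64, 16, device="cuda", dtype=dtype)

    with torch.no_grad():
        batched = model(torch.stack([x1, x2]))                            # (2, 64, 16)
        separate = torch.stack([model(x1[None])[0], model(x2[None])[0]])  # one sample at a time

    print((batched - separate).abs().max().item())  # small but often nonzero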

> Even for the forward pass, I noticed that results are somewhat unstable in my experiments. Given two inputs x1 and x2, the result of model(torch.stack([x1, x2])) (i.e. batching) differs from torch.stack([model(x1), model(x2)]), especially if I use fp16 or bf16 (the gap is very small if I use fp32). Is this also expected behavior?

Can you isolate which layer or function first produces different outputs?

tridao avatar Mar 12 '24 02:03 tridao
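
One way to do that isolation is to register forward hooks on every submodule, record the outputs of two runs, and report the first module whose outputs disagree; a rough sketch (the helper names are made up, and module registration order is only a proxy for execution order):

    import torch

    def capture_outputs(model, x):
        """Run model(x) once and record each submodule's first output via forward hooks."""
        outputs, handles = {}, []
        for name, mod in model.named_modules():
            handles.append(mod.register_forward_hook(
                lambda m, inp, out, name=name: outputs.setdefault(name, out)))
        with torch.no_grad():
            model(x)
        for h in handles:
            h.remove()
        return outputs

    def first_divergence(out_a, out_b, rtol=0.0, atol=0.0):
        for name, a in out_a.items():
            b = out_b.get(name)
            if torch.is_tensor(a) and torch.is_tensor(b) and not torch.allclose(a, b, rtol=rtol, atol=atol):
                return name, (a - b).abs().max().item()
        return None

    # Usage: compare two runs on the same input and print the first submodule
    # whose recorded outputs differ, e.g.
    # print(first_divergence(capture_outputs(model, x), capture_outputs(model, x)))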

I also found that Mamba introduces randomness during forward propagation, which greatly affects model convergence.

mhy9989 avatar May 14 '24 14:05 mhy9989

> I also found that Mamba introduces randomness during forward propagation, which greatly affects model convergence.

Can you isolate which layer or function first produces different outputs?

tridao avatar May 14 '24 21:05 tridao

When I use `torch.use_deterministic_algorithms(True)`, I get the error below; adding `CUBLAS_WORKSPACE_CONFIG=:4096:8` doesn't help. I hope this helps with the issue.

File "/export/scratch/ra63nev/lab/zigma/dis_mamba/mamba_ssm/modules/mamba_simple.py", line 295, in _mamba_inner_forward
    self.in_proj.weight @ rearrange(hidden_states, "b l d -> d (b l)"),
    ~~~~~~~~~~~~~~~~~~~~^~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
RuntimeError: Deterministic behavior was enabled with either `torch.use_deterministic_algorithms(True)` or `at::Context::setDeterministicAlgorithms(true)`, but this operation is not deterministic because it uses CuBLAS and you have CUDA >= 10.2. To enable deterministic behavior in this case, you must set an environment variable before running your PyTorch application: CUBLAS_WORKSPACE_CONFIG=:4096:8 or CUBLAS_WORKSPACE_CONFIG=:16:8. For more information, go to https://docs.nvidia.com/cuda/cublas/index.html#cublasApi_reproducibility

dongzhuoyao avatar May 27 '24 12:05 dongzhuoyao

> When I use `torch.use_deterministic_algorithms(True)`, I get the error below; adding `CUBLAS_WORKSPACE_CONFIG=:4096:8` doesn't help. I hope this helps with the issue.
>
>     File "/export/scratch/ra63nev/lab/zigma/dis_mamba/mamba_ssm/modules/mamba_simple.py", line 295, in _mamba_inner_forward
>         self.in_proj.weight @ rearrange(hidden_states, "b l d -> d (b l)"),
>         ~~~~~~~~~~~~~~~~~~~~^~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
>     RuntimeError: Deterministic behavior was enabled with either `torch.use_deterministic_algorithms(True)` or `at::Context::setDeterministicAlgorithms(true)`, but this operation is not deterministic because it uses CuBLAS and you have CUDA >= 10.2. To enable deterministic behavior in this case, you must set an environment variable before running your PyTorch application: CUBLAS_WORKSPACE_CONFIG=:4096:8 or CUBLAS_WORKSPACE_CONFIG=:16:8. For more information, go to https://docs.nvidia.com/cuda/cublas/index.html#cublasApi_reproducibility

Did you also try the other environment variable it mentioned, CUBLAS_WORKSPACE_CONFIG=:16:8? Sorry if it seems I'm pointing out the obvious; it's just that you may have accidentally missed it.

ElliottDyson avatar May 28 '24 12:05 ElliottDyson

Just curious, has anyone found the root cause of the randomness in the inference run? I am generating an ONNX model and trying to compare the outputs from PyTorch vs ONNX, and with this randomness the comparison is difficult.

xiaoliangbai avatar Jun 25 '24 21:06 xiaoliangbai

I've also noticed this issue, but my forward pass outputs are identical across runs; the inconsistencies appear after one iteration of backward propagation. In addition, I've found that whenever I set num_workers > 0 in DataLoader, I encounter the error "DataLoader worker (pid(s) 15804) exited unexpectedly", so I can only set num_workers to 0.

This is very troubling: I can't use num_workers > 0 to speed up training, and I also can't tune parameters reliably due to the inherent randomness of Mamba.

GuHY777 avatar Jul 18 '24 03:07 GuHY777
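
The DataLoader side is separate from the Mamba kernels, but for reproducible shuffling and augmentation with num_workers > 0, the usual pattern from the PyTorch reproducibility docs is to seed each worker and pass an explicit generator; a self-contained sketch with a dummy dataset:

    import random
    import numpy as np
    import torch
    from torch.utils.data import DataLoader, TensorDataset

    def seed_worker(worker_id):
        # Each worker process derives its RNG state from the base seed.
        worker_seed = torch.initial_seed() % 2**32
        random.seed(worker_seed)
        np.random.seed(worker_seed)

    g = torch.Generator()
    g.manual_seed(0)

    dataset = TensorDataset(torch.randn(128, 64, 16))  # dummy data, just for the sketch
    loader = DataLoader(dataset, batch_size=8, shuffle=True,
                        num_workers=4, worker_init_fn=seed_worker, generator=g)

This keeps the data order repeatable across runs; it does not, of course, remove the atomic-add nondeterminism in the backward pass discussed above.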

I found that setting a larger value for num_workers eliminates the issue of DataLoader worker (pid(s) 15804) exiting unexpectedly, which is similar to setting num_workers to 25 in https://github.com/hustvl/Vim/blob/main/vim/scripts/pt-vim-t.sh.

GuHY777 avatar Jul 22 '24 12:07 GuHY777