
Question Regarding Randomness

Open wyhsleep opened this issue 1 year ago • 24 comments

Hello,

Thanks for your interesting work, but I have a question about the code that I'd like to discuss with you. Despite fixing all the random seeds, I'm still observing randomness in the results of my runs. I also set `torch.use_deterministic_algorithms(True)` and ran the example code as follows:

    import torch
    from mamba_ssm import Mamba

    torch.use_deterministic_algorithms(True)

    batch, length, dim = 2, 64, 16
    x = torch.randn(batch, length, dim).to("cuda")
    model = Mamba(
        # This module uses roughly 3 * expand * d_model^2 parameters
        d_model=dim,  # Model dimension d_model
        d_state=16,   # SSM state expansion factor
        d_conv=4,     # Local convolution width
        expand=2,     # Block expansion factor
    ).to("cuda")
    y = model(x)
    assert y.shape == x.shape

I got the error message:

    Traceback (most recent call last):
      File "test.py", line 14, in <module>
        y = model(x)
      File "/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1501, in _call_impl
        return forward_call(*args, **kwargs)
      File "//lib/python3.10/site-packages/mamba_ssm/modules/mamba_simple.py", line 136, in forward
        self.in_proj.weight @ rearrange(hidden_states, "b l d -> d (b l)"),
    RuntimeError: Deterministic behavior was enabled with either `torch.use_deterministic_algorithms(True)` or `at::Context::setDeterministicAlgorithms(true)`, but this operation is not deterministic because it uses CuBLAS and you have CUDA >= 10.2. To enable deterministic behavior in this case, you must set an environment variable before running your PyTorch application: CUBLAS_WORKSPACE_CONFIG=:4096:8 or CUBLAS_WORKSPACE_CONFIG=:16:8. For more information, go to https://docs.nvidia.com/cuda/cublas/index.html#cublasApi_reproducibility

It seems that there is some randomness in the Mamba module. Did you encounter this before? Thank you for your help, and thank you again for your great work!

wyhsleep avatar Jan 29 '24 08:01 wyhsleep

Did you follow the suggestion in the error message?

tridao avatar Jan 29 '24 08:01 tridao
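
For reference, the `CUBLAS_WORKSPACE_CONFIG` variable from the error message has to be in the environment before the first cuBLAS call, so it can either be exported in the shell or set at the very top of the script. A minimal sketch of the posted example with that change (the seed is added only for illustration):

    import os
    os.environ["CUBLAS_WORKSPACE_CONFIG"] = ":4096:8"  # ":16:8" also works; must precede any cuBLAS work

    import torch
    from mamba_ssm import Mamba

    torch.manual_seed(0)
    torch.use_deterministic_algorithms(True)

    batch, length, dim = 2, 64, 16
    x = torch.randn(batch, length, dim).to("cuda")
    model = Mamba(d_model=dim, d_state=16, d_conv=4, expand=2).to("cuda")
    y = model(x)
    assert y.shape == x.shape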

Yes, but the randomness still remains

wyhsleep avatar Jan 29 '24 08:01 wyhsleep

I'm not sure where the randomness is from. Can you comment out lines in the Mamba implementation to isolate?

tridao avatar Jan 29 '24 08:01 tridao

Hi, I tried to analyze where the randomness comes from. I found that during training, when running the same model under the same settings, the last few digits of the loss start to differ after the first iteration. However, if I remove the Mamba module from our model, the loss becomes reproducible again. We are wondering if this is related to computational precision. We use 32-bit float precision for all calculations, the built-in cross-entropy loss from torch, and optim.Adam as the optimizer.

wyhsleep avatar Jan 31 '24 06:01 wyhsleep

The backward pass is not deterministic due to atomic adds. The forward pass should be deterministic.

tridao avatar Jan 31 '24 06:01 tridao
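
A quick way to see this split is to run the same input through the model twice and compare both the outputs and the gradients; a rough sketch, reusing the dimensions from the example above:

    import torch
    from mamba_ssm import Mamba

    torch.manual_seed(0)
    model = Mamba(d_model=16, d_state=16, d_conv=4, expand=2).to("cuda")
    x = torch.randn(2, 64, 16, device="cuda")

    # Forward pass: repeated calls on the same input should match bit for bit.
    with torch.no_grad():
        print(torch.equal(model(x), model(x)))  # expected: True

    def grads_once():
        model.zero_grad()
        model(x).sum().backward()
        return torch.cat([p.grad.flatten().clone()
                          for p in model.parameters() if p.grad is not None])

    # Backward pass: atomic adds accumulate partial gradients in no fixed order,
    # so the last few bits can differ between otherwise identical runs.
    g1, g2 = grads_once(), grads_once()
    print(torch.equal(g1, g2), (g1 - g2).abs().max().item())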

Oh, noted with thanks. So the randomness here is normal, right?

wyhsleep avatar Jan 31 '24 06:01 wyhsleep

Normal if you're training the model, not normal if you're only doing inference (forward pass only).

tridao avatar Jan 31 '24 06:01 tridao

Oh, great thanks!!!

wyhsleep avatar Jan 31 '24 06:01 wyhsleep

> Oh, great thanks!!!

There's a specific call in torch for CUDA's deterministic processing, and setting a seed for the randomness is also important. This should make backpropagation more repeatable during training, which is useful when comparing hyperparameter or training changes. Oddly, I also find that using deterministic processing for backpropagation changes the convergence behaviour of a model, increasing the rate of convergence, but that may just be my specific model.

Using RAdam instead of Adam also provides better repeatability, due to its controlled warm-up (given that Adam internally adapts per-parameter step sizes, the learning rate we set for Adam is essentially a global learning rate).

ElliottDyson avatar Feb 02 '24 22:02 ElliottDyson
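
For completeness, the determinism call and seeding mentioned above usually look roughly like this (the helper name is illustrative; the cuBLAS variable is the one from the error message earlier in the thread):

    import os, random
    import numpy as np
    import torch

    # Must be in the environment before the first cuBLAS call.
    os.environ["CUBLAS_WORKSPACE_CONFIG"] = ":4096:8"

    def seed_everything(seed=0):
        random.seed(seed)
        np.random.seed(seed)
        torch.manual_seed(seed)                   # seeds CPU and all CUDA devices
        torch.backends.cudnn.deterministic = True
        torch.backends.cudnn.benchmark = False
        torch.use_deterministic_algorithms(True)  # raise on known-nondeterministic ops

    seed_everything(42)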

> The backward pass is not deterministic due to atomic adds. The forward pass should be deterministic.

Hi, I am wondering if there is a way to fix the behaviour of atomic adds? Looking forward to your reply!

Charlie839242 avatar Mar 03 '24 12:03 Charlie839242

> The backward pass is not deterministic due to atomic adds. The forward pass should be deterministic.
>
> Hi, I am wondering if there is a way to fix the behaviour of atomic adds? Looking forward to your reply!

One would have to change the backward pass implementation to not use atomic adds. I personally don't have bandwidth for this but we welcome contributions.

tridao avatar Mar 03 '24 18:03 tridao
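
For anyone wondering why atomic adds cause this at all: the atomics don't fix the order in which partial gradients are summed, and floating-point addition is not associative, so a different ordering can change the last few bits of the result. A deliberately exaggerated illustration:

    # Floating-point addition is not associative: grouping (i.e. ordering) matters.
    a, b, c = 0.1, 1e20, -1e20
    print((a + b) + c)  # 0.0 -- the 0.1 is absorbed when added to 1e20 first
    print(a + (b + c))  # 0.1
    # In a gradient kernel the differences are tiny, but enough to make two runs
    # of the backward pass disagree in the last digits of the loss.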

> The backward pass is not deterministic due to atomic adds. The forward pass should be deterministic.
>
> Hi, I am wondering if there is a way to fix the behaviour of atomic adds? Looking forward to your reply!
>
> One would have to change the backward pass implementation to not use atomic adds. I personally don't have bandwidth for this but we welcome contributions.

Thanks for your reply :)

Charlie839242 avatar Mar 04 '24 06:03 Charlie839242

Even for the forward pass, I noticed that results are somewhat unstable in my experiments. Given two inputs x1 and x2, the result of model(torch.stack([x1, x2])) (i.e. batching) differs from torch.stack([model(x1), model(x2)]), especially if I use fp16 or bf16 (the gap is very small if I use fp32). Is this also expected behavior?

sangkeun00 avatar Mar 12 '24 02:03 sangkeun00
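
Batched and per-sample execution can take different kernel/tiling paths, so bit-exact equality between the two is not guaranteed even when each path is individually deterministic; the gap is typically on the order of the dtype's rounding error, which is why it looks much larger in fp16/bf16 than in fp32. A sketch of the comparison (dimensions reused from the earlier example, dtype left as a knob):

    import torch
    from mamba_ssm import Mamba

    dtype = torch.float32  # switch to torch.float16 / torch.bfloat16 to see a larger gap
    torch.manual_seed(0)
    model = Mamba(d_model=16, d_state=16, d_conv=4, expand=2).to("cuda").to(dtype)
    x1 = torch.randn(64, 16, device="cuda", dtype=dtype)
    x2 = torch.randn(64, 16, device="cuda", dtype=dtype)

    with torch.no_grad():
        batched = model(torch.stack([x1, x2]))                            # (2, 64, 16)
        separate = torch.stack([model(x1[None])[0], model(x2[None])[0]])  # one sample at a time

    print((batched - separate).abs().max().item())  # small but often nonzero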

> Even for the forward pass, I noticed that results are somewhat unstable in my experiments. Given two inputs x1 and x2, the result of model(torch.stack([x1, x2])) (i.e. batching) differs from torch.stack([model(x1), model(x2)]), especially if I use fp16 or bf16 (the gap is very small if I use fp32). Is this also expected behavior?

Can you isolate which layer or function first produces different outputs?

tridao avatar Mar 12 '24 02:03 tridao
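
One way to do that isolation is to register forward hooks on every submodule, record the outputs of two runs, and report the first module whose outputs disagree; a rough sketch (the helper names are made up, and module registration order is only a proxy for execution order):

    import torch

    def capture_outputs(model, x):
        """Run model(x) once and record each submodule's first output via forward hooks."""
        outputs, handles = {}, []
        for name, mod in model.named_modules():
            handles.append(mod.register_forward_hook(
                lambda m, inp, out, name=name: outputs.setdefault(name, out)))
        with torch.no_grad():
            model(x)
        for h in handles:
            h.remove()
        return outputs

    def first_divergence(out_a, out_b, rtol=0.0, atol=0.0):
        for name, a in out_a.items():
            b = out_b.get(name)
            if torch.is_tensor(a) and torch.is_tensor(b) and not torch.allclose(a, b, rtol=rtol, atol=atol):
                return name, (a - b).abs().max().item()
        return None

    # Usage: compare two runs on the same input and print the first submodule
    # whose recorded outputs differ, e.g.
    # print(first_divergence(capture_outputs(model, x), capture_outputs(model, x)))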

I also found that Mamba introduces randomness during forward propagation, which greatly affects model convergence.

mhy9989 avatar May 14 '24 14:05 mhy9989

> I also found that Mamba introduces randomness during forward propagation, which greatly affects model convergence.

Can you isolate which layer or function first produces different outputs?

tridao avatar May 14 '24 21:05 tridao

When I use `torch.use_deterministic_algorithms(True)`, I get the error below; adding `CUBLAS_WORKSPACE_CONFIG=:4096:8` doesn't help. I hope this helps with the issue.

File "/export/scratch/ra63nev/lab/zigma/dis_mamba/mamba_ssm/modules/mamba_simple.py", line 295, in _mamba_inner_forward
    self.in_proj.weight @ rearrange(hidden_states, "b l d -> d (b l)"),
    ~~~~~~~~~~~~~~~~~~~~^~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
RuntimeError: Deterministic behavior was enabled with either `torch.use_deterministic_algorithms(True)` or `at::Context::setDeterministicAlgorithms(true)`, but this operation is not deterministic because it uses CuBLAS and you have CUDA >= 10.2. To enable deterministic behavior in this case, you must set an environment variable before running your PyTorch application: CUBLAS_WORKSPACE_CONFIG=:4096:8 or CUBLAS_WORKSPACE_CONFIG=:16:8. For more information, go to https://docs.nvidia.com/cuda/cublas/index.html#cublasApi_reproducibility

dongzhuoyao avatar May 27 '24 12:05 dongzhuoyao

> When I use `torch.use_deterministic_algorithms(True)`, I get the error below; adding `CUBLAS_WORKSPACE_CONFIG=:4096:8` doesn't help. I hope this helps with the issue.
>
>     File "/export/scratch/ra63nev/lab/zigma/dis_mamba/mamba_ssm/modules/mamba_simple.py", line 295, in _mamba_inner_forward
>         self.in_proj.weight @ rearrange(hidden_states, "b l d -> d (b l)"),
>         ~~~~~~~~~~~~~~~~~~~~^~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
>     RuntimeError: Deterministic behavior was enabled with either `torch.use_deterministic_algorithms(True)` or `at::Context::setDeterministicAlgorithms(true)`, but this operation is not deterministic because it uses CuBLAS and you have CUDA >= 10.2. To enable deterministic behavior in this case, you must set an environment variable before running your PyTorch application: CUBLAS_WORKSPACE_CONFIG=:4096:8 or CUBLAS_WORKSPACE_CONFIG=:16:8. For more information, go to https://docs.nvidia.com/cuda/cublas/index.html#cublasApi_reproducibility

Did you also try the other environment variable it mentioned, CUBLAS_WORKSPACE_CONFIG=:16:8? Sorry if it seems I'm pointing out the obvious; it's just that you may have accidentally missed it.

ElliottDyson avatar May 28 '24 12:05 ElliottDyson

Just curious, has anyone found the root cause of the randomness in the inference run? I am generating an ONNX model and trying to compare the outputs from PyTorch vs ONNX, and with this randomness the comparison is difficult.

xiaoliangbai avatar Jun 25 '24 21:06 xiaoliangbai

I've also noticed this issue, but my forward pass outputs are identical across runs; the inconsistencies appear after one iteration of backward propagation. In addition, I've found that whenever I set num_workers > 0 in DataLoader, I encounter the error "DataLoader worker (pid(s) 15804) exited unexpectedly", so I can only set num_workers to 0.

This is very troubling: I can't use num_workers > 0 to speed up training, and I also can't tune parameters reliably due to the inherent randomness of Mamba.

GuHY777 avatar Jul 18 '24 03:07 GuHY777
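
The DataLoader side is separate from the Mamba kernels, but for reproducible shuffling and augmentation with num_workers > 0, the usual pattern from the PyTorch reproducibility docs is to seed each worker and pass an explicit generator; a self-contained sketch with a dummy dataset:

    import random
    import numpy as np
    import torch
    from torch.utils.data import DataLoader, TensorDataset

    def seed_worker(worker_id):
        # Each worker process derives its RNG state from the base seed.
        worker_seed = torch.initial_seed() % 2**32
        random.seed(worker_seed)
        np.random.seed(worker_seed)

    g = torch.Generator()
    g.manual_seed(0)

    dataset = TensorDataset(torch.randn(128, 64, 16))  # dummy data, just for the sketch
    loader = DataLoader(dataset, batch_size=8, shuffle=True,
                        num_workers=4, worker_init_fn=seed_worker, generator=g)

This keeps the data order repeatable across runs; it does not, of course, remove the atomic-add nondeterminism in the backward pass discussed above.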

I found that setting a larger value for num_workers eliminates the issue of DataLoader worker (pid(s) 15804) exiting unexpectedly, which is similar to setting num_workers to 25 in https://github.com/hustvl/Vim/blob/main/vim/scripts/pt-vim-t.sh.

GuHY777 avatar Jul 22 '24 12:07 GuHY777