
probability bug

Open kyyongh opened this issue 3 years ago • 6 comments

When I use GaussianRBM to run rbm_classification.py on the GPU, these errors appear.

11%|█ | 51/469 [00:01<00:07, 58.24it/s]
C:/cb/pytorch_1000000000000/work/aten/src\ATen/native/cuda/DistributionTemplates.h:591: block: [6,0,0], thread: [416,0,0] Assertion 0 <= p4 && p4 <= 1 failed.
(the same assertion repeats for threads [417,0,0] through [447,0,0])

RuntimeError: CUDA error: device-side assert triggered

But if I run it on the CPU, the error looks like this: RuntimeError: Expected p_in >= 0 && p_in <= 1 to be true, but got false.

It seems like the function hidden_sampling doesn't run correctly.

kyyongh avatar Jul 15 '22 14:07 kyyongh

Hello kyyongh, I hope everything is well with you.

Without knowing your initialization setup, it may be difficult to track such an issue. Even with the lack of additional information, it seems that something is wrong with the temperature parameter (T). However, if you can attach additional information about the running parameters, we can search for the problem.

Best regards, Mateus.

MateusRoder avatar Jul 15 '22 15:07 MateusRoder

Hi Mateus, thank you for replying.

I was running rbm_classification.py, changing only the model to GaussianRBM and keeping the initialization parameters the same as the RBM:

model = GaussianRBM(
    n_visible=784,
    n_hidden=128,
    steps=1,
    learning_rate=0.1,
    momentum=0,
    decay=0,
    temperature=1,
    use_gpu=True,
)

and the above-mentioned error occurred.

When I use the GPU, the error would be:

C:/cb/pytorch_1000000000000/work/aten/src\ATen/native/cuda/DistributionTemplates.h:591: block: [5,0,0], thread: [316,0,0] Assertion 0 <= p4 && p4 <= 1 failed.
(the same assertion repeats for threads [317,0,0] through [319,0,0])
12%|█▏ | 56/469 [00:02<00:16, 25.34it/s]
Traceback (most recent call last):
  File "E:\Document\机器学习\GitRepository\learnergy\examples\applications\bernoulli\rbm_classification.py", line 43, in <module>
    model.fit(train, batch_size=batch_size, epochs=1)
  File "E:\Document\机器学习\GitRepository\learnergy\learnergy\models\gaussian\gaussian_rbm.py", line 214, in fit
    samples = samples.cuda()
RuntimeError: CUDA error: device-side assert triggered

But when I use the CPU, the error would be:

12%|█▏ | 55/469 [00:01<00:12, 31.99it/s]
Traceback (most recent call last):
  File "E:\Document\机器学习\GitRepository\learnergy\examples\applications\bernoulli\rbm_classification.py", line 43, in <module>
    model.fit(train, batch_size=batch_size, epochs=1)
  File "E:\Document\机器学习\GitRepository\learnergy\learnergy\models\gaussian\gaussian_rbm.py", line 217, in fit
    _, _, _, _, visible_states = self.gibbs_sampling(samples)
  File "E:\Document\机器学习\GitRepository\learnergy\learnergy\models\bernoulli\rbm.py", line 348, in gibbs_sampling
    pos_hidden_probs, pos_hidden_states = self.hidden_sampling(v)
  File "E:\Document\机器学习\GitRepository\learnergy\learnergy\models\bernoulli\rbm.py", line 296, in hidden_sampling
    states = torch.bernoulli(probs)
RuntimeError: Expected p_in >= 0 && p_in <= 1 to be true, but got false. (Could this error message be improved? If so, please report an enhancement request to PyTorch.)

It seems something goes wrong when torch.bernoulli is used to get the states in hidden_sampling, but I can't solve this problem by myself.
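For reference, torch.bernoulli raises this exact error whenever any entry of its input falls outside [0, 1], and a NaN entry also fails that check. A minimal snippet of my own (not learnergy code) that reproduces it:

```python
import torch

probs = torch.sigmoid(torch.randn(4))  # valid probabilities in (0, 1)
torch.bernoulli(probs)                 # samples fine

bad = probs.clone()
bad[0] = float("nan")                  # e.g. what an exploding activation could produce
torch.bernoulli(bad)                   # RuntimeError: Expected p_in >= 0 && p_in <= 1 ...
```

So I suspect the probabilities are already invalid before the sampling call.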

In addition, I set num_workers=0 when using the DataLoader in rbm_classification.py:

train_batch = DataLoader(train, batch_size=batch_size, shuffle=False, num_workers=0)
val_batch = DataLoader(test, batch_size=10000, shuffle=False, num_workers=0)

If not, 'RuntimeError: DataLoader worker (pid(s) 2568) exited unexpectedly' is raised. I hope this modification does not cause the errors mentioned above.

Waiting for your reply.

Best regards, kyyongh.

kyyongh avatar Jul 16 '22 07:07 kyyongh


Hello @kyyongh. Thanks for your additional information. I am going to check the GaussianRBM code and re-test it.

In addition, I would like to suggest one experiment to you. I guess that you are running the GaussianRBM on MNIST, right?! Just for our sanity, lower the learning rate to 0.01 or less, just to check whether the hidden activations are exploding in some way (like NaN values in case the gradient explodes).

I await your reply about this little parameter modification, while I keep searching for the bug on my side.

Best regards, Mateus Roder.

MateusRoder avatar Jul 16 '22 13:07 MateusRoder

Hello Mateus. Thanks for your suggestion. After lowering the learning rate to 0.01 or less, the gradient explosion no longer occurs when I run the GaussianRBM on MNIST.

And there is one more thing I want to figure out. 

I've found some other RBM code; most of it updates the parameters manually instead of using nn.Module, following the theory Hinton introduced in his paper. In that approach, the loss function is the log-likelihood, and the gradient comes from the derivative of the log-likelihood. For example, when updating the weights, they use the CD algorithm to approximate the parameter gradients, then update the parameters like w += learning_rate * Δw.
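For example, a schematic of that manual CD-1 update (my own simplified sketch with Bernoulli units; the names are illustrative):

```python
import torch

def cd1_step(W, v_bias, h_bias, v0, lr=0.1):
    """One manual CD-1 update; W has shape (n_visible, n_hidden)."""
    # Positive phase: hidden probabilities driven by the data.
    h0 = torch.sigmoid(v0 @ W + h_bias)
    # Negative phase: one Gibbs step down to the visibles and back up.
    v1 = torch.sigmoid(torch.bernoulli(h0) @ W.t() + v_bias)
    h1 = torch.sigmoid(v1 @ W + h_bias)
    # CD-1 approximation of the log-likelihood gradient, applied as w += lr * Δw.
    W += lr * (v0.t() @ h0 - v1.t() @ h1) / v0.shape[0]
    v_bias += lr * (v0 - v1).mean(dim=0)
    h_bias += lr * (h0 - h1).mean(dim=0)
```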

But I've also noticed that, in learnergy, you use the PyTorch optimizer to compute the gradient of the energy loss. What I'm curious about is: are these two ways equivalent? How does CD work when using the PyTorch optimizer?

Thank you for your generosity. I'm looking forward to your reply.

Best regards, kyyongh.


kyyongh avatar Jul 17 '22 11:07 kyyongh

Hello @kyyongh. Excellent, so we don't have such an issue! Here, all the code runs fine, but I intend to add an error message for the case when the gradient explodes, to facilitate possible error tracking.
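For instance, something along these lines (just a rough sketch; the function name and where it would be called are hypothetical):

```python
import torch

def assert_valid_probs(probs: torch.Tensor, name: str = "probs") -> None:
    # Fail fast with a readable message instead of a device-side assert.
    if torch.isnan(probs).any():
        raise ValueError(f"{name} contains NaN values (possible gradient explosion)")
    if probs.min() < 0 or probs.max() > 1:
        raise ValueError(
            f"{name} has entries outside [0, 1]: "
            f"min={probs.min().item():.4f}, max={probs.max().item():.4f}"
        )
```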

Regarding your second point, it is an interesting question; I will try to explain it simply. In RBMs, we aim to maximize the log-probability of the data given a set of parameters. For that, we employ an energy function to describe the system equilibrium, such that the parameters (weights, biases) are optimized by the derivatives of the log-probability (with respect to such parameters).
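For reference, the standard form of this gradient, which both approaches approximate, is:

$$\frac{\partial \log p(v)}{\partial \theta} = -\left\langle \frac{\partial E(v,h)}{\partial \theta} \right\rangle_{\text{data}} + \left\langle \frac{\partial E(v,h)}{\partial \theta} \right\rangle_{\text{model}}$$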

Great; when we employ the manual derivative, we don't need to make the energy function explicit. On the other hand, if we would like to explore more energy functions easily, we can take such an energy function and apply auto-differentiation with respect to the model parameters, given an observed configuration (the data probability). When we define the free energy (the 'energy' function in learnergy), we are fixing the visible configuration and marginalizing over the hidden probabilities, as you can see here: https://www.youtube.com/watch?v=e0Ts_7Y6hZU&list=PL6Xpj9I5qXYEcOhn7TqghAJ6NAPrNmUBH&index=38. This mathematical manipulation allows us to run CD, up and down the network, and update the network states, instead of using the fixed weight-update equation. After the state updates, we evaluate the network on its free energy (our loss function), and PyTorch can update the parameters with SGD in an easy manner.
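Schematically, the PyTorch-side pattern looks like this (a simplified sketch of the idea with illustrative names, not the exact learnergy code):

```python
import torch
import torch.nn.functional as F

def free_energy(v, W, v_bias, h_bias):
    # Bernoulli RBM free energy: F(v) = -v·b_v - sum_j softplus((v W + b_h)_j).
    return -(v @ v_bias) - F.softplus(v @ W + h_bias).sum(dim=1)

def fit_step(optimizer, v_data, v_model, W, v_bias, h_bias):
    # W, v_bias, h_bias are nn.Parameter tensors registered with the optimizer.
    # v_model comes from k Gibbs steps (the CD chain) and is treated as constant.
    loss = free_energy(v_data, W, v_bias, h_bias).mean() \
         - free_energy(v_model.detach(), W, v_bias, h_bias).mean()
    optimizer.zero_grad()
    loss.backward()   # autodiff recovers the same CD gradient as the manual rule
    optimizer.step()
    return loss.item()
```

Minimizing this difference of free energies with SGD reproduces, in expectation, the same update as the manual rule.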

Now we can summarize: since we are performing the same operations, just in different ways, both methods are equivalent, and the PyTorch approach gives us fast computation and stable learning.

I don't know if my explanation was clear enough. If not, please, let me know.

Best regards, Mateus.

MateusRoder avatar Jul 18 '22 13:07 MateusRoder