LibRecommender
Reproducibility issues
When executing https://github.com/massquantity/LibRecommender/blob/master/examples/pure_ranking_example.py, SVD and NCF (the ones I tested) give a different result on every run.
What has changed compared to the previous versions?
Adding
os.environ['PYTHONHASHSEED'] = str(self.seed)
random.seed(self.seed)
np.random.seed(self.seed)
tf.set_random_seed(self.seed)
torch.manual_seed(self.seed)
to the build_model() function makes things a bit better for SVD, but the behavior of NCF is strange. I checked the initialization of the embeddings and the positive and negative examples being fed to the model, and they are fine.
I don't know what is missing. Do you have any ideas?
Most likely it's the shuffle behavior in the data. You can try adding torch.manual_seed(seed) before all the code.
Hi, first of all thanks for answering and for your great work with the library!
I also added
os.environ['PYTHONHASHSEED'] = str(seed)
random.seed(seed)
np.random.seed(seed)
tf.set_random_seed(seed)
torch.manual_seed(seed)
before all the code.
SVD looks reproducible now. Aside from some non-deterministic behavior in the probability computation due to hardware (i.e., GPUs), I almost always get reproducible results, up to a small approximation.
However, I'm still struggling with NCF. Do you have any other ideas? I checked the initialization of the embeddings and the positive and negative samples fed to the model at each epoch, and they are reproducible. I don't understand what is missing.
What seems strange with NCF is that the first run gives a certain result, and if I re-run the code 3 or 4 times, at some point I get the first result again.
In my case, I can reproduce the results using NCF. I set tf.set_random_seed(seed) in the build_model() function and torch.manual_seed(seed) before all the code. You can also try setting shuffle=False to see what you get.
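For reference, a minimal sketch of how that could look, assuming the constructor and fit() signature used in examples/pure_ranking_example.py (the exact keyword arguments may differ across versions):

from libreco.algorithms import NCF

# hypothetical excerpt; data_info, train_data and eval_data come from the example script
model = NCF("ranking", data_info, embed_size=16, n_epochs=2, batch_size=256)
model.fit(
    train_data,
    neg_sampling=True,
    shuffle=False,  # disable shuffling to isolate the sampler as the source of randomness
    eval_data=eval_data,
)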
I tried with shuffle=False, but the results are still strange.
I created a Jupyter notebook for reference that may help. If you execute it, do you still get reproducible results?
Okay, it seems that adding torch.manual_seed before the code is not effective. However, adding it in the get_batch_loader function works.
import torch
from torch.utils.data import BatchSampler, DataLoader, RandomSampler, SequentialSampler

def get_batch_loader(model, data, neg_sampling, batch_size, shuffle, num_workers=0):
    torch.manual_seed(42)  # RandomSampler below draws its shuffle order from torch's global RNG
    ...
    sampler = RandomSampler(batch_data) if shuffle else SequentialSampler(batch_data)
    batch_sampler = BatchSampler(sampler, batch_size=batch_size, drop_last=False)
    collate_fn = get_collate_fn(model, neg_sampling, num_workers)
    return DataLoader(
        batch_data,
        batch_size=None,  # `batch_size=None` disables automatic batching
        sampler=batch_sampler,
        collate_fn=collate_fn,
        num_workers=num_workers,
    )
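An alternative (just a sketch, not the library's current API) is to give the sampler its own seeded generator instead of reseeding the global RNG, which keeps the seeding local to the loader:

g = torch.Generator()
g.manual_seed(42)
sampler = RandomSampler(batch_data, generator=g) if shuffle else SequentialSampler(batch_data)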
Hmm, I actually tried that, but I still get the previous results with NCF.
By the way, I don't know if you made any recent updates to the library; I downloaded last week's version.
Also, which versions of pandas, numpy, tensorflow, torch, and the other involved libraries are you using?
Yep, you should use the latest commit. I think the recent updates may affect this.
numpy 1.23.4
pandas 1.4.3
tensorflow 2.12.0
torch 2.0.1
scikit-learn 1.1.1
scipy 1.8.1
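For comparison, a quick way to print the same set of versions on your side (a throwaway snippet, assuming all the packages are installed):

import numpy, pandas, tensorflow, torch, sklearn, scipy
for mod in (numpy, pandas, tensorflow, torch, sklearn, scipy):
    print(mod.__name__, mod.__version__)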
My results on NCF:
So I also ran a test with the latest commit and the library versions you listed in your previous comment, but I still get strange results with NCF.
I don't know if it could depend on CUDA... I listed all my Python packages at the end of the attached notebook. Are there any differences from yours?
What about CPU results? I'm using the CPU, since I currently only have access to my laptop. Based on your list, I don't think the packages are an issue now.
You are right: with CPUs only (i.e., setting os.environ["CUDA_VISIBLE_DEVICES"] = "-1") it works.
But then what could be going wrong when using GPUs?
I'm also uncertain about it. Try this: https://wandb.ai/sauravmaheshkar/RSNA-MICCAI/reports/How-to-Set-Random-Seeds-in-PyTorch-and-Tensorflow--VmlldzoxMDA2MDQy
So, I actually tried setting:
random.seed(42)
np.random.seed(42)
#tf.random.set_seed(42) # 'tensorflow.compat.v1.random' has no attribute 'set_seed'
#tf.experimental.numpy.random.seed(42) # 'tensorflow.compat.v1.experimental' has no attribute 'numpy'
tf.set_random_seed(42) # 'tensorflow' has no attribute 'set_random_seed'
os.environ['TF_CUDNN_DETERMINISTIC'] = '1'
#os.environ['TF_DETERMINISTIC_OPS'] = '1' # Determinism is not yet supported in GPU implementation of Scatter ops with ref inputs.
torch.manual_seed(42)
torch.cuda.manual_seed(42)
torch.backends.cudnn.deterministic = True
torch.backends.cudnn.benchmark = False
os.environ['PYTHONHASHSEED'] = str(42)
before executing any code, in the build_model() function, and in the get_batch_loader() function, but I'm still getting non-reproducible results with NCF when training on GPUs.
Moreover, consider that since we are using tensorflow.compat.v1, some features are not supported:
- tf.random.set_seed(42) -> 'tensorflow.compat.v1.random' has no attribute 'set_seed'
- tf.experimental.numpy.random.seed(42) -> 'tensorflow.compat.v1.experimental' has no attribute 'numpy'
And there is a strange error when enabling determinism:
- os.environ['TF_DETERMINISTIC_OPS'] = '1' -> Determinism is not yet supported in GPU implementation of Scatter ops with ref inputs. Consider using resource variables instead if you want to run Scatter when op determinism is enabled. [[{{node Adam/update_embedding/bu_var/ScatterAdd}}]]
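One way around these API differences (a defensive sketch, not something the library requires) is to guard each call with hasattr, so the same snippet runs whether tf resolves to the tf2 namespace or to tf.compat.v1:

import tensorflow as tf

seed = 42
if hasattr(tf.random, "set_seed"):   # TF2-style API
    tf.random.set_seed(seed)
if hasattr(tf, "set_random_seed"):   # TF1 / tf.compat.v1 API
    tf.set_random_seed(seed)
elif hasattr(tf, "compat"):          # plain tf2 module: fall back through compat
    tf.compat.v1.set_random_seed(seed)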
Although we mainly use tf1, the installed package is tf2, which may cause some issues. I think you can set both the tf1 and tf2 seeds:
import tensorflow as tf2
tf2.random.set_seed(42)
tf2.experimental.numpy.random.seed(42)
...
import tensorflow
tensorflow.compat.v1.set_random_seed(42)
...
So, I actually put:
import random
import numpy as np
import torch
import os
import tensorflow as tf2
from libreco.tfops import tf
random.seed(42)
np.random.seed(42)
torch.manual_seed(42)
torch.cuda.manual_seed(42)
torch.backends.cudnn.deterministic = True
torch.backends.cudnn.benchmark = False
tf2.random.set_seed(42)
tf2.experimental.numpy.random.seed(42)
tf.set_random_seed(42)
os.environ['TF_CUDNN_DETERMINISTIC'] = '1'
#os.environ['TF_DETERMINISTIC_OPS'] = '1'  # still commented out due to the Scatter-op error above
os.environ['PYTHONHASHSEED'] = str(42)
before executing any code, as well as in the build_model() and get_batch_loader() functions.
I'm still getting non-reproducible results with NCF when training on GPUs.
What I haven't mentioned earlier is that when training SVD on GPUs, I instead obtain reproducible results...
What's the difference between the models? Aren't they both based on TensorFlow?
Yes, they are both implemented in tf. The main difference is that NCF uses additional dense layers.
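For intuition, here is a simplified graph-mode sketch of the two scoring functions (not the library's actual code): SVD scores a pair with a plain dot product of embeddings, while NCF pushes the concatenated embeddings through extra dense layers, i.e., more GPU ops in the forward and backward pass that each have to behave deterministically:

import tensorflow.compat.v1 as tf

def svd_logits(user_emb, item_emb):
    # SVD: the score is just the dot product of the two embeddings
    return tf.reduce_sum(user_emb * item_emb, axis=1)

def ncf_logits(user_emb, item_emb):
    # NCF: concatenated embeddings go through additional dense layers
    x = tf.concat([user_emb, item_emb], axis=1)
    x = tf.layers.dense(x, 64, activation=tf.nn.relu)
    x = tf.layers.dense(x, 32, activation=tf.nn.relu)
    return tf.squeeze(tf.layers.dense(x, 1), axis=1)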
I've asked ChatGPT, and here is what I got,
In TensorFlow 1.x, setting the random seed for GPU operations requires additional steps compared to setting the random seed for CPU operations. This is because GPU operations involve additional sources of randomness that are not directly controlled by the TensorFlow random seed.
To ensure deterministic behavior with GPU operations in TensorFlow 1.x, you need to follow these steps:
- Set the global random seed: set the random seed using tf.set_random_seed(seed). This will seed the random number generator for CPU operations.
- Configure GPU behavior: to make GPU operations deterministic, you need to set the environment variable CUDA_VISIBLE_DEVICES to restrict TensorFlow to use only one visible GPU. This step is necessary because multiple GPUs might introduce non-determinism due to their asynchronous nature.
Additionally, you can set the environment variable TF_CUDNN_USE_AUTOTUNE to 0 to disable cuDNN's auto-tuner, which can introduce non-determinism in cuDNN-based operations.
Here's how you can set the random seed for GPU operations in TensorFlow 1.x:
import tensorflow.compat.v1 as tf
# Set the random seed for TensorFlow CPU operations
seed = 42
tf.set_random_seed(seed)
# Configure GPU behavior to make it deterministic
import os
os.environ["CUDA_VISIBLE_DEVICES"] = "0" # Set this to the GPU you want to use (e.g., GPU 0)
# Optionally, disable cuDNN's auto-tuner
os.environ["TF_CUDNN_USE_AUTOTUNE"] = "0"
Please note that the steps mentioned above are specifically for TensorFlow 1.x. In TensorFlow 2.x, the process for setting random seeds and ensuring determinism with GPU operations has been simplified. In TensorFlow 2.x, you can typically achieve determinism by setting the random seed without additional steps for GPU operations. However, the specific details may vary depending on the version and the GPU backend being used.
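As a side note on that last point: TF 2.9+ does expose a single documented switch for op determinism. Whether it helps a tf.compat.v1 graph may depend on the version, but it could be worth a try:

import tensorflow as tf

tf.keras.utils.set_random_seed(42)  # seeds Python's random, NumPy, and TF in one call (TF >= 2.7)
tf.config.experimental.enable_op_determinism()  # raises an error instead of silently running a non-deterministic op (TF >= 2.9)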
Ok... So I also added:
os.environ["CUDA_VISIBLE_DEVICES"] = "0"
os.environ["TF_CUDNN_USE_AUTOTUNE"] = "0"
as per ChatGPT's suggestion.
But I'm still struggling with NCF, unfortunately :(
Yeah, looks like there is nothing we can do about this problem :)
We can only cry XD. Anyway, besides this curious issue, the library is very good. I'm going to use it for an incremental training project.
I found that with batch_size=256 I get non-reproducible results, but when reducing batch_size to 128 I get (almost) reproducible ones.
LOL