Fused Neighbor Sampling reproducibility issue
Setting the seed and repeating fused neighbor sampling for the same source nodes does not reproduce the same subgraph. I have identified a fix that is slower but makes the sampled subgraphs reproducible.
To Reproduce
Steps to reproduce the behavior:
- Set OMP_NUM_THREADS to >= 2
- Define NeighborSampler with fused=True
- Set dgl.seed, dgl.random.seed before calling dgl.dataloading.NeighborSampler.sample_blocks
- Repeat step 3. and compare blocks[0].srcdata['feat']
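The steps above can be sketched roughly as follows. The graph, fanouts, and seed values are illustrative only, and the blocks are compared here via their input-node IDs (`srcdata[dgl.NID]`) rather than `'feat'`, since features are only attached to blocks after prefetching; the divergence shows up the same way in both:

```python
# Run with OMP_NUM_THREADS >= 2 set in the environment before launching.
import torch
import dgl

# Small random graph; sizes are arbitrary, chosen only for illustration.
g = dgl.rand_graph(1000, 20000)
seed_nodes = torch.arange(10)

def sample_once():
    # Re-seed before every call so any divergence comes from threading alone.
    dgl.seed(42)
    dgl.random.seed(42)
    torch.manual_seed(42)
    sampler = dgl.dataloading.NeighborSampler([5, 5], fused=True)
    _, _, blocks = sampler.sample_blocks(g, seed_nodes)
    # Input-node IDs of the outermost block; 'feat' would be reindexed by
    # these same IDs, so comparing them exposes the same nondeterminism.
    return blocks[0].srcdata[dgl.NID]

ids_a = sample_once()
ids_b = sample_once()
# With >= 2 OpenMP threads these can differ run to run (the reported bug).
print(torch.equal(ids_a, ids_b))
```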
The following change makes the steps above produce the same output:
diff --git a/src/graph/sampling/neighbor/neighbor.cc b/src/graph/sampling/neighbor/neighbor.cc
index 5393a200..471600dd 100644
--- a/src/graph/sampling/neighbor/neighbor.cc
+++ b/src/graph/sampling/neighbor/neighbor.cc
@@ -14,6 +14,7 @@
#include <tuple>
#include <utility>
+#include <execution>
#include "../../../array/cpu/concurrent_id_hash_map.h"
#include "../../../c_api_common.h"
@@ -488,6 +489,7 @@ SampleNeighborsFused(
}
}
IdType offset = new_nodes_vec[lhs_node_type].size();
+ auto length = offset;
new_nodes_vec[lhs_node_type].resize(global_prefix_col.back());
for (int thread_id = 0; thread_id < num_threads_col; ++thread_id) {
memcpy(
@@ -496,6 +498,7 @@ SampleNeighborsFused(
src_nodes_local[thread_id].size() * sizeof(IdType));
offset += src_nodes_local[thread_id].size();
}
+ std::sort(std::execution::par_unseq, new_nodes_vec[lhs_node_type].begin() + length, new_nodes_vec[lhs_node_type].end());
}
}
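The root cause is that each thread's chunk of sampled IDs is deterministic, but the order in which threads publish their chunks into the shared vector depends on scheduling; the added `std::sort` canonicalizes the freshly appended region so the final ordering no longer depends on thread arrival order. A minimal stdlib sketch of the same idea (unrelated to DGL internals, names are illustrative):

```python
import threading
import random

def parallel_collect(num_threads=4, chunk=256):
    """Simulate threads publishing per-thread ID chunks into a shared vector.

    Each thread's own chunk is deterministic (seeded per-thread RNG), but
    the order in which threads append to the shared list depends on
    scheduling, mimicking the concatenation in SampleNeighborsFused.
    """
    shared = []
    lock = threading.Lock()

    def worker(tid):
        rng = random.Random(tid)  # per-thread RNG: chunk contents are fixed
        ids = [rng.randrange(10**6) for _ in range(chunk)]
        with lock:
            shared.extend(ids)  # arrival order can vary run to run

    threads = [threading.Thread(target=worker, args=(t,)) for t in range(num_threads)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return shared

# The raw orderings may differ between runs, but sorting the freshly
# appended region (here, the whole list) canonicalizes them -- the same
# idea as the std::sort added after the memcpy loop in the patch.
a, b = parallel_collect(), parallel_collect()
print(sorted(a) == sorted(b))  # -> True
```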
This makes the sampling slower but reproducible. Would there be any alternative way to make multithreaded fused sampling reproducible?
Environment
- DGL Version (e.g., 1.0): 2.4
- Backend Library & Version : Torch 2.3.1
- OS (e.g., Linux): Linux
- How you installed DGL (conda, pip, source): source
- Build command you used (if compiling from source): cmake -DBUILD_TYPE='release' -DUSE_LIBXSMM=ON -DUSE_OPENMP=ON -DBUILD_SPARSE=OFF -DUSE_EPOLL=OFF ..
- Python version: 3.9
- CPU: Intel Xeon
Yes, this is a known issue: our implementation of concurrent_id_hash_map can't guarantee determinism while maintaining high performance. Feel free to file a PR that adds an if-else branch to use a deterministic path when the user specifies a random seed, or adds another flag to control the behavior.
https://github.com/dmlc/dgl/blob/master/graphbolt/src/concurrent_id_hash_map.cc
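One way the suggested flag-gated branch could look, sketched in Python rather than the C++ hash map (the function `remap_ids` and its signature are hypothetical, for illustration only):

```python
def remap_ids(local_chunks, deterministic=False):
    """Concatenate per-thread ID chunks; optionally canonicalize the order.

    Hypothetical sketch of the suggested if-else branch: when the caller
    asks for determinism (e.g. because a random seed was set), pay for an
    extra sort; otherwise keep the fast, scheduling-dependent order.
    """
    merged = [i for chunk in local_chunks for i in chunk]
    if deterministic:
        merged.sort()  # slower, but independent of thread arrival order
    return merged
```

The fast path leaves the concatenation order as-is, so existing users see no slowdown unless they opt in to determinism.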
GraphBolt's hashmap should be deterministic, both for CUDA and CPU implementations.
This issue has been automatically marked as stale due to lack of activity. It will be closed if no further activity occurs. Thank you.