Fused Neighbor Sampling reproducibility issue
Setting the seed and repeating fused neighbor sampling for the same source nodes does not reproduce the same subgraph. I have identified a fix that is slower but makes the sampled subgraphs reproducible.
To Reproduce
Steps to reproduce the behavior:
- Set OMP_NUM_THREADS to >= 2
- Define NeighborSampler with fused=True
- Set dgl.seed, dgl.random.seed before calling dgl.dataloading.NeighborSampler.sample_blocks
- Repeat step 3. and compare blocks[0].srcdata['feat']
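The steps above can be sketched roughly as follows. The graph, fanouts, and seed values are illustrative only, and the blocks are compared here via their input-node IDs (`srcdata[dgl.NID]`) rather than `'feat'`, since features are only attached to blocks after prefetching; the divergence shows up the same way in both:

```python
# Run with OMP_NUM_THREADS >= 2 set in the environment before launching.
import torch
import dgl

# Small random graph; sizes are arbitrary, chosen only for illustration.
g = dgl.rand_graph(1000, 20000)
seed_nodes = torch.arange(10)

def sample_once():
    # Re-seed before every call so any divergence comes from threading alone.
    dgl.seed(42)
    dgl.random.seed(42)
    torch.manual_seed(42)
    sampler = dgl.dataloading.NeighborSampler([5, 5], fused=True)
    _, _, blocks = sampler.sample_blocks(g, seed_nodes)
    # Input-node IDs of the outermost block; 'feat' would be reindexed by
    # these same IDs, so comparing them exposes the same nondeterminism.
    return blocks[0].srcdata[dgl.NID]

ids_a = sample_once()
ids_b = sample_once()
# With >= 2 OpenMP threads these can differ run to run (the reported bug).
print(torch.equal(ids_a, ids_b))
```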
The following change makes the steps above produce the same output:
diff --git a/src/graph/sampling/neighbor/neighbor.cc b/src/graph/sampling/neighbor/neighbor.cc
index 5393a200..471600dd 100644
--- a/src/graph/sampling/neighbor/neighbor.cc
+++ b/src/graph/sampling/neighbor/neighbor.cc
@@ -14,6 +14,7 @@
#include <tuple>
#include <utility>
+#include <execution>
#include "../../../array/cpu/concurrent_id_hash_map.h"
#include "../../../c_api_common.h"
@@ -488,6 +489,7 @@ SampleNeighborsFused(
}
}
IdType offset = new_nodes_vec[lhs_node_type].size();
+ auto length = offset;
new_nodes_vec[lhs_node_type].resize(global_prefix_col.back());
for (int thread_id = 0; thread_id < num_threads_col; ++thread_id) {
memcpy(
@@ -496,6 +498,7 @@ SampleNeighborsFused(
src_nodes_local[thread_id].size() * sizeof(IdType));
offset += src_nodes_local[thread_id].size();
}
+ std::sort(std::execution::par_unseq, new_nodes_vec[lhs_node_type].begin() + length, new_nodes_vec[lhs_node_type].end());
}
}
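The root cause is that each thread's chunk of sampled IDs is deterministic, but the order in which threads publish their chunks into the shared vector depends on scheduling; the added `std::sort` canonicalizes the freshly appended region so the final ordering no longer depends on thread arrival order. A minimal stdlib sketch of the same idea (unrelated to DGL internals, names are illustrative):

```python
import threading
import random

def parallel_collect(num_threads=4, chunk=256):
    """Simulate threads publishing per-thread ID chunks into a shared vector.

    Each thread's own chunk is deterministic (seeded per-thread RNG), but
    the order in which threads append to the shared list depends on
    scheduling, mimicking the concatenation in SampleNeighborsFused.
    """
    shared = []
    lock = threading.Lock()

    def worker(tid):
        rng = random.Random(tid)  # per-thread RNG: chunk contents are fixed
        ids = [rng.randrange(10**6) for _ in range(chunk)]
        with lock:
            shared.extend(ids)  # arrival order can vary run to run

    threads = [threading.Thread(target=worker, args=(t,)) for t in range(num_threads)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return shared

# The raw orderings may differ between runs, but sorting the freshly
# appended region (here, the whole list) canonicalizes them -- the same
# idea as the std::sort added after the memcpy loop in the patch.
a, b = parallel_collect(), parallel_collect()
print(sorted(a) == sorted(b))  # -> True
```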
This makes the sampling slower but reproducible. Would there be any alternative way to make multithreaded fused sampling reproducible?
Environment
- DGL Version (e.g., 1.0): 2.4
- Backend Library & Version : Torch 2.3.1
- OS (e.g., Linux): Linux
- How you installed DGL (conda, pip, source): source
- Build command you used (if compiling from source): cmake -DBUILD_TYPE='release' -DUSE_LIBXSMM=ON -DUSE_OPENMP=ON -DBUILD_SPARSE=OFF -DUSE_EPOLL=OFF ..
- Python version: 3.9
- CPU: Intel Xeon
Yes, this is a known issue: our implementation of concurrent_id_hash_map can't guarantee determinism while maintaining high performance. Feel free to file a PR that adds an if-else branch to use a deterministic path when the user specifies a random seed, or adds another flag to control the behavior.
https://github.com/dmlc/dgl/blob/master/graphbolt/src/concurrent_id_hash_map.cc
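One way the suggested flag-gated branch could look, sketched in Python rather than the C++ hash map (the function `remap_ids` and its signature are hypothetical, for illustration only):

```python
def remap_ids(local_chunks, deterministic=False):
    """Concatenate per-thread ID chunks; optionally canonicalize the order.

    Hypothetical sketch of the suggested if-else branch: when the caller
    asks for determinism (e.g. because a random seed was set), pay for an
    extra sort; otherwise keep the fast, scheduling-dependent order.
    """
    merged = [i for chunk in local_chunks for i in chunk]
    if deterministic:
        merged.sort()  # slower, but independent of thread arrival order
    return merged
```

The fast path leaves the concatenation order as-is, so existing users see no slowdown unless they opt in to determinism.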
GraphBolt's hashmap should be deterministic, both for CUDA and CPU implementations.
This issue has been automatically marked as stale due to lack of activity. It will be closed if no further activity occurs. Thank you.