torchmd-net
POC: neighbor search
This is a proof of concept. DO NOT MERGE!
- [x] Python wrapper
- [x] CPU
  - [x] Forward pass
- [x] CUDA
  - [x] Forward pass
  - [x] Backward pass
- [x] Tests
- [ ] Documentation
You've structured the kernel so that every thread computes only a single interaction:
```cuda
const int32_t index = blockIdx.x * blockDim.x + threadIdx.x;
if (index >= num_all_pairs) return;
int32_t row = floor((sqrtf(8 * index + 1) + 1) / 2);
if (row * (row - 1) > 2 * index) row--;
const int32_t column = index - row * (row - 1) / 2;
const scalar_t delta_x = positions[row][0] - positions[column][0];
const scalar_t delta_y = positions[row][1] - positions[column][1];
const scalar_t delta_z = positions[row][2] - positions[column][2];
```
Usually it's better to launch fewer thread blocks and have each thread loop over several interactions. For one thing, every thread block carries some overhead. For another, looping opens up a lot of optimizations. In the code above, if you can arrange for each thread to compute multiple pairs in the same row, you can skip recomputing the row and column for every pair, and you only need to load `positions[row]` once.
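To make the suggestion concrete, here is a minimal host-side sketch in plain C++ rather than an actual CUDA kernel, so it can be checked without a GPU. It compares the one-pair-per-index scheme from the quoted code with a row-wise loop that loads `positions[row]` once per row. The names `Vec3`, `pair_from_index`, `squared_distance`, `squared_distances_by_index`, and `squared_distances_by_row` are hypothetical and not taken from this PR, and writing out squared distances is only an assumption for illustration.

```cpp
#include <cassert>
#include <cmath>
#include <cstdint>
#include <utility>
#include <vector>

// Hypothetical stand-in for one row of `positions`.
struct Vec3 { float x, y, z; };

// Map a flat pair index to (row, column) with column < row,
// using the same arithmetic as the quoted kernel.
std::pair<int32_t, int32_t> pair_from_index(int32_t index) {
    int32_t row = static_cast<int32_t>(
        std::floor((std::sqrt(8.0f * index + 1.0f) + 1.0f) / 2.0f));
    if (row * (row - 1) > 2 * index) row--;
    const int32_t column = index - row * (row - 1) / 2;
    return {row, column};
}

float squared_distance(const Vec3& a, const Vec3& b) {
    const float dx = a.x - b.x;
    const float dy = a.y - b.y;
    const float dz = a.z - b.z;
    return dx * dx + dy * dy + dz * dz;
}

// One pair per loop iteration (one pair per thread in the kernel):
// row and column are recomputed and positions[row] is reloaded every time.
std::vector<float> squared_distances_by_index(const std::vector<Vec3>& positions) {
    const int32_t n = static_cast<int32_t>(positions.size());
    const int32_t num_all_pairs = n * (n - 1) / 2;
    std::vector<float> out(num_all_pairs);
    for (int32_t index = 0; index < num_all_pairs; ++index) {
        const std::pair<int32_t, int32_t> rc = pair_from_index(index);
        assert(rc.second < rc.first && rc.first < n);
        out[index] = squared_distance(positions[rc.first], positions[rc.second]);
    }
    return out;
}

// Row-wise loop: no square root per pair, and positions[row] is read once per row.
// In a kernel, the rows (or chunks of a row) would be distributed over threads.
std::vector<float> squared_distances_by_row(const std::vector<Vec3>& positions) {
    const int32_t n = static_cast<int32_t>(positions.size());
    std::vector<float> out;
    out.reserve(n * (n - 1) / 2);
    for (int32_t row = 1; row < n; ++row) {
        const Vec3 p = positions[row];  // loaded once for the whole row
        for (int32_t column = 0; column < row; ++column) {
            out.push_back(squared_distance(p, positions[column]));
        }
    }
    return out;
}
```

Both functions visit the pairs in the same order, so their outputs can be compared element by element. How best to split rows across threads and blocks on a real GPU is not addressed here and would need profiling.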
Of course, it all depends on what size you're optimizing for. With 50 atoms, the number of pairs is much too small to fill a large GPU even with only one pair per thread. For larger systems with thousands of atoms and millions of pairs, it will make more of a difference.
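For a rough sense of the sizes involved, the pair count for N atoms is N(N-1)/2. The sketch below works that out, along with the number of thread blocks a one-pair-per-thread launch would need. The block size of 128 threads is an arbitrary assumption for illustration, and `num_pairs` and `num_blocks` are hypothetical helper names.

```cpp
#include <cassert>
#include <cstdint>

// Number of unique pairs among num_atoms atoms: N * (N - 1) / 2.
inline int64_t num_pairs(int64_t num_atoms) {
    assert(num_atoms >= 0);
    return num_atoms * (num_atoms - 1) / 2;
}

// Thread blocks needed when each thread handles exactly one pair
// (ceiling division; threads_per_block is an assumed launch parameter).
inline int64_t num_blocks(int64_t pairs, int64_t threads_per_block) {
    assert(threads_per_block > 0);
    return (pairs + threads_per_block - 1) / threads_per_block;
}
```

With these definitions, 50 atoms give 1225 pairs, which fit in 10 blocks of 128 threads, while 2000 atoms give 1,999,000 pairs.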
This PR is discontinued. The code is being moved to NNPOps (https://github.com/openmm/NNPOps/pull/58).
This is obsolete.