
POC: neighbor search

Open raimis opened this issue 2 years ago • 2 comments

This is a proof-of-concept. DO NOT MERGE!

  • [x] Python wrapper
  • [x] CPU
    • [x] Forward pass
  • [x] CUDA
    • [x] Forward pass
    • [x] Backward pass
  • [x] Tests
  • [ ] Documentation

raimis avatar Mar 03 '22 17:03 raimis

You've structured the kernel so that every thread computes only a single interaction:

    // One thread per pair: map the linear thread index to a unique
    // (row, column) pair in the lower triangle (column < row).
    const int32_t index = blockIdx.x * blockDim.x + threadIdx.x;
    if (index >= num_all_pairs) return;

    // Invert index = row * (row - 1) / 2 + column in closed form,
    // correcting for floating-point rounding in sqrtf.
    int32_t row = floor((sqrtf(8 * index + 1) + 1) / 2);
    if (row * (row - 1) > 2 * index) row--;
    const int32_t column = index - row * (row - 1) / 2;

    // Displacement between the two atoms of this pair.
    const scalar_t delta_x = positions[row][0] - positions[column][0];
    const scalar_t delta_y = positions[row][1] - positions[column][1];
    const scalar_t delta_z = positions[row][2] - positions[column][2];

Usually it's better to use a smaller number of thread blocks and have each thread loop over interactions. For one thing, there's per-block launch overhead. For another, it opens the door to further optimizations: in the above code, if you arrange for each thread to compute multiple pairs that all lie in the same row, you can skip the row and column computations, and you only need to load positions[row] once.
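As a CPU sketch of that row-batching idea (hypothetical names, not the PR's actual kernel; a real CUDA version would use a grid-stride loop over rows or row chunks), each worker takes one whole row, loads the row's position once, and never needs the per-pair index decode:

```cpp
#include <array>
#include <cassert>
#include <cmath>
#include <vector>

using Vec3 = std::array<double, 3>;

// CPU analogue of the row-batched strategy: one "thread" per row,
// looping over every column in that row. positions[row] is loaded
// once per row and reused, and no closed-form decode is required.
std::vector<double> pair_distances_by_row(const std::vector<Vec3>& positions) {
    const int n = static_cast<int>(positions.size());
    std::vector<double> distances(n * (n - 1) / 2);
    for (int row = 1; row < n; row++) {        // in CUDA: row derived from blockIdx/threadIdx, grid-strided
        const Vec3 p = positions[row];         // loaded once per row
        const int base = row * (row - 1) / 2;  // offset of this row's pairs in the flat output
        for (int column = 0; column < row; column++) {
            const double dx = p[0] - positions[column][0];
            const double dy = p[1] - positions[column][1];
            const double dz = p[2] - positions[column][2];
            distances[base + column] = std::sqrt(dx * dx + dy * dy + dz * dz);
        }
    }
    return distances;
}
```

Note that row lengths vary, so a production kernel would balance work across threads (e.g. fixed-size chunks of pairs per thread within a row) rather than assigning entire rows naively.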

Of course, it all depends on what system size you're optimizing for. With 50 atoms, the number of pairs is much too small to fill a large GPU even with only one pair per thread. For larger systems with thousands of atoms and millions of pairs, it will make more of a difference.

peastman avatar Apr 25 '22 17:04 peastman

This PR is discontinued. The code is being moved to NNPOps (https://github.com/openmm/NNPOps/pull/58).

raimis avatar May 18 '22 14:05 raimis

This is obsolete.

raimis avatar Jan 04 '24 15:01 raimis