Sampling training data takes tens of minutes per epoch on Linux with an A800
With decode 0.10.2 and the same parameter.yaml file, sampling training data on Linux with an A800 is much slower than on Windows with a GTX 3080 Ti. Both runs use the simulation parameters below:
Hardware:
  device: cuda:0
  device_ix: 0
  device_simulation: cuda:0
  num_worker_train: 1
  torch_multiprocessing_sharing_strategy: null
  torch_threads: 4
  unix_niceness: 0
Simulation:
  bg_uniform:
    - 40.0
    - 60.0
  density: null
  emitter_av: 250
  emitter_extent:
    - - -0.5
      - 63.5
    - - -0.5
      - 63.5
    - - -2000
      - 2000
  img_size:
    - 64
    - 64
  intensity_mu_sig:
    - 3000.0
    - 100.0
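To rule out a configuration mismatch between the two machines, here is a minimal sketch (assuming PyYAML is available and the file is the parameter.yaml mentioned above; adjust the path as needed) that loads and prints the relevant sections on each machine:

import yaml

# Sanity check: both machines should read identical Hardware and Simulation
# settings from the same parameter file.
with open("parameter.yaml") as f:
    param = yaml.safe_load(f)

print(param["Hardware"])
print(param["Simulation"]["emitter_av"], param["Simulation"]["img_size"])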
On Windows with a GTX 3080 Ti, sampling training data takes about 8 seconds per epoch during training. On Linux with an A800, however, it takes tens of minutes (I did not wait for the sampling within an epoch to finish because it took too long). I looked into the code, added print statements at key points, and found that execution is very slow at this line:
frames = self._spline_impl.forward_frames(*self.img_shape,
                                          frame_ix,
                                          n_frames,
                                          xyz_r[:, 0],
                                          xyz_r[:, 1],
                                          xyz_r[:, 2],
                                          ix[:, 0],
                                          ix[:, 1],
                                          weight)
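To put a number on the slowdown, one could wrap that call in a simple timer on both machines and compare. A minimal sketch; the time_block helper and its placement are my own, not part of decode:

import time
from contextlib import contextmanager

@contextmanager
def time_block(label):
    # Wall-clock timer around a block of code; prints the elapsed time so the
    # Windows and Linux runs can be compared directly.
    t0 = time.perf_counter()
    yield
    print(f"{label}: {time.perf_counter() - t0:.3f} s")

# Hypothetical placement around the call above, in a local copy of the file:
# with time_block("forward_frames"):
#     frames = self._spline_impl.forward_frames(*self.img_shape, frame_ix, n_frames,
#                                               xyz_r[:, 0], xyz_r[:, 1], xyz_r[:, 2],
#                                               ix[:, 0], ix[:, 1], weight)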
However, nvidia-smi showed GPU utilization consistently at 100%, which is strange. I also checked the spline library specifically and found that it was compiled for sm_37. Could this be the reason for the performance issue? Then again, sm_37-compiled code does not hurt performance on Windows with the GTX 3080 Ti. Recompiling to test whether this is the problem is quite difficult for me, so I would appreciate your help.
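In case it helps with diagnosing, here is a small sketch for checking the GPU's compute capability and which architectures the binaries were built for; the `spline` import name and the .so path are assumptions on my side:

import torch

# The A800 should report compute capability (8, 0) (sm_80); the GTX 3080 Ti
# reports (8, 6). get_arch_list() shows the architectures this PyTorch build
# was compiled for (the spline extension is a separate library).
print(torch.cuda.get_device_name(0))
print("compute capability:", torch.cuda.get_device_capability(0))
print("torch arch list:", torch.cuda.get_arch_list())

# Assuming the spline PSF extension is importable as `spline`, its location
# can be found with:
#   import spline; print(spline.__file__)
# With the CUDA toolkit installed, cuobjdump shows which SASS/PTX is embedded
# in that shared object (the path below is a placeholder):
#   cuobjdump --list-elf /path/to/spline_extension.so
#   cuobjdump --list-ptx /path/to/spline_extension.so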