
Pre-computed files need to be regenerated for each set of parameters

Open shervinea opened this issue 4 years ago • 0 comments

Context. Real-time PDB parsing with the BioPython package (see, e.g., https://github.com/shervinea/enzynet/blob/31d30e0272e0c9425e0c76085761f211b89f8b7c/enzynet/pdb.py#L53) is expensive and bottlenecks the training process if done on the fly.
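For reference, a minimal sketch (not the repo's code; the PDB id, path, and atom selection are placeholders) of the kind of Biopython parse this line performs, which becomes costly when repeated for every enzyme at every configuration:

```python
from Bio.PDB import PDBParser

parser = PDBParser(QUIET=True)  # suppress PDB construction warnings
structure = parser.get_structure("2Q3Z", "files/PDB/2Q3Z.pdb")  # hypothetical path

# Collect 3D coordinates of selected atoms; this full parse is the expensive part.
coords = [atom.get_coord()              # numpy array of shape (3,)
          for atom in structure.get_atoms()
          if atom.get_name() == "CA"]   # e.g. keep alpha carbons only
```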

For this reason, we put in place a "precomputation stage" https://github.com/shervinea/enzynet/blob/31d30e0272e0c9425e0c76085761f211b89f8b7c/enzynet/volume.py#L123 that processes all enzymes beforehand and stores their target volumes in a dedicated folder.
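As a rough illustration (folder layout and names are assumptions, not the repo's actual API), precomputing and storing a target volume amounts to something like this for one enzyme and one parameter set, so the whole pass must be re-run for every other parameter set:

```python
import os
import numpy as np

weight_type, p, v_size = "hydropathy", 5, 64                      # one parameter set
out_dir = f"files/precomputed/{weight_type}_p{p}_v{v_size}"       # hypothetical layout
os.makedirs(out_dir, exist_ok=True)

volume = np.zeros((v_size, v_size, v_size), dtype=np.float32)     # stand-in for the real volume
np.save(os.path.join(out_dir, "2Q3Z.npy"), volume)                # one file per enzyme, per parameter set
```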

Current limitation. This process is repeated for each set of parameters {weights considered, interpolation level between atoms p, volume size}. This is inefficient from the perspectives of:

  • total computations performed: PDB parsing is identical across all of these configurations, yet it has to be repeated for each of them. The only remaining operations are relatively cheap (e.g. the 2D -> 3D mapping, point interpolation) and, with a proper implementation, can easily be done on the fly without becoming a bottleneck (see the sketch after this list).
  • space: the number/size of produced files grows at the same pace as the number of configurations that the user tries out (!).
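To make the first point concrete, here is a rough sketch of those remaining steps under assumed conventions (the function name, scaling choice, and the omission of the p interpolation are simplifications, not the actual enzynet implementation): already-parsed coordinates are mapped into a voxel grid of side v_size and per-atom weights are deposited.

```python
import numpy as np

def coords_to_volume(coords, weights, v_size=32):
    """Map an (n_atoms, 3) point cloud into a v_size^3 grid, accumulating weights."""
    coords = np.asarray(coords, dtype=float)
    coords -= coords.mean(axis=0)                       # center the point cloud
    scale = (v_size / 2 - 1) / np.abs(coords).max()     # fit it inside the grid
    voxels = np.floor(coords * scale).astype(int) + v_size // 2

    volume = np.zeros((v_size, v_size, v_size), dtype=float)
    for (x, y, z), w in zip(voxels, weights):
        volume[x, y, z] += w                            # chosen weighting scheme
    return volume
```

Operations of this kind run in O(n_atoms) per enzyme and are cheap enough to live inside the data generator.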

Desired behavior. Coordinates + weights precomputation from PDB files is done only once and produces a parsed version of the data that is:

  1. Light enough that it can be transformed into target volumes on the fly
  2. Complete enough that the data for all configurations can be derived from it (one possible format is sketched below).
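One possible shape for that single cache, as a hedged sketch (file layout, key names, and weighting schemes are assumptions rather than a final design): store only the coordinates and the raw per-atom quantities from which every weighting scheme can be derived, then build volumes for any {weights, p, v_size} in the data generator.

```python
import os
import numpy as np

def cache_enzyme(pdb_id, coords, charges, hydropathies, out_dir="files/parsed"):
    """Serialize one enzyme's parsed data once, independently of volume parameters."""
    os.makedirs(out_dir, exist_ok=True)
    np.savez_compressed(os.path.join(out_dir, f"{pdb_id}.npz"),
                        coords=coords,            # (n_atoms, 3) float array
                        charge=charges,           # per-atom weights, scheme 1
                        hydropathy=hydropathies)  # per-atom weights, scheme 2

def load_enzyme(pdb_id, weight_type=None, out_dir="files/parsed"):
    """Return coordinates plus the requested weighting (binary weights if None)."""
    data = np.load(os.path.join(out_dir, f"{pdb_id}.npz"))
    coords = data["coords"]
    weights = data[weight_type] if weight_type else np.ones(len(coords))
    return coords, weights  # fed to an on-the-fly voxelization at train time
```

With a format along these lines, PDB parsing happens exactly once per enzyme, and every configuration only re-reads a small per-enzyme file.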

shervinea · Sep 10 '21 04:09