
Pre-computed files need to be regenerated for each set of parameters

Open shervinea opened this issue 4 years ago • 0 comments

Context. Real-time PDB parsing with the BioPython package (see, e.g., https://github.com/shervinea/enzynet/blob/31d30e0272e0c9425e0c76085761f211b89f8b7c/enzynet/pdb.py#L53) is expensive and bottlenecks the training process if done on the fly.
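For reference, a minimal sketch (not the repo's code; the PDB id, path, and atom selection are placeholders) of the kind of Biopython parse this line performs, which becomes costly when repeated for every enzyme at every configuration:

```python
from Bio.PDB import PDBParser

parser = PDBParser(QUIET=True)  # suppress PDB construction warnings
structure = parser.get_structure("2Q3Z", "files/PDB/2Q3Z.pdb")  # hypothetical path

# Collect 3D coordinates of selected atoms; this full parse is the expensive part.
coords = [atom.get_coord()              # numpy array of shape (3,)
          for atom in structure.get_atoms()
          if atom.get_name() == "CA"]   # e.g. keep alpha carbons only
```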

For this reason, we put in place a "precomputation stage" https://github.com/shervinea/enzynet/blob/31d30e0272e0c9425e0c76085761f211b89f8b7c/enzynet/volume.py#L123 that processes all enzymes beforehand and stores their target volumes in a dedicated folder.
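As a rough illustration (folder layout and names are assumptions, not the repo's actual API), precomputing and storing a target volume amounts to something like this for one enzyme and one parameter set, so the whole pass must be re-run for every other parameter set:

```python
import os
import numpy as np

weight_type, p, v_size = "hydropathy", 5, 64                      # one parameter set
out_dir = f"files/precomputed/{weight_type}_p{p}_v{v_size}"       # hypothetical layout
os.makedirs(out_dir, exist_ok=True)

volume = np.zeros((v_size, v_size, v_size), dtype=np.float32)     # stand-in for the real volume
np.save(os.path.join(out_dir, "2Q3Z.npy"), volume)                # one file per enzyme, per parameter set
```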

Current limitation. This process is repeated for each set of parameters {weights considered, interpolation level between atoms p, volume size}. This is inefficient from the perspectives of:

  • total computations performed: PDB parsing is identical across all of these configurations, yet it has to be repeated for each of them. The only remaining operations are relatively cheap (e.g. the 2D -> 3D mapping, point interpolation) and, with a proper implementation, can easily be done on the fly without becoming a bottleneck (see the sketch after this list).
  • space: the number/size of produced files grows at the same pace as the number of configurations that the user tries out (!).
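To make the first point concrete, here is a rough sketch of those remaining steps under assumed conventions (the function name, scaling choice, and the omission of the p interpolation are simplifications, not the actual enzynet implementation): already-parsed coordinates are mapped into a voxel grid of side v_size and per-atom weights are deposited.

```python
import numpy as np

def coords_to_volume(coords, weights, v_size=32):
    """Map an (n_atoms, 3) point cloud into a v_size^3 grid, accumulating weights."""
    coords = np.asarray(coords, dtype=float)
    coords -= coords.mean(axis=0)                       # center the point cloud
    scale = (v_size / 2 - 1) / np.abs(coords).max()     # fit it inside the grid
    voxels = np.floor(coords * scale).astype(int) + v_size // 2

    volume = np.zeros((v_size, v_size, v_size), dtype=float)
    for (x, y, z), w in zip(voxels, weights):
        volume[x, y, z] += w                            # chosen weighting scheme
    return volume
```

Operations of this kind run in O(n_atoms) per enzyme and are cheap enough to live inside the data generator.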

Desired behavior. Coordinates + weights precomputation from PDB files is done only once and produces a parsed version of the data that is:

  1. Light enough that it can be transformed into target volumes on the fly
  2. Complete enough that the data for all configurations can be derived from it (one possible format is sketched below).
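One possible shape for that single cache, as a hedged sketch (file layout, key names, and weighting schemes are assumptions rather than a final design): store only the coordinates and the raw per-atom quantities from which every weighting scheme can be derived, then build volumes for any {weights, p, v_size} in the data generator.

```python
import os
import numpy as np

def cache_enzyme(pdb_id, coords, charges, hydropathies, out_dir="files/parsed"):
    """Serialize one enzyme's parsed data once, independently of volume parameters."""
    os.makedirs(out_dir, exist_ok=True)
    np.savez_compressed(os.path.join(out_dir, f"{pdb_id}.npz"),
                        coords=coords,            # (n_atoms, 3) float array
                        charge=charges,           # per-atom weights, scheme 1
                        hydropathy=hydropathies)  # per-atom weights, scheme 2

def load_enzyme(pdb_id, weight_type=None, out_dir="files/parsed"):
    """Return coordinates plus the requested weighting (binary weights if None)."""
    data = np.load(os.path.join(out_dir, f"{pdb_id}.npz"))
    coords = data["coords"]
    weights = data[weight_type] if weight_type else np.ones(len(coords))
    return coords, weights  # fed to an on-the-fly voxelization at train time
```

With a format along these lines, PDB parsing happens exactly once per enzyme, and every configuration only re-reads a small per-enzyme file.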

shervinea · Sep 10 '21 04:09