Pre-computed files need to be regenerated for each set of parameters
Context. Real-time PDB parsing with the BioPython package (as done e.g. in https://github.com/shervinea/enzynet/blob/31d30e0272e0c9425e0c76085761f211b89f8b7c/enzynet/pdb.py#L53) is expensive and bottlenecks the training process if done on the fly.
For this reason, we put in place a "precomputation stage" (https://github.com/shervinea/enzynet/blob/31d30e0272e0c9425e0c76085761f211b89f8b7c/enzynet/volume.py#L123) that processes all enzymes ahead of time and stores the target volumes in a dedicated folder.
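For reference, below is a minimal sketch of the expensive parsing step, assuming BioPython's `Bio.PDB` module; it is a generic illustration of PDB parsing, not the actual code in `pdb.py`:

```python
# Minimal sketch of the expensive step: parsing a PDB file with BioPython.
# Generic illustration only, not the repo's pdb.py implementation.
import numpy as np
from Bio.PDB import PDBParser

def parse_pdb(pdb_path):
    """Extract per-atom 3D coordinates and element symbols from a PDB file."""
    structure = PDBParser(QUIET=True).get_structure('enzyme', pdb_path)
    atoms = list(structure.get_atoms())
    coords = np.array([atom.get_coord() for atom in atoms])  # shape (n_atoms, 3)
    elements = [atom.element for atom in atoms]              # e.g. 'C', 'N', 'O'
    return coords, elements
```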
Current limitation. This process is repeated for each set of parameters {weights considered, interpolation level between atoms p, volume size}. This is inefficient from two perspectives:
- total computations performed: PDB parsing is identical for all these configurations, yet it is repeated for each of them. The remaining operations (e.g. coordinate-to-volume mapping, point interpolation) are relatively cheap and, with a proper implementation, can easily be done on the fly without becoming a bottleneck (see the sketch after this list).
- space: the number and size of produced files grow at the same pace as the number of configurations the user tries out (!).
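To make the first point concrete, here is a rough sketch of the cheap per-configuration step, assuming already-centered coordinates and simple nearest-voxel assignment (interpolation between consecutive atoms is omitted); the actual weighting and interpolation logic in `volume.py` may differ:

```python
# Rough sketch of the cheap per-configuration step: binning already-centered
# 3D coordinates into a voxel grid. Nearest-voxel assignment only; the repo's
# actual weighting/interpolation may differ.
import numpy as np

def voxelize(coords, v_size=32, weights=None):
    """Map centered (n_atoms, 3) coordinates to a (v_size, v_size, v_size) volume.

    weights: optional per-atom values (e.g. hydropathy or charge); when None,
    a binary occupancy volume is produced.
    """
    volume = np.zeros((v_size, v_size, v_size), dtype=np.float32)
    indices = np.round(coords + v_size / 2).astype(int)   # origin -> grid center
    inside = np.all((indices >= 0) & (indices < v_size), axis=1)
    indices = indices[inside]
    values = (np.ones(len(indices), dtype=np.float32) if weights is None
              else np.asarray(weights, dtype=np.float32)[inside])
    volume[indices[:, 0], indices[:, 1], indices[:, 2]] = values
    return volume
```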
Desired behavior. Coordinates + weights precomputation from PDB files is performed only once and produces a parsed version of the data that is (see the sketch after this list):
- Light enough that it can be transformed into target volumes on the fly
- Complete enough that the data for all configurations can be derived from it.
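One possible implementation sketch, reusing `parse_pdb` and `voxelize` from the sketches above; the per-enzyme `.npz` format and the stored fields (coordinates + element symbols) are assumptions, not a final design:

```python
# Possible shape of the one-time precomputation and the on-the-fly derivation.
# The .npz format and stored fields are assumptions for illustration.
import numpy as np

def precompute_once(pdb_path, out_path):
    """Parse a PDB file a single time and store the lightweight result on disk."""
    coords, elements = parse_pdb(pdb_path)  # expensive step, run only once per enzyme
    np.savez_compressed(out_path, coords=coords, elements=np.array(elements))

def load_and_voxelize(npz_path, v_size, weight_fn=None):
    """Derive a target volume for any configuration from the stored parse, on the fly."""
    data = np.load(npz_path)
    coords, elements = data['coords'], data['elements']
    # weight_fn is a hypothetical mapping from element symbol to a weight;
    # None falls back to binary occupancy.
    weights = None if weight_fn is None else np.array([weight_fn(e) for e in elements])
    return voxelize(coords, v_size=v_size, weights=weights)  # cheap per-sample step
```

With this split, only one parsed dataset is stored on disk, and each training-time data generator call derives the volume for whatever {weights, p, volume size} configuration is currently in use.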