Export of raw parameters as numpy array, plus some minor fixes

Open kWeissenow opened this issue 5 years ago • 0 comments

A common use of plmDCA nowadays is to feed the raw Potts model parameters into machine learning models, especially deep learning systems, to infer contact or distance maps. The most recent and prominent example is DeepMind's AlphaFold, the winner of CASP13. CCMpred is widely used because of its GPU acceleration, but it has the drawback of writing the raw parameters to a text file, which can be huge (>10 GB) for longer proteins. Machine learning pipelines almost always expect numpy arrays as input. These are binary representations, so they are more compact than text and faster to load.

I've implemented an option to write the raw parameters directly to numpy arrays with the command line switch '-y'. This removes the additional step of parsing the text output to generate a binary representation. For long proteins, this makes a huge difference: on a Tesla V100, an MSA with 50k sequences of a protein with 820 residues took 26m13s to process in the traditional way (CCMpred -> raw text file -> parsing the file to generate a numpy array), whereas running CCMpred and writing a numpy array directly with my implementation took only 16m20s. The speedups are less dramatic for smaller proteins around the average length of 200-300 residues, but still amount to 1-2 minutes saved per sample. For my current dataset, which contains ~80k MSAs, I expect to save multiple weeks of computation time.
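To illustrate the two routes compared above, here is a minimal sketch in Python. It does not reproduce CCMpred's actual raw output format or the '-y' implementation: the tensor shape `(L, L, Q, Q)`, the file names, and the flat text layout are assumptions chosen only to show the extra parse step of the text route next to a direct `.npy` round trip.

```python
import os
import tempfile

import numpy as np

# Hypothetical dimensions: L residues, Q = 21 states (20 amino acids + gap).
# The (L, L, Q, Q) layout is an assumption for illustration only.
L, Q = 12, 21
rng = np.random.default_rng(0)
couplings = rng.standard_normal((L, L, Q, Q)).astype(np.float32)

with tempfile.TemporaryDirectory() as tmp:
    txt_path = os.path.join(tmp, "couplings.txt")  # hypothetical file name
    npy_path = os.path.join(tmp, "couplings.npy")  # hypothetical file name

    # Text route: write the values as text, then parse them back into an array.
    np.savetxt(txt_path, couplings.reshape(L * L * Q, Q), fmt="%.6g")
    from_text = np.loadtxt(txt_path, dtype=np.float32).reshape(L, L, Q, Q)

    # Binary route: write and read the array directly, with no parsing step.
    np.save(npy_path, couplings)
    from_npy = np.load(npy_path)

    txt_size = os.path.getsize(txt_path)
    npy_size = os.path.getsize(npy_path)

print(f"text file: {txt_size} bytes, npy file: {npy_size} bytes")
print("binary round trip is exact:", np.array_equal(from_npy, couplings))
```

In this sketch the binary round trip preserves the float32 values exactly, while the text route is limited by the printed precision and has to parse every number again on load.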

Since I assume that CCMpred is used for exactly this kind of workflow in many structure prediction research projects, I kindly invite you to integrate this addition into the main repository.

kWeissenow, Feb 27 '20 09:02