
Neighborhood based paralog splitting does not finish

Open marade opened this issue 4 years ago • 2 comments

For ~200 bacterial genomes of ~6 Mb each, the neighborhood based paralog splitting step alone is taking over 24 hours on a c5.2xlarge EC2 instance, while the previous steps finished in a timely fashion. Notably, CPU usage for the entire period is very low (less than 1%), while memory usage remains fairly constant at 40%, which suggests the process is waiting on something other than the CPU.

marade avatar Jan 16 '20 04:01 marade

Hi, thank you for the report. This is certainly much slower than in my tests. From your description, the bottleneck is most likely in the I/O.

PEPPA writes and reads a lot of data from the file system. This was not an issue in my tests, even when I used a mounted network drive, but I have not tested it on an AWS instance yet. I have updated PEPPA slightly to optimize its I/O performance. However, please do not expect too much from this change.

zheminzhou avatar Jan 16 '20 14:01 zheminzhou

Thanks, I appreciate the prompt support. Perhaps you could add some sort of debugging capability so that the issue can be isolated? I'm not eager to run something for hours and not get an answer.

marade avatar Jan 16 '20 20:01 marade
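
One way the kind of debugging aid requested above could look is a per-stage timer that compares wall-clock time with CPU time: a stage whose wall time far exceeds its CPU time is mostly waiting (for example on file I/O) rather than computing. The following is only a minimal sketch under stated assumptions, not part of PEPPA. The names `stage_timer` and `looks_io_bound` and the 0.2 threshold are hypothetical, and `time.process_time()` only counts CPU used by the current process, so work done in child processes would not be reflected.

```python
import time
from contextlib import contextmanager


@contextmanager
def stage_timer(name, report):
    """Record wall-clock and CPU time for one named stage into `report`.

    Hypothetical helper for illustration; not a PEPPA function.
    """
    wall_start = time.perf_counter()
    cpu_start = time.process_time()  # CPU time of this process only
    try:
        yield
    finally:
        wall = time.perf_counter() - wall_start
        cpu = time.process_time() - cpu_start
        report[name] = {
            "wall_s": wall,
            "cpu_s": cpu,
            # Fraction of the elapsed time spent on the CPU.
            "cpu_fraction": cpu / wall if wall > 0 else 0.0,
        }


def looks_io_bound(entry, threshold=0.2):
    """Heuristic: a low CPU fraction means the stage was mostly waiting."""
    return entry["cpu_fraction"] < threshold


if __name__ == "__main__":
    report = {}

    # Stand-in for a stage that waits (sleeping mimics blocking on I/O).
    with stage_timer("waiting_stage", report):
        time.sleep(0.2)

    # Stand-in for a stage that actually computes.
    with stage_timer("compute_stage", report):
        sum(i * i for i in range(500_000))

    for name, entry in report.items():
        flag = "mostly waiting" if looks_io_bound(entry) else "CPU-bound"
        print(f"{name}: wall={entry['wall_s']:.2f}s "
              f"cpu={entry['cpu_s']:.2f}s ({flag})")
```

Wrapping each pipeline step in such a timer and printing the report as the run progresses would show which step stalls without waiting for the whole run to finish.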