
Continuously increasing RAM demand

Open thackl opened this issue 4 years ago • 14 comments

Hi Katelyn,

thanks for PHANOTATE, great tool!

I'm using it for larger sets of viral contigs, and noticed that it does not seem to free RAM between contigs. On my current set of 100+ sequences, it needs >5 GB of RAM towards the end. Obviously, I can split the input file to get around that, but I suspect it wouldn't be too difficult to release the memory after each contig has been processed. Might make sense as an improvement for future versions.

Cheers Thomas

thackl avatar Feb 18 '21 09:02 thackl

Hm, that is strange. In the current version I reset everything between contigs (which I probably shouldn't do since contigs in the same file should probably be treated as the same genome by default; and to treat them as separate genomes use a --meta flag).

I wonder if Python's automatic garbage collector isn't working correctly between contigs to clear old memory. Or perhaps I have a memory leak somewhere.
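
For anyone wanting to tell these two cases apart locally, here is a minimal sketch using the standard-library `gc` and `tracemalloc` modules. It forces a collection after every contig and records how much traced memory is still held. `process_contig` and `_leaky_cache` are hypothetical stand-ins for the per-contig work, not PHANOTATE's actual code. Also note that `tracemalloc` only sees allocations made through Python's allocator, so memory held inside a compiled extension may not show up here.

```python
import gc
import tracemalloc

_leaky_cache = []  # hypothetical: simulates state that is never cleared


def process_contig(seq, leak=False):
    """Hypothetical stand-in for per-contig work (not PHANOTATE's API)."""
    kmers = [seq[i:i + 3] for i in range(len(seq) - 2)]
    if leak:
        _leaky_cache.append(kmers)  # reference survives the call
    return len(kmers)


def memory_after_each(contigs, leak=False):
    """Return traced memory (bytes) after each contig, post-collection."""
    tracemalloc.start()
    readings = []
    for seq in contigs:
        process_contig(seq, leak=leak)
        gc.collect()  # rule out "the collector just hasn't run yet"
        current, _peak = tracemalloc.get_traced_memory()
        readings.append(current)
    tracemalloc.stop()
    return readings
```

If the readings keep climbing even though `gc.collect()` runs every time, something is still holding references, which points to a leak rather than to a lazy collector.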

Thanks for the info, I will try to track down the cause.

If you have the time, can you check whether the issue happens with my newer code? I have been developing a 2+ version. The changes in 2.0 alpha are mostly only code optimizations (it should run in half the time), but I did have to make some changes to the way ORFs and gaps/overlaps are scored, and I haven't dialed them in yet, so output might be slightly different. To run the develop branch you can manually install via:

git clone https://github.com/deprekate/PHANOTATE.git
cd PHANOTATE
git checkout develop
python3 setup.py install

phanotate.py tests/NC_001416.1.fasta

deprekate avatar Feb 19 '21 20:02 deprekate

Interesting thought regarding the --meta flag. Makes sense to treat contigs from the same genome together, since shorter ones especially could profit from any "training" performed on larger ones. Not sure though if people usually split their phage genomes (or even bacterial bins) into separate files after binning. But if properly documented, they definitely could.

I tried to run the new code. It worked on the test data, but then failed on my contigs with the following error:

phanotate.py  -f tabular pt-loci-r1.fna > pt-loci-r1-phanotate-v2.0.tsv
Traceback (most recent call last):
  File "/home/thackl/software//PHANOTATE/PHANOTATE/phanotate.py", line 47, in <module>
    contig_features.add_feature( trna )
  File "/home/thackl/.local/lib/python3.8/site-packages/phanotate-2.0-py3.8-linux-x86_64.egg/phanotate/features.py", line 100, in add_feature
    self.feature_at[ feature.as_edge()[:2] ] = feature
AttributeError: 'tRNA' object has no attribute 'as_edge'

thackl avatar Feb 19 '21 20:02 thackl

Yeah, the problem with making it the default is that most people won't read the full docs, and so will run MAG files through as the same genome, which will skew the results.

The tRNA error should be fixed, I forgot to push a change I made to the code.

deprekate avatar Feb 19 '21 20:02 deprekate

Yeah, users ;). You could enforce a choice with a required mode argument, something like phanotate meta [opts] infile / phanotate single [opts] infile, with the former going contig-wise and the latter expecting a single genome with 1+ contigs...
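
A rough sketch of what such a required mode argument could look like with the standard-library `argparse`. The `meta` / `single` subcommand names come from the suggestion above; the parser layout, help texts, and the `outfmt` destination name are assumptions for illustration and not PHANOTATE's actual command line.

```python
import argparse


def build_parser():
    """Hypothetical CLI layout; not PHANOTATE's real argument parser."""
    parser = argparse.ArgumentParser(prog="phanotate")
    # required=True makes argparse reject a call that names no mode
    modes = parser.add_subparsers(dest="mode", required=True)
    for name, text in (
        ("meta", "treat every contig as a separate genome"),
        ("single", "treat all contigs as one genome"),
    ):
        sub = modes.add_parser(name, help=text)
        sub.add_argument("infile")
        sub.add_argument("-f", dest="outfmt", default="tabular")
    return parser
```

With this layout, `phanotate infile.fna` exits with a usage error instead of silently picking a default, so the user has to decide how the contigs are treated.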

Caught another error...

phanotate.py  -f tabular pt-loci-r1.fna > pt-loci-r1-phanotate-v2.0.tsv
Traceback (most recent call last):
  File "/home/thackl/software//PHANOTATE/phanotate.py", line 84, in <module>
    shortest_path = fz.get_path(source= contig_features.source_node(), target= contig_features.target_node())[1:-1]
ValueError: Graph contains negative weight cycle

thackl avatar Feb 19 '21 20:02 thackl

Ah, that error is caused by the weights not being dialed in yet. I will see if I can get the weights sorted out, and/or track down the cause of the RAM usage. Thanks so much for your help and feedback.

deprekate avatar Feb 19 '21 21:02 deprekate

Sure. Happy to help. Debugging my own code as we speak. Let me know if you want me to give it another try

thackl avatar Feb 19 '21 21:02 thackl

The negative weight cycle arises when the cost of going backwards via an overlap is too small relative to a really good gene (in this case most likely a tRNA). During the path-solving step, Bellman-Ford goes through that gene, takes a backwards overlap, then goes through the same gene again, indefinitely.
I reduced the tRNA weight from -20 to -1 and pushed the change to GitHub, which should prevent the infinite loop for now.
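
To make the mechanism concrete, here is a small self-contained Bellman-Ford sketch on a toy graph. The -20 and -1 weights are the ones mentioned above; the node names, the overlap penalty of 5, and the edge costs of 1 are made up for illustration and are not PHANOTATE's real scoring.

```python
def bellman_ford(edges, source):
    """Return shortest distances from source; raise on a negative cycle."""
    nodes = {n for u, v, _ in edges for n in (u, v)}
    dist = {n: float("inf") for n in nodes}
    dist[source] = 0.0
    for _ in range(len(nodes) - 1):
        for u, v, w in edges:
            if dist[u] + w < dist[v]:
                dist[v] = dist[u] + w
    # one extra pass: any further improvement means a negative cycle
    for u, v, w in edges:
        if dist[u] + w < dist[v]:
            raise ValueError("Graph contains negative weight cycle")
    return dist


def toy_graph(trna_weight, overlap_penalty=5.0):
    """Hypothetical graph: one feature plus a backwards overlap edge."""
    return [
        ("source", "start", 1.0),
        ("start", "stop", trna_weight),      # the feature itself
        ("stop", "start", overlap_penalty),  # backwards overlap edge
        ("stop", "target", 1.0),
    ]
```

With a weight of -20 the loop start -> stop -> start sums to -15, so every lap lowers the path cost and the check raises. With -1 the loop sums to +4 and the shortest path is well defined.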

deprekate avatar Feb 19 '21 23:02 deprekate

Hmm, I still get the negative weight cycle error. I'm currently trying to find out which sequence exactly is causing it. If I find it and it would help, I can send it to you.

thackl avatar Feb 20 '21 18:02 thackl

OK, now it is getting bizarre. When I split the file into one file per seq, and run phanotate on each - everything runs through fine. But if I run phanotate on the file with all sequences, I get the error after a few sequences... Any ideas?
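
For reference, splitting a multi-FASTA file into one file per sequence can be sketched in plain Python as follows. The function name and the `seq_0001.fasta` naming scheme are my own choices for illustration; any FASTA splitter would do the same job.

```python
from pathlib import Path


def split_fasta(path, outdir):
    """Write each record of a multi-FASTA file to its own file; return the paths."""
    outdir = Path(outdir)
    outdir.mkdir(parents=True, exist_ok=True)
    written, handle = [], None
    with open(path) as fasta:
        for line in fasta:
            if line.startswith(">"):
                if handle:
                    handle.close()
                # number the files so unusual header characters cannot break names
                target = outdir / f"seq_{len(written) + 1:04d}.fasta"
                handle = open(target, "w")
                written.append(target)
            if handle:
                handle.write(line)
    if handle:
        handle.close()
    return written
```

Each resulting file can then be passed to phanotate.py separately, which is the setup in which everything ran through fine.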

If it would help, I can share the data, assuming you would handle it confidentially

thackl avatar Feb 20 '21 20:02 thackl

Ah, one of the improvements I made with version 2 is moving the step that creates the ORF-to-ORF connections to Cpython. It looks like I forgot to empty the data structure between contigs. I will have to get that fixed.
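
A toy illustration of why a missing reset only shows up on multi-sequence files. `EdgeStore` is a hypothetical stand-in for a long-lived container that survives across contigs; it is not the actual data structure in PHANOTATE, and the real scoring is left out.

```python
class EdgeStore:
    """Hypothetical stand-in for a long-lived edge container in a compiled module."""

    def __init__(self):
        self.edges = []

    def add_contig(self, orfs):
        # connect each ORF to the next one; real scoring is omitted
        for left, right in zip(orfs, orfs[1:]):
            self.edges.append((left, right))

    def clear(self):
        self.edges.clear()


def edges_per_contig(contigs, reset):
    """Return how many edges are visible while each contig is being solved."""
    store, counts = EdgeStore(), []
    for orfs in contigs:
        if reset:
            store.clear()  # the step that is easy to forget
        store.add_contig(orfs)
        counts.append(len(store.edges))
    return counts
```

Without the reset, the second contig is solved on a graph that still contains the first contig's edges. A file with a single sequence never hits this, which would be consistent with the per-file runs working and the combined run failing.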

deprekate avatar Feb 22 '21 20:02 deprekate

Hi, I am also running the dev branch because the master doesn't work on some of my phage sequences. I also get the error: ValueError: Graph contains negative weight cycle. Is there anything that can be done?

ilyavs avatar Dec 25 '22 20:12 ilyavs

The "negative weight cycle" error is the reason why I have not pushed this version 2 to main. As I mentioned, in this version I had to simplify the weighting of the ORFs but kept the tRNAs at the same weight (-20). If I drop the weight to -1 it should run fine, but I have to tune it properly, which I haven't done. One reason is that I don't have a contig that generates the error.
Are you able to share the contig that causes the error?

deprekate avatar Dec 28 '22 22:12 deprekate

Unfortunately I can't share the contigs. I am pretty sure there are public sequences that would have the error, I just don't have the capacity to analyze them until I hit the error.

ilyavs avatar Jan 02 '23 12:01 ilyavs

> Hi, I am also running the dev branch because the master doesn't work on some of my phage sequences. I also get the error: ValueError: Graph contains negative weight cycle. Is there anything that can be done?

I am revisiting this issue. Can you clarify whether you got the 'negative cycle' error with the develop or the main branch? If it was only the dev branch, what was the error that caused the main branch not to work on some phages?

I also set the weight of tRNAs to 0, so they are neither rewarded nor penalized, with the idea that I will add ALL of them back in during post-processing. Can you test out the develop branch on the contig that gave you the error?

git clone https://github.com/deprekate/PHANOTATE.git
cd PHANOTATE
git checkout develop
pip install ../PHANOTATE

deprekate avatar Apr 04 '23 00:04 deprekate