add filter for short contigs?

Open nick-youngblut opened this issue 4 years ago • 2 comments

graphbin2 doesn't seem to scale well to large assemblies with a large number of contigs. Given that a big fraction of the contigs generated by metaSPAdes are usually small, and that SPAdes has no contig length cutoff, would it be possible to add a contig length cutoff to graphbin2 (e.g., skip all contigs <1 kb) in order to speed up the algorithm? Or does the algorithm require all contigs in order to function properly?
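To make the idea concrete, here is a minimal sketch of a length cutoff applied to contig records before binning. It is not GraphBin2 code: the function names `read_fasta` and `filter_short_contigs`, the `min_length` parameter, and the example headers are all hypothetical, and the sketch does not address whether the assembly graph and the binning step tolerate the removed contigs.

```python
from typing import Iterable, Iterator, List, Tuple


def read_fasta(lines: Iterable[str]) -> Iterator[Tuple[str, str]]:
    """Yield (header, sequence) pairs from FASTA-formatted lines."""
    header = None
    chunks: List[str] = []
    for line in lines:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            # A new header closes the previous record, if there was one.
            if header is not None:
                yield header, "".join(chunks)
            header = line[1:]
            chunks = []
        else:
            chunks.append(line)
    if header is not None:
        yield header, "".join(chunks)


def filter_short_contigs(records, min_length=1000):
    """Keep only contigs whose sequence has at least min_length bases."""
    return [(h, s) for h, s in records if len(s) >= min_length]


# Tiny in-memory example (hypothetical headers); a real run would pass an
# open file handle for the contigs FASTA instead of a list of strings.
example = [
    ">NODE_1_length_12_cov_5.0",
    "ACGTACGTACGT",
    ">NODE_2_length_4_cov_2.0",
    "ACGT",
]
kept = filter_short_contigs(read_fasta(example), min_length=10)
print([h for h, _ in kept])  # ['NODE_1_length_12_cov_5.0']
```

Filtering the FASTA alone is probably not enough in practice, since the graph and the initial binning result would still reference the dropped contigs and would need the same filter applied.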

nick-youngblut, Dec 16 '20 10:12

I believe that I created a method to pre-filter out all short contigs and speed up graphbin2. In order to get the code running effectively, I had to make huge changes, so a PR doesn't make much sense. Some things that I changed in the code that I found to be beneficial for reading & running graphbin2:

  • Used an argparse command => subcommand structure for calling graphbin2_SPAdes.py (or graphbin2_SGA.py) instead of using os.system to call the code. This change greatly helps with debugging, since exceptions raised in a script run through os.system are not propagated to the caller
  • Used the logging package for status output instead of print(), given that at least on some machines, the tqdm stderr output will be written prior to the print stdout, which causes confusion when reading the log
  • Used "my string {}".format(integer) method for formatting strings
  • When possible, caught specific exceptions (e.g., except ValueError) instead of using a general handler (i.e., a bare except)
  • Generally tried to format the code according to PEP 8
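A rough sketch of how the changes listed above can fit together is shown here. It is not the modified graphbin2 code: the program name `binning_cli`, the `--contigs` and `--min-length` options, and the `run_spades` / `run_sga` functions are hypothetical placeholders. Only the argparse and logging calls are standard-library APIs.

```python
import argparse
import logging
import sys

logger = logging.getLogger("binning_cli")  # hypothetical logger name


def run_spades(args):
    # Placeholder for the SPAdes-specific workflow (hypothetical).
    logger.info("SPAdes workflow, contigs={}".format(args.contigs))
    return 0


def run_sga(args):
    # Placeholder for the SGA-specific workflow (hypothetical).
    logger.info("SGA workflow, contigs={}".format(args.contigs))
    return 0


def build_parser():
    parser = argparse.ArgumentParser(prog="binning_cli")
    subparsers = parser.add_subparsers(dest="assembler", required=True)
    for name, func in (("spades", run_spades), ("sga", run_sga)):
        sub = subparsers.add_parser(name)
        sub.add_argument("--contigs", required=True)
        sub.add_argument("--min-length", default="0")
        # Each subcommand calls a Python function directly, so exceptions
        # surface with a full traceback instead of vanishing in os.system.
        sub.set_defaults(func=func)
    return parser


def main(argv=None):
    # Logging goes to stderr, the same stream tqdm writes to by default,
    # so status lines and progress bars appear in a consistent order.
    logging.basicConfig(stream=sys.stderr, level=logging.INFO,
                        format="%(levelname)s %(message)s")
    args = build_parser().parse_args(argv)
    try:
        args.min_length = int(args.min_length)
    except ValueError:  # specific exception rather than a bare except
        logger.error("--min-length must be an integer, got {!r}".format(
            args.min_length))
        return 1
    return args.func(args)


exit_code = main(["spades", "--contigs", "contigs.fasta",
                  "--min-length", "1000"])
```

In a real script, `type=int` on the argparse option would do the integer check. The explicit `try`/`except ValueError` is kept only to illustrate catching a specific exception.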

nick-youngblut, Dec 28 '20 12:12

Hello @nick-youngblut,

Thank you for the question. GraphBin2 was originally designed to recover as many short contigs as possible. Hence, we did not introduce a filter for short contigs. However, I understand that this can be a scaling issue with very large datasets. I'm glad you were able to modify the code as you needed. Thank you for sharing the details of the things you changed. I will add a fix providing the option to filter out contigs in the future.

Thank you!

Vini2, Jan 27 '21 23:01