Experimental compatibility of COBRA with PenguiN viral assemblies

Open: Enkabloza opened this issue 7 months ago • 1 comment

Dear COBRA developers,

Thank you for your great tool. I am working with viral metagenomic assemblies generated using PenguiN, a recent tool developed by the Soeding Lab for extracting informative viral contigs from short-read metagenomic data.

PenguiN generates high-quality contigs with viral signatures, but due to its non-standard assembly approach, the contigs are not guaranteed to follow the same fragmentation patterns as those from MEGAHIT, IDBA_UD, or metaSPAdes.

However, I believe PenguiN contigs could benefit from the reassembly and circularization logic provided by COBRA, especially since PenguiN assemblies are often conservative and result in fragmented but accurate sequences.

Would you consider allowing COBRA to process PenguiN output with a custom assembler flag (e.g., -a penguin) that:

- Accepts user-defined --mink and --maxk
- Skips assembler-specific assumptions on overlap length
- Issues a warning, but proceeds
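
To make the request concrete, here is a rough sketch of the behaviour I have in mind. It is not COBRA's actual code: the parser, the `parse_args` helper and the `EXPERIMENTAL_ASSEMBLERS` list are hypothetical names of my own, and only the option names (-a, --mink, --maxk) come from this request. How the overlap-length assumptions would actually be skipped depends on COBRA's internals, so the sketch only covers argument handling and the warning.

```python
import argparse
import warnings

# Hypothetical stand-in for COBRA's argument handling (illustration only).
TESTED_ASSEMBLERS = ["idba", "megahit", "metaspades"]
EXPERIMENTAL_ASSEMBLERS = ["penguin"]  # assumption: not validated by the developers


def build_parser():
    parser = argparse.ArgumentParser(prog="cobra-sketch")
    parser.add_argument(
        "-a", "--assembler", required=True,
        choices=TESTED_ASSEMBLERS + EXPERIMENTAL_ASSEMBLERS,
    )
    # User-defined k-mer bounds, as requested.
    parser.add_argument("--mink", type=int, required=True)
    parser.add_argument("--maxk", type=int, required=True)
    return parser


def parse_args(argv):
    parser = build_parser()
    args = parser.parse_args(argv)
    if args.mink <= 0 or args.mink > args.maxk:
        parser.error("--mink must be positive and must not exceed --maxk")
    if args.assembler in EXPERIMENTAL_ASSEMBLERS:
        # Warn, but proceed: the run continues with the user-supplied k values.
        warnings.warn(
            f"assembler '{args.assembler}' is experimental; "
            "results have not been benchmarked",
            UserWarning,
        )
    return args
```

Calling `parse_args(["-a", "penguin", "--mink", "21", "--maxk", "141"])` would return the parsed options and emit a single warning, while the tested assemblers would pass through silently. The k values here are placeholders, not recommended settings.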

This would allow exploratory testing and benchmarking by users.

I would be happy to test this feature with viral datasets and provide example outputs for benchmarking.

Best regards

May 06 '25 05:05 Enkabloza

Hi,

Thank you for your interest in COBRA and your question.

I had a quick look at the PenguiN paper and found that it is also based on DBG. However, I cannot be sure that the output contigs follow the same rules I tested for the other assemblers unless a comprehensive analysis is done. If you want to try it, you could directly add "penguin" to this line in the code:

choices=["idba", "megahit", "metaspades"]
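
For illustration, a minimal sketch of what that edit amounts to, assuming the option is defined through Python's argparse. The parser below is a hypothetical stand-in rather than the actual COBRA source. With argparse, any value missing from `choices` is rejected at parse time, which is why the list has to be extended before `-a penguin` is accepted.

```python
import argparse

# Hypothetical stand-in parser; the real definition in COBRA may differ.
parser = argparse.ArgumentParser(prog="cobra-sketch")
parser.add_argument(
    "-a", "--assembler",
    choices=["idba", "megahit", "metaspades", "penguin"],  # "penguin" added
)

args = parser.parse_args(["-a", "penguin"])
print(args.assembler)  # penguin
```

This only lets the value through the argument check. Any downstream logic that branches on the assembler name would still need to be checked separately.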

Please let me know how it works. Thank you.

Best, LINXING

May 06 '25 07:05 linxingchen