
produce the --replicons input content based on the flye assembly_info.txt

Open splaisan opened this issue 8 months ago • 3 comments

The tool works great using Docker, but the sequence naming is not very nice when starting from a Flye assembly, where contigs are named randomly.

I found the use of --replicons great in that regard, but it requires creating a TSV file up front, which is not easy when looping through many assemblies in an integrated pipeline (dozens of assemblies in a row).

Would it be possible to create the --replicons input file internally from the Flye assembly_info.txt file, which contains the columns #seq_name, length, cov. and circ., taking the largest contig as the chromosome and the others as plasmids?
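Until something like this exists in Bakta, the idea can be sketched as a small external helper. This is only an illustration under assumptions, not Bakta code: it assumes assembly_info.txt is tab-separated with `Y`/`N` in the `circ.` column, and it assumes the --replicons table has five columns (original sequence id, new sequence id, type, topology, name) with `-` meaning "use the default"; please check both against the Flye and Bakta documentation. The function names are hypothetical.

```python
import csv
import io


def replicons_from_flye_info(info_text):
    """Build replicon table rows from the text of Flye's assembly_info.txt.

    Assumed input columns: #seq_name, length, cov., circ., ... (tab-separated).
    Assumed output columns: original id, new id, type, topology, name.
    """
    records = []
    for fields in csv.reader(io.StringIO(info_text), delimiter="\t"):
        # Skip blank lines and the '#seq_name ...' header line.
        if not fields or fields[0].startswith("#"):
            continue
        records.append((fields[0], int(fields[1]), fields[3] == "Y"))
    if not records:
        return []

    # Heuristic from this issue: the largest contig is the chromosome,
    # every other contig is treated as a plasmid.
    largest = max(records, key=lambda rec: rec[1])[0]
    rows = []
    for name, _length, is_circular in records:
        replicon_type = "chromosome" if name == largest else "plasmid"
        topology = "circular" if is_circular else "linear"
        # '-' is assumed to leave the new id and the name at their defaults.
        rows.append([name, "-", replicon_type, topology, "-"])
    return rows


def write_replicons(rows, path):
    """Write the rows as a TSV file that could be passed to --replicons."""
    with open(path, "w", newline="") as handle:
        csv.writer(handle, delimiter="\t").writerows(rows)
```

In a pipeline loop, one would read each assembly's assembly_info.txt, call `replicons_from_flye_info` on its content, and write the result next to the assembly before running Bakta. Note that the "everything else is a plasmid" rule also labels small linear contigs as plasmids, which may not be what you want for fragmented assemblies.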

That would create GenBank files which are closer to submission quality.

My workaround for now is to re-run Bakta after all genomes are assembled, once I have built the --replicons input file by hand.

thanks

splaisan avatar Oct 31 '23 14:10 splaisan

Hi @splaisan, thanks for reaching out and asking. Indeed, it would be very nice if Bakta were able to use circularity information from Flye directly. Actually, this is already possible for Unicycler assemblies, for which Bakta extracts circularity information from the Fasta headers. So, whenever an assembled sequence has a circular=true tag in its Fasta header description, Bakta will use that information in the annotation process and output files.

I totally see your point here and I'd very much like to address this. However, I'm a bit reluctant to address it with Flye-specific parameters, as there are other assemblers too, and supporting each of them this way would soon clutter Bakta's usage. I guess the better approach would be to ask the Flye developers to put the required information into the Fasta header, so that Bakta can use the approach that is already implemented. In addition, this would have the nice bonus that circularity information on sequences produced by Flye would be stored along with the sequences themselves, instead of in additional txt files w/o a standardized format. To this end, I've opened an issue in the Flye repo: https://github.com/fenderglass/Flye/issues/647 Maybe you would like to endorse this?

oschwengers avatar Nov 10 '23 15:11 oschwengers

Can you please give an example of a fasta header that would work? It is really easy to add a script in between to adapt the Flye headers and make them compatible. When I have this done, I will share it (bash / bioawk most likely) on the issue page. Thanks a lot for your info.

splaisan avatar Nov 10 '23 16:11 splaisan

Sure. This is a recent example from a Unicycler assembly: >1 length=4635742 depth=1.00x circular=true

In this case, Bakta is able to extract this information and mark this sequence as complete and circular.
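For anyone who wants to bridge the gap until Flye writes this tag itself, here is a minimal sketch of the in-between step discussed above, in Python rather than bash / bioawk. It is an illustration under assumptions, not an official script: it assumes Flye's assembly_info.txt is tab-separated with `Y` in the fourth (`circ.`) column for circular contigs, and that the first word of each Fasta header matches the `#seq_name` column. The function names are hypothetical.

```python
def circular_contigs(info_text):
    """Return the names of contigs flagged circular in assembly_info.txt text."""
    names = set()
    for line in info_text.splitlines():
        # Skip blank lines and the '#seq_name ...' header line.
        if not line.strip() or line.startswith("#"):
            continue
        fields = line.split("\t")
        # Assumption: column 4 ('circ.') holds 'Y' for circular contigs.
        if len(fields) > 3 and fields[3] == "Y":
            names.add(fields[0])
    return names


def tag_fasta_headers(fasta_text, info_text):
    """Append 'circular=true' to the headers of circular contigs."""
    circular = circular_contigs(info_text)
    out = []
    for line in fasta_text.splitlines():
        if line.startswith(">"):
            parts = line[1:].split()
            seq_id = parts[0] if parts else ""
            # Do not tag a header twice if the script is re-run.
            if seq_id in circular and "circular=true" not in line:
                line = f"{line} circular=true"
        out.append(line)
    return "\n".join(out) + "\n"
```

Reading assembly.fasta and assembly_info.txt into strings, passing them to `tag_fasta_headers`, and writing the result to a new Fasta file should yield headers that carry the same circular=true tag as the Unicycler example. Linear contigs are left untouched.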

oschwengers avatar Nov 10 '23 17:11 oschwengers