Extending Assembly from Seed

Open nsmt89 opened this issue 5 years ago • 22 comments

Hi, I am trying to extend my assembly from my contig file, but the assembly still failed to produce a complete genome. Do you have any suggestions on any parameters or methods that I could change to improve my assembly? Attached is my log file. Thanks. log_Cryptocoryne_nurii1.txt

nsmt89 avatar Jan 30 '20 06:01 nsmt89

Did it extend the seed a bit, or not at all? It's best to run it again with the extended log set to 1 and send me that file.

ndierckx avatar Jan 30 '20 10:01 ndierckx

Thanks for the reply. I used a different contig file as the seed and it managed to produce a circularized assembly. However, I suspect the genome is still not complete. I have not found any published data for my genus, but published chloroplast genomes of related taxa are around 150k to 170k bp, while mine is just 128938 bp. How can I check the data and the orientation of the inverted repeat, and improve my assembly? Is it possible for me to get a complete plastome? log_extended_Cryptocoryne_nurii4.txt log_Cryptocoryne_nurii4.txt

nsmt89 avatar Jan 30 '20 15:01 nsmt89

Maybe the inverted repeat is not inverted, which is possible in some species, and then it circularizes early. Why don't you try a higher minimum for the genome range, like 150000, to see what you get...

ndierckx avatar Jan 30 '20 16:01 ndierckx
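
One rough way to check whether a repeat in an assembled sequence is inverted or direct is to compare its k-mers against their reverse complements. The Python sketch below is only an illustration under stated assumptions, not NOVOPlasty's method: the function names and the default `k` are hypothetical choices made for this example, and the input is assumed to be a plain nucleotide string already loaded from the assembly FASTA.

```python
from collections import defaultdict

# Translation table for complementing bases (other characters, e.g. N, are left as-is).
_COMPLEMENT = str.maketrans("ACGTacgt", "TGCAtgca")


def reverse_complement(seq: str) -> str:
    """Return the reverse complement of a nucleotide string."""
    return seq.translate(_COMPLEMENT)[::-1]


def repeat_orientation(seq: str, k: int = 31) -> dict:
    """Count k-mer positions that take part in direct or inverted repeats.

    'direct'   = positions of k-mers that occur more than once on the same strand.
    'inverted' = positions of k-mers whose reverse complement also occurs.
    k must be odd so that no k-mer equals its own reverse complement.
    """
    if k % 2 == 0:
        raise ValueError("k must be odd")
    seq = seq.upper()
    positions = defaultdict(list)
    for i in range(len(seq) - k + 1):
        positions[seq[i:i + k]].append(i)
    direct = sum(len(p) for p in positions.values() if len(p) > 1)
    inverted = sum(
        len(p) for kmer, p in positions.items()
        if reverse_complement(kmer) in positions
    )
    return {"direct": direct, "inverted": inverted}
```

On a typical plastome with a normal inverted repeat, the `inverted` count would be expected to be large and the `direct` count small; a large `direct` count would hint at a repeat that is not inverted. Short accidental matches can inflate both numbers, so a larger `k` gives a cleaner signal.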

Thank you for your suggestion. The output produced 4 smaller contigs after I changed the genome range. What do you think is the reason? log_Cryptocoryne_nurii6.txt log_extended_Cryptocoryne_nurii6.txt

nsmt89 avatar Jan 30 '20 23:01 nsmt89

There is a bug that doesn't always output all the contigs when the option "extend seed directly" is used. Could you try this version with the same config file? (I forgot if I fixed the problem; if not, I will do it today.) NOVOPlasty3.8.2.zip

ndierckx avatar Jan 31 '20 10:01 ndierckx

So a big contig of 160000 bp was assembled, but it just wasn't written to the output.

ndierckx avatar Jan 31 '20 10:01 ndierckx

And where do you get that large contig from? Is it from a different assembly software?

ndierckx avatar Jan 31 '20 10:01 ndierckx

Yes. I actually used the largest contig from a Fast-Plast assembly. If I do not invoke the extend seed function, the output is only an uncircularized genome. I will try the version you provided ASAP.

nsmt89 avatar Jan 31 '20 23:01 nsmt89

I just tried the version you gave me and it still produces the same result.

nsmt89 avatar Feb 03 '20 05:02 nsmt89

OK, I still have to fix it; I will have a look. Have you tried to run with the previous assembly, but just with a short seed (like the RUBP seed on this GitHub)?

ndierckx avatar Feb 03 '20 17:02 ndierckx

Yes, but it only produces a small contig. log_Cryptocoryne_nurii15.txt log_extended_Cryptocoryne_nurii15.txt

nsmt89 avatar Feb 03 '20 23:02 nsmt89

But then you need to switch off "extend seed directly"; this option should only be used to extend an existing assembly. So could you try again without that option?

ndierckx avatar Feb 04 '20 11:02 ndierckx

I think this version should output the larger contig (but I am not sure) NOVOPlasty3.8.2.zip

ndierckx avatar Feb 04 '20 13:02 ndierckx

> But then you need to switch off extend seed directly, this option should only be used to extend an existing assembly... So could you try again without that option

I did. The output is a single contig of 12564 bp.

nsmt89 avatar Feb 05 '20 01:02 nsmt89

I tried your latest version of NOVOPlasty and it managed to output a contig of ~160 kbp. How can I merge those contigs or see if I can get them circularized? Merged_contigs_Cryptocoryne_nurii17.txt log_Cryptocoryne_nurii17.txt log_extended_Cryptocoryne_nurii17.txt

nsmt89 avatar Feb 05 '20 05:02 nsmt89

It couldn't circularize automatically, and it seems you have a complex genome, so I am not sure that will be possible. Could you send the extended log of the 12564 bp run? I am just curious why it is that short. What kind of data are you using: WGS, capture or RNA-seq?

ndierckx avatar Feb 05 '20 10:02 ndierckx
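
As a quick manual check for circularization, one can test whether the end of a contig repeats its own start, which is what a circular molecule assembled past its origin would show. The following Python sketch is a simplified illustration and not how NOVOPlasty decides on circularization; the function name `end_overlap` and the `min_overlap` and `max_overlap` defaults are assumptions for this example, and only exact matches are considered.

```python
def end_overlap(seq: str, min_overlap: int = 20, max_overlap: int = 2000) -> int:
    """Return the length of the longest exact suffix/prefix overlap of seq.

    Only overlaps between min_overlap and max_overlap bases (and at most
    half the sequence length) are considered. Returns 0 if none is found.
    """
    seq = seq.upper()
    longest = min(max_overlap, len(seq) // 2)
    # Try the longest candidate first so the first hit is the best one.
    for length in range(longest, min_overlap - 1, -1):
        if seq[-length:] == seq[:length]:
            return length
    return 0
```

A non-zero result suggests the contig could be trimmed by that many bases at one end and treated as circular; a result of 0 says only that no exact overlap exists, since sequencing errors or low coverage at the ends can hide a real one.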

My data is WGS. Attached is my extended log log_extended_Cryptocoryne_nurii16.txt log_extended_Cryptocoryne_nurii6.txt

nsmt89 avatar Feb 06 '20 01:02 nsmt89

Sorry, I didn't have time to look at it earlier, but it seems that the 12 kb sequence doesn't occur in the 120 kb assembly. The 12 kb region can't be extended further because it is flanked by AT-rich regions where the coverage drops to 0, so completely circularising your cp genome does not seem possible. But I would keep that 12 kb sequence, because it is part of the chloroplast genome that was missing from the assembly. Could you run it with this seed and send me the extended log:

AAGTATCGTGAATTTCTTCATGCTCGTTCCAAGTTCGAAGTACCATTTGTACAAATAAGAATCCCTTTCCTTACATGATTTCTTCTTCATATAGATAGATATAGGATCTATGGGGCAATTACTTAGAAGTACATTTTGTGCAACAGCCCTTCCTATCTGATAGAAAAGGATCCCATGATCCTGAACCGATCTGACCCGGGATC

ndierckx avatar Feb 10 '20 15:02 ndierckx
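
To see where AT-rich stretches lie in a contig, and so where coverage might be expected to drop, the AT fraction can be computed in sliding windows. This Python sketch is a minimal illustration; the function name and the window and step sizes are hypothetical, and it assumes the contig has already been read into a plain string.

```python
def at_content_windows(seq: str, window: int = 100, step: int = 50) -> list:
    """Return (start, AT fraction) for each full window along seq."""
    seq = seq.upper()
    results = []
    for start in range(0, len(seq) - window + 1, step):
        chunk = seq[start:start + window]
        at_fraction = (chunk.count("A") + chunk.count("T")) / window
        results.append((start, at_fraction))
    return results


# Small usage example on a toy sequence.
for start, frac in at_content_windows("AAAATTTTGGGGCCCC", window=8, step=4):
    print(start, frac)
```

Windows with a very high AT fraction near the ends of a contig would be consistent with the extension stopping there, although the extended log remains the direct evidence for the coverage drop.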

Thank you for your time. I tried running it with the seed you provided. Attached is my extended log: log_extended_Cryptocoryne_nurii25.txt

nsmt89 avatar Feb 11 '20 01:02 nsmt89

Hi, sorry, could you run it again without giving a reference? A reference can help, but it reverses the reads to assemble them in the same direction as the reference, so I would just like to see the assembly in the other direction. And is the reference closely related? But it seems this assembly had the same problem of low coverage in another AT-rich region.

ndierckx avatar Feb 11 '20 10:02 ndierckx

log_extended_Cryptocoryne_nurii18.txt Attached is the extended log. The reference is not really close, but it is the closest that is currently available; the same Family but different Order in taxa. I suspected the low coverage problem, so I am trying to combine different methods or manually explore different parameters to get reliable results. I have tried assembling it with MITObim and it managed to get one single contig (~167 kbp), but the orientation seemed wrong when I tried to annotate it, so I need to correct the assembly first before annotating it.

nsmt89 avatar Feb 12 '20 07:02 nsmt89

If you use MITObim de novo it should be OK, but don't use the reference-based assembly. I checked this mode a few times and it generally just copies the reference sequence into the assembly, so it will give a false result. It's very misleading because at first sight it gives a good result. Does Fast-Plast also use reference genomes? If so, a part of the sequence could be inaccurate, although I can't say for sure; NOVOPlasty could be wrong too.

ndierckx avatar Feb 12 '20 13:02 ndierckx