Next Step After having the FALCON Assembly in hand

Open bostanict opened this issue 8 years ago • 20 comments

Hi,

I now have several assemblies of our genome (670 Mb), generated with different parameters. What are the next steps to finalize the assembly? Quiver? The Resequencing protocol?

I read some posts about Arrow and Quiver but did not get a clear idea of how to proceed.

In short, we want to know which steps to take once we have the p_ctg and a_ctg files, and in what order, to get the best outcome and a good assembly for downstream analyses.

Thanks in advance~

bostanict avatar Oct 24 '16 18:10 bostanict

Quiver, or Arrow for Sequel data, would be the next step. Merge the p_ctg and a_ctg files before correction. https://github.com/PacificBiosciences/GenomicConsensus/blob/master/doc/HowTo.rst
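A minimal sketch of the merge step, assuming both contig sets are plain FASTA files. The function name and the file names are hypothetical, not part of FALCON. It concatenates the inputs into one reference and refuses to continue if two records share a name, since the aligner and consensus caller need unique contig IDs.

```python
def merge_fasta(input_paths, output_path):
    """Concatenate FASTA files into one reference, checking record names.

    Returns the number of records written. Raises ValueError if the same
    record name appears twice across the inputs.
    """
    seen = set()
    n_records = 0
    with open(output_path, "w") as out:
        for path in input_paths:
            with open(path) as handle:
                for line in handle:
                    if line.startswith(">"):
                        fields = line[1:].split()
                        name = fields[0] if fields else ""
                        if name in seen:
                            raise ValueError(f"duplicate record name: {name!r}")
                        seen.add(name)
                        n_records += 1
                    # Guarantee a trailing newline so that the next file's
                    # header never fuses with the last sequence line.
                    out.write(line if line.endswith("\n") else line + "\n")
    return n_records
```

Usage would look like `merge_fasta(["p_ctg.fa", "a_ctg.fa"], "p_and_a_ctg.fa")`, with the merged file then used as the reference for alignment and consensus calling.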

Then depending on how far you want to take the assembly you could further scaffold using a long range technology or genetic map.

rhallPB avatar Oct 24 '16 18:10 rhallPB

Hi @rhallPB ,

Thanks for the reply. As a first step, we want to get as much as we can out of the sequence and PacBio data alone. Once we are fairly sure there is not much more to gain there, we will proceed with scaffolding and misassembly detection using genetic maps. After that, we would probably repeat what we are doing now to improve the assembly quality, right?

So I want to know whether running Quiver/Arrow alone is enough. I heard somewhere that we should run Resequencing from SMRT Portal once we have the draft assembly. Is that the same as Quiver/Arrow, or is it something that can also help?

thanks

bostanict avatar Oct 24 '16 19:10 bostanict

Resequencing in SMRT Portal = Quiver / Arrow. Technically, Quiver / Arrow is the algorithm that calls consensus given an alignment; Resequencing generates the alignment (using blasr) and then calls consensus.

rhallPB avatar Oct 24 '16 19:10 rhallPB

Great. One more question: we are also planning to run Falcon_Unzip on the assembly. Do you suggest running it at the end, after the QV? Thanks

bostanict avatar Oct 24 '16 19:10 bostanict

Run Falcon unzip directly after Falcon. Unzip uses Quiver to call consensus on the final unzipped contigs using the phased reads, which removes the need for a further round of Quiver.

rhallPB avatar Oct 24 '16 19:10 rhallPB

Sorry to ask again, but to make sure I got it right: if I run Falcon unzip, there is no need to run Quiver/Arrow afterwards, since it is already performed during the unzip, right?

bostanict avatar Oct 24 '16 19:10 bostanict

Yes, provided you have a lot of coverage. If you don't have enough coverage, you may not end up with enough phased reads for each contig (phased haplotig) to generate a good Quiver consensus. The easiest way to tell is to use a conserved gene or gene set and measure frame shifts. If you still see significant frame shifts after unzipping, you could try mapping the raw reads back to all the unzipped contigs to recover more depth and run Quiver again, but you run the risk of mixing data that you went to a lot of trouble to phase.
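As a rough illustration of "measure frame shifts": the sketch below assumes you already have a gapped pairwise alignment of a known coding sequence against the matching assembly region, produced by whatever aligner you prefer. It counts gap runs whose length is not a multiple of 3, in either row. The function name is hypothetical, and this is a toy check rather than what BUSCO or any annotation tool does internally.

```python
import re

def count_frameshifts(ref_aln, asm_aln):
    """Count indel runs that break the reading frame in a pairwise alignment.

    ref_aln and asm_aln are the two aligned rows (same length, '-' for gaps).
    A gap run in either row whose length is not a multiple of 3 is counted
    as one frame shift. Gap runs at the very ends are counted too, so trim
    the alignment to the coding region first.
    """
    if len(ref_aln) != len(asm_aln):
        raise ValueError("aligned sequences must have equal length")
    shifts = 0
    for row in (ref_aln, asm_aln):
        for run in re.finditer(r"-+", row):
            if len(run.group()) % 3 != 0:
                shifts += 1
    return shifts
```

A 1 bp deletion in the assembly row counts as one frame shift, while a clean 3 bp deletion does not. Comparing this count before and after unzipping, over a fixed set of conserved genes, gives a simple view of whether the consensus has improved.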

rhallPB avatar Oct 24 '16 19:10 rhallPB

Our genome is diploid and highly heterozygous. It is 670 Mb and we have around 75-80X coverage.
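For a sense of scale, here is the arithmetic behind those numbers. It assumes the 75-80X figure is measured against the 670 Mb genome size, and that in heterozygous regions the phased reads split roughly evenly between the two haplotypes. Both are assumptions rather than anything stated in this thread, and the function names are made up for the example.

```python
def total_bases(genome_size_bp, coverage):
    """Total sequenced bases implied by a genome size and a fold coverage."""
    return genome_size_bp * coverage

def per_haplotype_depth(coverage, ploidy=2):
    """Approximate depth per haplotype if reads split evenly after phasing."""
    return coverage / ploidy

GENOME_SIZE_BP = 670_000_000  # 670 Mb, from the post above

for cov in (75, 80):
    gb = total_bases(GENOME_SIZE_BP, cov) / 1e9
    print(f"{cov}X -> about {gb:.1f} Gb of reads, "
          f"about {per_haplotype_depth(cov):.1f}X per haplotype")
```

The per-haplotype figure is the one that matters for the Quiver consensus on phased haplotigs, which is why total coverage alone can be misleading.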

For the conserved genes, we are going to use BUSCO, as advised in another thread here.

bostanict avatar Oct 24 '16 20:10 bostanict

Unzip and correction are experimental; I would certainly suggest using BUSCO gene completeness as a measure of sequence quality at different stages.
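One way to track this across stages is to pull the one-line score out of each BUSCO short summary and put the values side by side. The sketch below assumes the classic summary notation `C:..%[S:..%,D:..%],F:..%,M:..%,n:..`; other BUSCO versions may format the line differently. The function names and the stage labels are hypothetical.

```python
import re

# Classic BUSCO one-line score notation (assumed format).
BUSCO_LINE = re.compile(
    r"C:(?P<C>[\d.]+)%\[S:(?P<S>[\d.]+)%,D:(?P<D>[\d.]+)%\],"
    r"F:(?P<F>[\d.]+)%,M:(?P<M>[\d.]+)%,n:(?P<n>\d+)"
)

def parse_busco(line):
    """Parse a BUSCO score line into a dict of percentages plus n."""
    match = BUSCO_LINE.search(line)
    if match is None:
        raise ValueError(f"not a BUSCO score line: {line!r}")
    scores = {k: float(v) for k, v in match.groupdict().items() if k != "n"}
    scores["n"] = int(match.group("n"))
    return scores

def compare_stages(stage_lines):
    """Return (stage, complete %, fragmented %, missing %) for each stage.

    stage_lines maps a stage label (e.g. 'falcon', 'unzip') to its BUSCO
    score line, in the order the stages were run.
    """
    rows = []
    for stage, line in stage_lines.items():
        scores = parse_busco(line)
        rows.append((stage, scores["C"], scores["F"], scores["M"]))
    return rows
```

A rise in complete BUSCOs together with a drop in fragmented or missing ones between stages is the kind of signal to look for, since frame shifts tend to break gene models.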

rhallPB avatar Oct 24 '16 20:10 rhallPB

Thanks a lot @rhallPB, helpful as always~

bostanict avatar Oct 24 '16 20:10 bostanict

Hi,

Can I ask some related questions? For diploid genomes where we want a final haploid assembly, merging the p_ctg and a_ctg files before correction and scaffolding will leave two homologous copies for some parts of the genome, which is not desired. How can this be solved? Also, if I have already polished p_ctg on its own, will it make a difference to polish the p_ctg and a_ctg files separately?

Best, Quan

danshu avatar Oct 25 '16 02:10 danshu

The issue is that if you simply align against the p_ctg for correction, you will be mixing haplotypes in the consensus. If you are only interested in the haploid assembly, you should still include the a_ctg in the correction and then remove them afterwards. The most dramatic result of not including the a_ctg in correction is that SNP differences between haplotypes tend to get called as deletions in the consensus, introducing frame shifts. You may also want to check that multiple haplotypes have not been assembled out in the p_ctg, which can be the case for highly divergent regions. https://github.com/skingan/HomolContigsByAnnotation
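A sketch of the "include them, then remove them" step, applied to the polished FASTA. It assumes FALCON-style names where primary contigs look like `000000F` and associated contigs carry a dash suffix such as `000000F-001-01`, and that the consensus step may append a tag like `|quiver` to each name. Check the headers in your own files before relying on this; the function names are hypothetical.

```python
def parse_fasta(text):
    """Yield (header, sequence) pairs from FASTA-formatted text."""
    header, chunks = None, []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if header is not None:
                yield header, "".join(chunks)
            header, chunks = line[1:], []
        elif header is not None:
            chunks.append(line)
    if header is not None:
        yield header, "".join(chunks)

def is_primary(header):
    """True if the contig ID has no dash suffix (assumed naming scheme)."""
    contig_id = header.split()[0].split("|")[0]
    return "-" not in contig_id

def keep_primary(text):
    """Return only the primary-contig records from polished FASTA text."""
    return [(h, s) for h, s in parse_fasta(text) if is_primary(h)]
```

The records returned by `keep_primary` form the haploid set to carry forward into scaffolding, while the associated contigs have already done their job of absorbing the alternate-haplotype reads during correction.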

rhallPB avatar Oct 25 '16 15:10 rhallPB

Thank you so much for your explanation! @rhallPB

danshu avatar Oct 26 '16 10:10 danshu

I also have another question here, since it follows the same story.

Can you give me an idea of how long Quiver and Unzip will take to finish compared to Falcon itself?

I just want to know whether we need a cluster for them, as we did for Falcon, or whether we can run them on a single powerful machine.

Thanks a lot

bostanict avatar Oct 27 '16 20:10 bostanict

The unzip portion isn't as computationally intensive, or as parallel, as the initial Falcon run, but it does involve a lot of data access. For a 670Mb genome, if cluster access is an issue, it may be worth trying on a single powerful machine.

rhallPB avatar Oct 27 '16 21:10 rhallPB

and the quiver / arrow?

bostanict avatar Oct 27 '16 21:10 bostanict

Full Quiver / Arrow, including the mapping of all the raw reads, will take about as long as the initial Falcon assembly and should be run on a cluster.

rhallPB avatar Oct 27 '16 21:10 rhallPB

Thanks

bostanict avatar Oct 28 '16 12:10 bostanict

@rhallPB I have a quick question: in your experience and in practice, which one gives a better outcome for a diploid and highly heterozygous genome, Quiver or Arrow?

bostanict avatar Nov 17 '16 16:11 bostanict

From experience, Arrow results in fewer frame shifts in predicted genes for complex genomes, but in controlled validation on E. coli, Quiver has a slight advantage. Obviously that's for RSII data; Arrow is the only option for Sequel.

rhallPB avatar Nov 17 '16 17:11 rhallPB