TOGA
TOGA copied to clipboard
HAL alignments and chaining: post processing TOGA outputs
Hi Dr. @MichaelHiller
I have couple of queries regarding my TOGA outputs. I am working with vertebrates genomes (size ranging from 550Mb to 1Gb, all genomes that I considered are of very good quality genomes, BUSCO > 90%). I wanted to explore gene loss among my selected group of species (PS: attached Phylogenetic fig of my data). Currently I considered only one species as reference, which I marked as Ref in tree. I wanted to see how many genes and what genes are lost in all my query genomes.
For this I used HAL alignment because your pipeline make_lastz_chains is NOT working with test data https://github.com/hillerlab/make_lastz_chains/issues/46 and https://github.com/hillerlab/make_lastz_chains/issues/45. I also raised issue there but I guess because of busy schedule they are not yet resolved.
To convert Hal into chain I used following command:
-
hal2fasta
and thenfaToToBit
for .2bit -
halStats
to .bed sequences; for all query genomes to get .psl -
halLiftover
to get .psl between ref and target -
axtChain
to get the .chain alignment between each genome.
This hal to chain
process took 3-4 hrs for most of the genomes but for some it took 6-7 hrs.
Further I used these .chain files to run the TOGA, because my reference genome is very well annotated and with isoform file I got list of genes which are lost and other category as well in loss_summ_data.tsv
.
I varified with some already published genome that found some genes lost for my focal/query species and I also got these genes as 'lost' in my loss_summ file.
Some questions regarding my outputs:
- Is this method correct to get the gene loss (I mean using Hal to chain)
- how to arrange the pairwise comparison – so that we can logically determine the actual “loss” in the “query”, and rule out the “gain” in the “ref”.
- For example in one query genome I got 1000 gene lost but
out of 1000 loss are 453
and rest are "novel gens" in Ref species. Out of these 453 lots of genes are have identical gene descriptions but still have different gene ids
I would really appreciate any suggestion from your end.
Thank you very much