TOGA icon indicating copy to clipboard operation
TOGA copied to clipboard

HAL alignments and chaining: post processing TOGA outputs

Open vinitamehlawat opened this issue 1 year ago • 10 comments

Hi Dr. @MichaelHiller

I have couple of queries regarding my TOGA outputs. I am working with vertebrates genomes (size ranging from 550Mb to 1Gb, all genomes that I considered are of very good quality genomes, BUSCO > 90%). I wanted to explore gene loss among my selected group of species (PS: attached Phylogenetic fig of my data). Currently I considered only one species as reference, which I marked as Ref in tree. I wanted to see how many genes and what genes are lost in all my query genomes.

For this I used HAL alignment because your pipeline make_lastz_chains is NOT working with test data https://github.com/hillerlab/make_lastz_chains/issues/46 and https://github.com/hillerlab/make_lastz_chains/issues/45. I also raised issue there but I guess because of busy schedule they are not yet resolved.

To convert Hal into chain I used following command:

  1. hal2fasta and then faToToBit for .2bit
  2. halStats to .bed sequences; for all query genomes to get .psl
  3. halLiftover to get .psl between ref and target
  4. axtChain to get the .chain alignment between each genome.

This hal to chain process took 3-4 hrs for most of the genomes but for some it took 6-7 hrs.

Further I used these .chain files to run the TOGA, because my reference genome is very well annotated and with isoform file I got list of genes which are lost and other category as well in loss_summ_data.tsv.

I varified with some already published genome that found some genes lost for my focal/query species and I also got these genes as 'lost' in my loss_summ file.

Some questions regarding my outputs:

  1. Is this method correct to get the gene loss (I mean using Hal to chain)
  2. how to arrange the pairwise comparison – so that we can logically determine the actual “loss” in the “query”, and rule out the “gain” in the “ref”.
  3. For example in one query genome I got 1000 gene lost but out of 1000 loss are 453 and rest are "novel gens" in Ref species. Out of these 453 lots of genes are have identical gene descriptions but still have different gene ids

I would really appreciate any suggestion from your end.

Thank you very much

Slide11

vinitamehlawat avatar Jan 31 '24 15:01 vinitamehlawat