align one genome to another fragmented genome

Open AlisaGU opened this issue 11 months ago • 11 comments

Hi, I have two huge genomes (ref: 48G; query: 20G). To obtain an accurate genome alignment, I annotated the transposable elements (TEs) and kept only the non-TE sequences longer than 50 bp as the reference genome. Unexpectedly, the last-train step was too slow, and I had to kill it after a 7-day run.

I tried swapping the ref and query, and things seemed worse: nothing was output after a two-day run.

Could you give me some tips for running this? Can I skip the last-train step and align them directly? Or is there a better way?

Best regards,

AlisaGU avatar Mar 03 '24 15:03 AlisaGU

Please can you show your commands/options for lastdb and last-train, and also the version (e.g. lastdb --version)?

mcfrith avatar Mar 03 '24 22:03 mcfrith

Sure.

Version: lastal 1542

lastdb command: $lastdb -P 20 ${reference_abbre} ${reference_genome}

last-train command: ${last_train} -P 20 --revsym -D1e9 --sample-number=5000 ${reference_abbre} ${query_genome_sequence} >${train_outfile}

The distribution of the ref genome after removing the TEs: (image)

Genomic fragments shorter than 50 bp are filtered out.

AlisaGU avatar Mar 04 '24 01:03 AlisaGU
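
The 50 bp length filter described above could look something like the following. This is only a minimal sketch, not the command actually used in this thread: the function name `filter_fasta_by_length` and the file names in the usage line are hypothetical. It assumes a standard FASTA file and keeps records with at least the given number of bases.

```shell
# Keep only FASTA records whose sequence has at least $1 bases.
# Multi-line sequences are joined before measuring, so each kept record
# is written back as a header line plus one unwrapped sequence line.
filter_fasta_by_length() {
    awk -v min="$1" '
        function emit() {
            if (header != "" && length(seq) >= min) print header "\n" seq
        }
        /^>/ { emit(); header = $0; seq = ""; next }   # new record starts
        { seq = seq $0 }                               # accumulate sequence
        END { emit() }                                 # last record
    '
}
```

Usage, with hypothetical file names: `filter_fasta_by_length 50 < non_te_fragments.fa > reference_filtered.fa`.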

Thanks! Since last-train only uses a sample of the query, I wouldn't expect it to be so slow. I guess the slowness may be caused by running out of memory.

I basically suggest following the "Aligning human & chimp genomes" recipe here: https://gitlab.com/mcfrith/last/-/blob/main/doc/last-cookbook.rst For your huge genomes, I would add this lastdb option: --bits=4.

In the recipe, -uRY128 reduces the run time and memory use. But it lowers the sensitivity, which is fine for closely-related genomes, but not distantly-related ones. If your genomes are distantly-related, you could try something like -uRY4 or -uRY8.

I guess it's not necessary to remove TEs (but I don't know for sure).

mcfrith avatar Mar 04 '24 04:03 mcfrith
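
As a rough sketch of mcfrith's suggestion above (not copied from the thread or the cookbook): the functions below only print candidate lastdb and last-train commands, so the options can be reviewed before anything is run. The database name, file names, thread count, and function names are all hypothetical, and the cookbook recipe may include further options that are omitted here. Only options mentioned in this thread are used.

```shell
# Hypothetical names; adjust to your data.
ref_fa=reference.fa
query_fa=query.fa
db=refdb
threads=20

# Print a lastdb command. $1 is the -uRY seed value:
# 128 for closely related genomes, 4 or 8 for more distant ones.
lastdb_cmd() {
    echo "lastdb -P$threads -uRY$1 --bits=4 $db $ref_fa"
}

# Print a last-train command without -D1e9 --sample-number=5000.
last_train_cmd() {
    echo "last-train -P$threads --revsym $db $query_fa > query.train"
}

lastdb_cmd 128
last_train_cmd
```

Once a printed command looks right, it can be pasted into the shell. Printing first is just a cautious choice for runs that may take days.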

TEs account for about 90% of the ref genome, and I removed them to speed up the genome alignment. So the slowness is unexpected. Would it be faster to use the whole genome with no TE removal?

AlisaGU avatar Mar 04 '24 15:03 AlisaGU

Sure, I would expect removing 90% TEs to be faster.

mcfrith avatar Mar 04 '24 21:03 mcfrith

However, the last-train step is slower after removing the 90% TEs, and I have no idea how to deal with that.

Can I ignore last-train and run the lastal step directly?

AlisaGU avatar Mar 05 '24 01:03 AlisaGU

Yes, you can ignore last-train, and run lastal directly. Then it will use some default, non-trained parameters. Which might work quite well, or badly, depending on your data.

But last-train should be much faster than the alignment step, whether you remove TEs or not... (I wouldn't use -D1e9 --sample-number=5000.)

mcfrith avatar Mar 05 '24 01:03 mcfrith
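
The two options discussed above (lastal with default parameters, or lastal with a train file) differ only in whether `-p` is passed. A minimal sketch, with a hypothetical function name, database name, and file names. It prints the command rather than running it:

```shell
# Print a lastal command.
# With a train file as $1, lastal uses the trained parameters (-p);
# with no argument, it falls back on its default parameters.
lastal_cmd() {
    if [ -n "${1:-}" ]; then
        echo "lastal -P20 -p $1 refdb query.fa > alignments.maf"
    else
        echo "lastal -P20 refdb query.fa > alignments.maf"
    fi
}

lastal_cmd                # default, non-trained parameters
lastal_cmd query.train    # parameters from a last-train output file
```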

OK, let me try last-train without -D1e9 --sample-number=5000.

AlisaGU avatar Mar 05 '24 02:03 AlisaGU

> I wouldn't use -D1e9 --sample-number=5000

It's also slow.

I previously generated a train file using the whole reference genome and the query genome. Can I use this train file as the lastal input?

AlisaGU avatar Mar 05 '24 08:03 AlisaGU

Yes, that train file sounds fine.

mcfrith avatar Mar 05 '24 08:03 mcfrith

Thanks!

AlisaGU avatar Mar 05 '24 09:03 AlisaGU