last-genome-alignments
align one genome to another fragmented genome
Hi, I have two huge genomes (ref: 48G; query: 20G). To obtain an accurate genome alignment, I annotated the transposable elements (TEs) and kept only the non-TE sequences longer than 50 bp as the reference genome. Unexpectedly, the last-train step was too slow, and I had to kill it after it had run for 7 days.
I tried swapping the ref and query, and things seemed worse: nothing was output after two days of running.
Could you give me some tips for running this? Can I skip the last-train step and align the genomes directly? Or is there a better way?
Best regards,
Please can you show your commands/options for lastdb and last-train, and also the version (e.g. lastdb --version)?
Sure.
Version: lastal 1542
lastdb command: $lastdb -P 20 ${reference_abbre} ${reference_genome}
last-train: ${last_train} -P 20 --revsym -D1e9 --sample-number=5000 ${reference_abbre} ${query_genome_sequence} >${train_outfile}
The distribution of the ref genome after removing the TEs:
Genomic fragments shorter than 50 bp are filtered out.
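For readers who want to reproduce the length filter, here is a minimal sketch of one way to do it with awk. It is not the poster's actual script: filter_fasta_by_length is a hypothetical helper name, the file names in the comment are placeholders, and it assumes that fragments of exactly 50 bp are kept.

```shell
# Keep only FASTA records whose total sequence length is >= the given minimum.
# Reads FASTA on stdin and writes each kept record as a header plus one
# sequence line on stdout. filter_fasta_by_length is a hypothetical helper,
# not part of LAST.
filter_fasta_by_length() {
    awk -v min="$1" '
        # Print the record collected so far if it is long enough.
        function emit() {
            if (header != "" && length(seq) >= min) {
                print header
                print seq
            }
        }
        /^>/ { emit(); header = $0; seq = ""; next }   # new record starts
        { seq = seq $0 }                               # join wrapped sequence lines
        END { emit() }                                 # do not forget the last record
    '
}

# Example (placeholder file names):
#   filter_fasta_by_length 50 < non_te_fragments.fa > non_te_min50.fa
```

Joining wrapped lines before measuring matters, because a record's length is the sum of all its sequence lines, not the length of any single line.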
Thanks! Since last-train only uses a sample of the query, I wouldn't expect it to be so slow. I guess the slowness may be caused by running out of memory.
I basically suggest following the "Aligning human & chimp genomes" recipe here: https://gitlab.com/mcfrith/last/-/blob/main/doc/last-cookbook.rst
For your huge genomes, I would add this lastdb option: --bits=4.
In the recipe, -uRY128 reduces the run time and memory use. But it lowers the sensitivity, which is fine for closely-related genomes, but not distantly-related ones. If your genomes are distantly-related, you could try something like -uRY4 or -uRY8.
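Putting the two lastdb suggestions above together, the database-building step might look roughly like the sketch below. The database name, FASTA file name, and the run helper are placeholders introduced here for illustration, and the combination has not been tested on these genomes. By default the script only prints the command so it can be checked first.

```shell
# Dry-run wrapper: prints the command unless DRY_RUN=0 is set in the environment.
# "run" is a hypothetical helper, not part of LAST.
DRY_RUN=${DRY_RUN:-1}
run() {
    if [ "$DRY_RUN" = 1 ]; then
        echo "$*"
    else
        "$@"
    fi
}

threads=20
seed=RY128          # from the recipe; RY8 or RY4 for distantly-related genomes
db=refdb            # placeholder database name
ref=reference.fa    # placeholder reference FASTA

# --bits=4 is the option suggested above for huge genomes.
run lastdb -P"$threads" --bits=4 -u"$seed" "$db" "$ref"
```

Setting DRY_RUN=0 runs lastdb for real. Changing only the seed variable makes it easy to compare RY128 against RY8 or RY4 if sensitivity turns out to be a problem.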
I guess it's not necessary to remove TEs (but I don't know for sure).
TEs account for about 90% of the ref genome, and I removed them to speed up the genome alignment, so the slowness is unexpected. Would it be faster to use the whole genome without TE removal?
Sure, I would expect removing 90% TEs to be faster.
However, with the 90% TE content removed, the last-train step is slower, and I have no idea how to deal with that.
Can I skip last-train and run the lastal step directly?
Yes, you can skip last-train and run lastal directly. It will then use default, non-trained parameters, which might work quite well or badly, depending on your data.
But last-train should be much faster than the alignment step, whether you remove TEs or not...
(I wouldn't use -D1e9 --sample-number=5000.)
OK, let me try last-train without -D1e9 --sample-number=5000.
It's also slow.
I previously generated a train file using the whole reference genome and the query genome. Can I use this train file as the lastal input?
Yes, that train file sounds fine.
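As a rough sketch of the two options discussed in this thread, the alignment step can be run either with an existing train file (passed to lastal with -p) or with no train file at all, in which case lastal falls back to its default parameters. The align and run helpers and all file names below are placeholders, and the cookbook recipe adds further lastal options that are left out here.

```shell
# Dry-run wrapper: prints the command unless DRY_RUN=0 is set in the environment.
# "run" and "align" are hypothetical helpers, not part of LAST.
DRY_RUN=${DRY_RUN:-1}
run() {
    if [ "$DRY_RUN" = 1 ]; then
        echo "$*"
    else
        "$@"
    fi
}

threads=20

# align DB QUERY [TRAIN]
#   DB    = lastdb database name
#   QUERY = query FASTA file
#   TRAIN = optional last-train output; omitted means default parameters
align() {
    if [ -n "${3:-}" ]; then
        run lastal -P"$threads" -p "$3" "$1" "$2"
    else
        run lastal -P"$threads" "$1" "$2"
    fi
}

# Reuse a train file made earlier from the whole genomes (placeholder names).
align refdb query.fa whole_genome.train
```

With DRY_RUN=0, lastal writes its alignments to standard output, so the call would normally be redirected to a file.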
Thanks!