last-genome-alignments
last-genome-alignments copied to clipboard
last-genome-alignments
Here are some pair-wise genome alignments made with LAST.
2023 alignments
The 2023
directory has alignments of these genomes:
Code | Phylum | Animal | Scientific name | Genome |
---|---|---|---|---|
helRob | annelid | jawless leech | Helobdella robusta | GCF_000326865.1 |
hirMed | annelid | medicinal leech | Hirudo medicinalis | GCA_011800805.1 |
lamSat | annelid | Satsuma tubeworm | Lamellibrachia satsuma | GCA_022478865.1 |
oweFus | annelid | shingle tubeworm | Owenia fusiformis | GCA_903813345.2 |
armNas | arthropod | woodlouse | Armadillidium nasatum | GCA_009176605.1 |
cenScu | arthropod | scorpion | Centruroides sculpturatus | GCF_000671375.1 |
droMel | arthropod | fruit fly | Drosophila melanogaster | GCF_000001215.4 |
gloMae | arthropod | millipede | Glomeris maerens | GCA_023279145.1 |
homAme | arthropod | lobster | Homarus americanus | GCF_018991925.1 |
limPol | arthropod | horseshoe crab | Limulus polyphemus | GCF_000517525.1 |
macAtr | arthropod | robber fly | Machimus atricapillus | GCA_933228815.1 |
strMar | arthropod | centipede | Strigamia maritima | GCA_000239455.1 |
linAna | brachiopod | shamisen shell | Lingula anatina | GCF_001039355.2 |
asyLuc | chordate | Bahama lancelet | Asymmetron lucayanum | GCA_001663935.1 |
braFlo | chordate | lancelet | Branchiostoma floridae | GCF_000003815.2 |
calMil | chordate | chimaera | Callorhinchus milii | GCF_018977255.1 |
eptBur | chordate | hagfish | Eptatretus burgeri | GCA_024346535.1 |
homSap | chordate | human | Homo sapiens | hg38_no_alt_analysis_set |
petMar | chordate | lamprey | Petromyzon marinus | GCF_010993605.1 |
acrMil | cnidaria | stony coral | Acropora millepora | GCF_013753865.1 |
actTen | cnidaria | waratah anemone | Actinia tenebrosa | GCF_009602425.1 |
epiPla | cnidaria | zoanthid | Epizoanthus planus | GCA_025388665.1 |
nemVec | cnidaria | starlet sea anemone | Nematostella vectensis | GCF_932526225.1 |
horCal | ctenophore | sea gooseberry | Hormiphora californensis | GCA_020137815.1 |
mneLei | ctenophore | sea walnut | Mnemiopsis leidyi | GCA_000226015.1 |
apoJap | echinoderm | sea cucumber | Apostichopus japonicus | GCA_002754855.1 |
strPur | echinoderm | sea urchin | Strongylocentrotus purpuratus | GCF_000002235.5 |
ptyFla | hemichordate | Hawaiian acorn worm | Ptychodera flava | GCA_001465055.1 |
sacKow | hemichordate | acorn worm | Saccoglossus kowalevskii | GCF_000003605.2 |
aplCal | mollusc | sea hare | Aplysia californica | GCF_000002075.1 |
craGig | mollusc | oyster | Crassostrea gigas | GCF_902806645.1 |
halRuf | mollusc | abalone | Haliotis rufescens | GCF_023055435.1 |
limBul | mollusc | sea butterfly | Limacina bulimoides | GCA_009866985.1 |
mizYes | mollusc | scallop | Mizuhopecten yessoensis | GCF_002113885.1 |
octBim | mollusc | octopus | Octopus bimaculoides | GCF_001194135.2 |
phoLin | mollusc | top snail | Phorcus lineatus | GCA_921293015.1 |
watSci | mollusc | firefly squid | Watasenia scintillans | GCA_015471945.1 |
phoOva | phoronid | horseshoe worm | Phoronis ovalis | GCA_028565635.1 |
The alignments were made with LAST version 1453, like this:
lastdb -P8 -uMAM8 -c myDB genome1.fa
last-train -P8 --revsym -D1e9 --sample-number=5000 myDB genome2.fa > my.train
lastal -P8 -D1e9 -m100 --split-f=MAF+ -p my.train myDB genome2.fa > many-to-one.maf
last-split -r many-to-one.maf > one-to-one.maf
This is currently the recommended way to compare distantly-related genomes, where most of the DNA lacks similarity.
-
-P8
makes it faster by using 8 threads: adjust as suitable for your computer. This has no effect on the results. -
-uMAM8
and-m100
strive for high sensitivity, but use a lot of memory and run time. To go much faster, omit-m100
. To halve the memory use and run time, changeMAM8
toMAM4
. -
--sample-number=5000
makeslast-train
use more samples of genome2, for fear that most of genome2 lacks similarity to genome1. For the same reason,-D1e9
is used withlast-train
, to avoid weak chance similarities more strictly.
2022 alignments
The 2022
directory has various alignments of these genomes:
Genome name | Animal | Source | Assembly name (if different) |
---|---|---|---|
allMis28112v4 | alligator | NCBI | ASM28112v4 |
cerSim1 | rhinoceros | UCSC | |
chrPic3.0.3 | turtle | NCBI | Chrysemys_picta_bellii-3.0.3 |
equCab3 | horse | UCSC | |
hg38 | human | UCSC | hg38.analysisSet |
mOrnAna1.pri.v4 | platypus | NCBI |
They were made with LAST version 1411, using the recipe below under "2021 alignments".
2021 alignments
The 2021
directory has various alignments of these genomes:
Genome name | Animal | Source | Assembly name (if different) |
---|---|---|---|
allMis28112v4 | alligator | NCBI | ASM28112v4 |
Bfl_VNyyK | lancelet | NCBI | |
calMil1 | chimaera | UCSC | |
chrPic3.0.3 | turtle | NCBI | Chrysemys_picta_bellii-3.0.3 |
hg38 | human | UCSC | hg38.analysisSet |
kPetMar1 | lamprey | NCBI | kPetMar1.pri |
latCha1 | coelacanth | UCSC | |
lepOcu1 | gar | NCBI | LepOcu1 |
xenTro10 | frog | NCBI | UCB_Xtro_10.0 |
They can be replicated by running LAST version >= 1387 like this:
lastdb -P8 -uMAM8 myDB genome1.fa
last-train -P8 --revsym -D1e9 --sample-number=5000 myDB genome2.fa > my.train
lastal -P8 -D1e9 -m100 --split-f=MAF+ -p my.train myDB genome2.fa > many-to-one.maf
last-split -r many-to-one.maf | last-postmask > out.maf
-
The
-P8
option makes it faster by using 8 threads: adjust as appropriate for your computer. This has no effect on the results. -
The
-uMAM8
and-m100
strive for high sensitivity, but make thelastal
command use much time and memory, e.g. several days and hundreds of gigabytes. -
You can trade off multi-threading and memory use (with no effect on results), see here.
2017 alignments
Warning: these recipes were for an older version of LAST.
-
Since LAST version 1205,
-R01
has no effect and can be omitted (because it's the default). -
For LAST version >= 1180, it's best to add option
-fMAF+
to the first (many-to-one)last-split
. (In older versions,-fMAF+
was the default.) -
Since LAST version 983,
last-split
option-m1
has no effect and can be omitted (because it's the default).
2017 human-ape alignments
The human genome (hg38) was aligned to chimp (panTro5) and gorilla
(gorGor5), as follows. This alignment recipe is very
accurate-but-slow. A faster recipe would mask repeats during
alignment,
and/or omit -m50
.
First, an "index" of the human genome was prepared, suitable for comparing it to highly-similar sequences:
lastdb -P0 -uNEAR -R01 hg38-NEAR hg38_no_alt_analysis_set.fa
Then, substitution and gap frequencies were determined:
last-train -P0 --revsym --matsym --gapsym -E0.05 -C2 hg38-NEAR panTro5.fa > hg38-panTro5.mat
- Human-chimp parameters: hg38-panTro5.mat
- Human-gorilla parameters: hg38-gorGor5.mat
Next, many-to-one ape-to-human alignments were made:
lastal -m50 -E0.05 -C2 -p hg38-panTro5.mat hg38-NEAR panTro5.fa | last-split -m1 > hg38-panTro5-1.maf
The above command was the slowest step (3 CPU-weeks). You can "easily" parallelize it, by processing each sequence within panTro5.fa separately (in parallel). But each process uses quite a lot of memory, so take care that multiple parallel runs don't exceed your memory.
- Human-chimp many-to-one alignments: hg38-panTro5-1.maf.gz
- Human-gorilla many-to-one alignments: hg38-gorGor5-1.maf.gz
Next, one-to-one ape-to-human alignments were made:
maf-swap hg38-panTro5-1.maf |
awk '/^s/ {$2 = (++s % 2 ? "panTro5." : "hg38.") $2} 1' |
last-split -m1 |
maf-swap > hg38-panTro5-2.maf
The awk command prepends the assembly name to each chromosome name (e.g. chr7 -> hg38.chr7).
- Human-chimp one-to-one alignments: hg38-panTro5-2.maf.gz
- Human-gorilla one-to-one alignments: hg38-gorGor5-2.maf.gz
Finally, simple-sequence alignments were discarded, the alignments were converted to tabular format, and alignments with error probability > 10^-5 were discarded:
last-postmask hg38-panTro5-2.maf |
maf-convert -n tab |
awk -F'=' '$2 <= 1e-5' > hg38-panTro5.tab
- Human-chimp tabular alignments: hg38-panTro5.tab.gz (dotplot)
- Human-gorilla tabular alignments: hg38-gorGor5.tab.gz (dotplot)
2017 human-mouse alignments
The human genome (hg38) was aligned to mouse (mm10). This alignment recipe is even more slow-and-sensitive.
First, an "index" of the human genome was prepared, suitable for comparing it to less-similar sequences:
lastdb -P0 -uMAM4 -R01 hg38-MAM4 hg38_no_alt_analysis_set.fa
Then, substitution and gap frequencies were determined:
last-train -P0 --revsym --matsym --gapsym -E0.05 -C2 hg38-MAM4 mm10.fa > hg38-mm10.mat
- Human-mouse parameters: hg38-mm10.mat
- Human-dog parameters: hg38-canFam3.mat
Next, many-to-one mouse-to-human alignments were made:
lastal -m100 -E0.05 -C2 -p hg38-mm10.mat hg38-MAM4 mm10.fa | last-split -m1 > hg38-mm10-1.maf
- Human-mouse many-to-one alignments: hg38-mm10-1.maf.gz
Finally, one-to-one MAF alignments, and high-confidence tabular alignments, were made in the same way as above.
-
Human-mouse one-to-one alignments: hg38-mm10-2.maf.gz
-
Human-mouse tabular alignments: hg38-mm10.tab.gz (dotplot)