merqury
merqury copied to clipboard
k-mer based assembly evaluation
Merqury
Evaluate genome assemblies with k-mers and more
Often, genome assembly projects have illumina whole genome sequencing reads available for the assembled individual.
The k-mer spectrum of this read set can be used for independently evaluating assembly quality without the need of a high quality reference.
Merqury provides a set of tools for this purpose.
Dependency
- gcc 7.4 or higher (for installing meryl)
- meryl v1.3
- Java run time environment (JRE)
- R with argparse, ggplot2, and scales (recommend R 4.0.3+)
- bedtools
- samtools
Note that igvtools is no longer used. The .tdf files are replaced with .wig files, compatable to IGV and UCSC genome browser.
Installation
Stable Release
Direct installation
- Get a working Meryl in your PATH
Download meryl release: https://github.com/marbl/meryl/releases/tag/v1.3
tar -xJf meryl-1.3.*.tar.xz
cd meryl-1.3/bin
export PATH=$pwd:$PATH
If the binary doesn't work, download the source and compile:
cd meryl/src
make -j 24
export PATH=/path/to/meryl/…/bin:$PATH
See if we get help message with meryl.
- Download the release version and set env variable
$MERQURY
wget https://github.com/marbl/merqury/archive/v1.3.tar.gz
tar -zxvf v1.3.tar.gz
cd merqury-1.3
export MERQURY=$PWD
Add the “export” part to your environment for both meryl and MERQURY (~/.bash_profile or ~/.profile).
Add installation dir paths for bedtools and samtools to your environment.
source it.
Through Conda
Thanks to @EdHarry, a conda recipe is now available: https://anaconda.org/bioconda/merqury
On a new conda environment, run:
conda install -c conda-forge -c bioconda merqury
Or, if you have a different version of jdk installed or want to have a separate conda environnment for merqury:
conda create -n merqury -c conda-forge -c bioconda merqury openjdk=11
You will then need to activate the merqury environment before using it with:
conda activate merqury
Test running
Rscript $MERQURY/plot/plot_spectra_cn.R --help
In case R complains for version mismatches of the R packages, try
conda update --all
It seems like R in conda isn't maintained anymore. Try to modify channel priority in .condarc.
Run
- !! Merqury assumes all meryl dbs (dirs) are named with
.meryl. !!
On a single machine:
ln -s $MERQURY/merqury.sh # Link merqury
./merqury.sh <read-db.meryl> [<mat.meryl> <pat.meryl>] <asm1.fasta> [asm2.fasta] <out>
Usage: merqury.sh <read-db.meryl> [<mat.meryl> <pat.meryl>] <asm1.fasta> [asm2.fasta] <out>
<read-db.meryl> : k-mer counts of the read set
<mat.meryl> : k-mer counts of the maternal haplotype (ex. mat.only.meryl or mat.hapmer.meryl)
<pat.meryl> : k-mer counts of the paternal haplotype (ex. pat.only.meryl or pat.hapmer.meryl)
<asm1.fasta> : Assembly fasta file (ex. pri.fasta, hap1.fasta or maternal.fasta)
[asm2.fasta] : Additional fasta file (ex. alt.fasta, hap2.fasta or paternal.fasta)
*asm1.meryl and asm2.meryl will be generated. Avoid using the same names as the hap-mer dbs
<out> : Output prefix
< > : required
[ ] : optional
Example
Below is showing examples how to run Merqury using the prebuilt meryl dbs on a. thaliana F1 hybrid. The fasta files are the trio-binned assemblies from Koren et al.
### Download assemblies ###
wget https://gembox.cbcb.umd.edu/triobinning/athal_COL.fasta
wget https://gembox.cbcb.umd.edu/triobinning/athal_CVI.fasta
### Download prebuilt meryl dbs ###
# read.meryl of the F1 hybrid between COL-0 and CVI-0
wget https://obj.umiacs.umd.edu/marbl_publications/merqury/athal/a_thal.k18.meryl.tar.gz
# hap-mers for COL-0 haplotype
wget https://obj.umiacs.umd.edu/marbl_publications/merqury/athal/a_thal.col0.hapmer.meryl.tar.gz
# hap-mers for CVI-0 haplotype
wget https://obj.umiacs.umd.edu/marbl_publications/merqury/athal/a_thal.cvi0.hapmer.meryl.tar.gz
# Untar
for gz in *.tar.gz
do
tar -zxf $gz
done
# Run merqury
$MERQURY/merqury.sh F1.k18.meryl col0.hapmer.meryl cvi0.hapmer.meryl athal_COL.fasta athal_CVI.fasta test
1. I have one assembly (pseudo-haplotype or mixed-haplotype)
# I don't have the hap-mers
$MERQURY/merqury.sh read-db.meryl asm1.fasta out_prefix
# Using the example above
$MERQURY/merqury.sh F1.k18.meryl athal_COL.fasta test-1
# I have the hap-mers
$MERQURY/merqury.sh read-db.meryl mat.meryl pat.meryl asm1.fasta out_prefix
# Using the example above
$MERQURY/merqury.sh F1.k18.meryl col0.hapmer.meryl cvi0.hapmer.meryl athal_COL.fasta test-1
2. I have two assemblies (diploid)
# I don't have the hap-mers
$MERQURY/merqury.sh read-db.meryl asm1.fasta asm2.fasta out_prefix
# Using the example above
$MERQURY/merqury.sh F1.k18.meryl athal_COL.fasta athal_CVI.fasta test-2
# I have the hap-mers
$MERQURY/merqury.sh read-db.meryl mat.meryl pat.meryl asm1.fasta asm2.fasta out_prefix
# Using the example above
$MERQURY/merqury.sh F1.k18.meryl col0.hapmer.meryl cvi0.hapmer.meryl athal_COL.fasta athal_CVI.fasta test-2
- Note there is no need to run merqury per-assemblies. Give two fasta files, Merqury generates stats for each and combined.
How to parallelize
Merqury starts with eval/spectra_cn.sh.
When hap-mers are provided, merqury runs modules under trio/ in addition to eval/spectra_cn.sh.
The following can run at the same time. Modules with dependency are followed by arrows (->).
eval/spectra_cn.sh->trio/spectra_hap.shtrio/hap_blob.shtrio/phase_block.shper assembly ->trio/block_n_stats.sh
Meryl, the k-mer counter inside, uses the maximum cpus available.
Set OMP_NUM_THREADS=24 for example to use 24 threads.
On slurm environment, simply run:
ln -s $MERQURY/_submit_merqury.sh # Link merqury
./_submit_merqury.sh <read-db.meryl> [<mat.meryl> <pat.meryl>] <asm1.fasta> [asm2.fasta] <out>
Change the sbatch to match your environment. (ex. partition)
Outputs from each modules
eval/spectra_cn.sh: k-mer completeness, qv, spectra-cn and spectra-asm plots, asm-only.bedand.tdffor tracking errorseval/qv.sh: just get the qv stats and quit.trio/spectra_hap.sh: hap-mer level spectra-cn plots, hap-mer completenesstrio/hap_blob.sh: blob plots of the hap-mers in each contg/scaffoldtrio/phase_block.sh: phase block statistics, phase block N* plots, hap-mer tracks (.bedand.tdffiles)trio/block_n_stats.sh: continuity plots (phase block N* or NG* plots, phase block vs. contig/scaffold plots)trio/switch_error.sh: this is run part ofphase_blck.sh, however can be re-run with desired short-range switch parameters. Runtrio/block_n_stats.shalong with it to get the associated plots.
Tips for helps
Run each script without any parameters if not sure what to do.
For example, ./trio/switch_error.sh will give a help message and quit.
Following wiki pages have more detailed examples.
1. Prepare meryl dbs (details)
- Get the right k size
- Build k-mer dbs with meryl
- Build hap-mers for trios
2. Overall assembly evaluation (details)
- Reference free QV estimate
- k-mer completeness (recovery rate)
- Spectra copy number analysis
- Track error bases in the assembly
3. Phasing assessment with hap-mers (details)
- Inherited hap-mer plots
- Hap-mer blob plots
- Hap-mer completeness (recovery rate)
- Spectra copy number analysis per hap-mers
- Phased block statistics and switch error rates
- Track each haplotype block in the assembly
Available pre-built meryl dbs
Meryl dbs from Illumina WGS and hapmers are available here for
- A. thaliana COL-0 x CVI-0 F1
- NA12878 (HG001)
- HG002
Citing merqury
Please use this paper to cite Merqury:
Rhie, A., Walenz, B.P., Koren, S. et al. Merqury: reference-free quality, completeness, and phasing assessment for genome assemblies. Genome Biol 21, 245 (2020). https://doi.org/10.1186/s13059-020-02134-9