kmer-db
Kmer-db is a fast and memory-efficient tool for large-scale k-mer analyses (indexing, querying, estimating evolutionary relationships, etc.).
Quick start
git clone https://github.com/refresh-bio/kmer-db
cd kmer-db && make
INPUT=./test/virus
OUTPUT=./output
mkdir $OUTPUT
# build a database from all 18-mers (default) contained in a set of sequences
./kmer-db build $INPUT/seqs.part1.list $OUTPUT/k18.db
# establish numbers of common k-mers between new sequences and the database
./kmer-db new2all $OUTPUT/k18.db $INPUT/seqs.part2.list $OUTPUT/n2a.csv
# calculate jaccard index from common k-mers
./kmer-db distance $OUTPUT/n2a.csv
# extend the database with new sequences
./kmer-db build -extend $INPUT/seqs.part2.list $OUTPUT/k18.db
# establish numbers of common k-mers between all sequences in the database
./kmer-db all2all $OUTPUT/k18.db $OUTPUT/a2a.csv
# build a database from 10% of 25-mers using 16 threads
./kmer-db build -k 25 -f 0.1 -t 16 $INPUT/seqs.part1.list $OUTPUT/k25.db
# establish number of common 25-mers between single sequence and the database
# (minhash filtering that retains 10% of MT159713 k-mers is done automatically prior to the comparison)
./kmer-db one2all $OUTPUT/k25.db $INPUT/data/MT159713.fasta $OUTPUT/MT159713.csv
Table of contents
- Installation
- Usage
  - Building a database
  - Counting common k-mers
  - Calculating similarities or distances
  - Storing minhashed k-mers
- Datasets
1. Installation
Kmer-db comes with a set of precompiled binaries for Linux, OS X, and Windows. The software is also available on Bioconda:
conda install -c bioconda kmer-db
For detailed instructions on how to set up Bioconda, please refer to the Bioconda manual. Kmer-db can also be built from the sources distributed as:
- MAKE project (G++ 5.5.0 tested) for Linux and OS X,
- Visual Studio 2015 solution for Windows.
zlib linking
Kmer-db uses zlib for handling gzipped inputs. Under Linux, the software is by default linked against the system-installed zlib. Due to issues with some library versions, a precompiled zlib is also present in the repository. To use it, modify the INTERNAL_ZLIB variable at the top of the makefile. Under Windows, the repository library is always used.
AVX and AVX2 support
Kmer-db, by default, takes advantage of AVX (required) and AVX2 (optional) CPU extensions. The pre-built binary determines the supported instructions at runtime, so the same binary runs on CPUs with and without AVX2. When compiling the sources under Linux and OS X, AVX2 support is also detected automatically. Under Windows, the program is by default built with AVX2 instructions. To prevent this, Kmer-db must be compiled with the NO_AVX2 symbolic constant defined.
2. Usage
kmer-db <mode> [options] <positional arguments>
Kmer-db operates in one of the following modes:
- build - building a database from samples,
- all2all - counting common k-mers - all samples in the database,
- new2all - counting common k-mers - set of new samples versus database,
- one2all - counting common k-mers - single sample versus database,
- distance - calculating similarities/distances,
- minhash - storing minhashed k-mers.
Common options:
- -t <threads> - number of threads (default: number of available cores).
The meaning of other options and positional arguments depends on the selected mode.
2.1. Building a database
Construction of a k-mer database is an obligatory step for further analyses. The procedure accepts several input types:
- compressed or uncompressed genomes/reads:
  kmer-db build [-k <kmer-length>] [-f <fraction>] [-multisample-fasta] [-extend] [-t <threads>] <sample_list> <database>
- KMC-generated k-mers:
  kmer-db build -from-kmers [-f <fraction>] [-extend] [-t <threads>] <sample_list> <database>
- minhashed k-mers produced by minhash mode:
  kmer-db build -from-minhash [-extend] [-t <threads>] <sample_list> <database>
Parameters:
- sample_list (input) - file containing list of samples in the following format:
  sample_file_1
  sample_file_2
  sample_file_3
  ...
  By default, the tool requires uncompressed or compressed FASTA files for each sample. If a file on the list cannot be found, the package tries adding the following extensions: fna, fasta, gz, fna.gz, fasta.gz. When the -from-kmers switch is specified, corresponding KMC-generated k-mer files (.kmc_pre and .kmc_suf) are required. If the -from-minhash switch is present, minhashed k-mer files (.minhash) must be generated by the minhash command prior to the database construction. Note that minhashing may also be done during the database construction by specifying the -f option.
- database (output) - file with generated k-mer database.
- -k <kmer-length> - length of k-mers (default: 18); ignored when the -from-kmers or -from-minhash switch is specified.
- -f <fraction> - fraction of all k-mers to be accepted by the minhash filter during database construction (default: 1); ignored when the -from-minhash switch is present.
- -multisample-fasta - each sequence in a FASTA file is treated as a separate sample,
- -extend - extend the existing database with new samples,
- -t <threads> - number of threads (default: number of available cores).
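The sample list described above can be generated with a few lines of scripting. The following Python sketch is only an illustration: it assumes one sample path per line, and the function name `write_sample_list` and the choice of suffixes are hypothetical, not part of Kmer-db.

```python
from pathlib import Path

# Suffixes treated as FASTA input in this sketch (an assumption, chosen to
# mirror the extensions the tool tries when a listed file is not found).
FASTA_SUFFIXES = (".fna", ".fasta", ".fna.gz", ".fasta.gz")


def write_sample_list(fasta_dir, list_path):
    """Write one sample file path per line and return the paths written."""
    samples = sorted(
        str(p) for p in Path(fasta_dir).iterdir()
        if p.is_file() and p.name.endswith(FASTA_SUFFIXES)
    )
    Path(list_path).write_text("".join(s + "\n" for s in samples))
    return samples
```

The resulting file can then be passed as <sample_list> to build mode.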
2.2. Counting common k-mers
Samples in the database against each other:
kmer-db all2all [-buffer <size_mb>] [-sparse] [-t <threads>] <database> <common_table>
Parameters:
- database (input) - k-mer database file created by build mode,
- common_table (output) - file containing table with common k-mer counts,
- -buffer <size_mb> - size of cache buffer in megabytes; use L3 size for Intel CPUs and L2 for AMD for best performance (default: 8),
- -sparse - stores output matrix in a sparse form,
- -t <threads> - number of threads (default: number of available cores).
New samples against the database:
kmer-db new2all [-multisample-fasta | -from-kmers | -from-minhash] [-sparse] [-t <threads>] <database> <sample_list> <common_table>
Parameters:
- database (input) - k-mer database file created by build mode.
- sample_list (input) - file containing list of samples in one of the supported formats (see build mode); if samples are given as genomes (default) or k-mers (-from-kmers switch), the minhashing is done automatically with the same filter as in the database.
- common_table (output) - file containing table with common k-mer counts.
- -multisample-fasta / -from-kmers / -from-minhash - see build mode for details.
- -sparse - stores output matrix in a sparse form,
- -t <threads> - number of threads (default: number of available cores).
Single sample against the database:
kmer-db one2all [-from-kmers | -from-minhash] [-t <threads>] <database> <sample> <common_table>
The meaning of the parameters is the same as in new2all mode, but instead of specifying a file with a sample list, a single sample file is used as a query.
Output format
Modes all2all, new2all, and one2all produce a comma-separated table with numbers of common k-mers. The table is by default stored in a dense form:
kmer-length: k fraction: f | db-samples | s1 | s2 | ... | sn |
query-samples | total-kmers | |s1| | |s2| | ... | |sn| |
q1 | |q1| | |q1 ∩ s1| | |q1 ∩ s2| | ... | |q1 ∩ sn| |
q2 | |q2| | |q2 ∩ s1| | |q2 ∩ s2| | ... | |q2 ∩ sn| |
... | ... | ... | ... | ... | ... |
qm | |qm| | |qm ∩ s1| | |qm ∩ s2| | ... | |qm ∩ sn| |
where:
- k - k-mer length,
- f - minhash fraction (1, when minhashing is disabled),
- s1, s2, ..., sn - database sample names,
- q1, q2, ..., qm - query sample names,
- |a| - number of k-mers in sample a,
- |a ∩ b| - number of k-mers common for samples a and b.
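A dense table of this shape can be loaded with the standard `csv` module. The sketch below is a minimal illustration under stated assumptions: fields are separated by commas, lines may end with a trailing comma, and rows may be shorter than the header (as in the lower triangular all2all output). The function name `parse_dense_table` is hypothetical.

```python
import csv
import io


def parse_dense_table(text):
    """Parse a dense common-k-mer table into plain Python structures."""
    rows = []
    for row in csv.reader(io.StringIO(text)):
        cells = [c.strip() for c in row]
        # Drop empty trailing cells so a trailing comma on a line is harmless.
        while cells and cells[-1] == "":
            cells.pop()
        if cells:
            rows.append(cells)

    header, totals, *queries = rows
    db_names = header[2:]                                  # s1 ... sn
    db_sizes = dict(zip(db_names, map(int, totals[2:])))   # |s1| ... |sn|

    query_sizes, common = {}, {}
    for name, size, *counts in queries:
        query_sizes[name] = int(size)                      # |q|
        # Short rows (lower triangle) are fine: zip stops at the last count.
        common[name] = {s: int(c) for s, c in zip(db_names, counts)}
    return db_names, db_sizes, query_sizes, common
```

With the returned dictionaries, `common["q1"]["s2"]` is |q1 ∩ s2|, while `query_sizes["q1"]` and `db_sizes["s2"]` give the corresponding totals.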
For performance reasons, all2all mode produces a lower triangular matrix.
When the -sparse switch is specified, the table is stored in a sparse form. In particular, zeros are omitted while non-zero elements are represented as pairs (column_id: value) with 1-based column indexing. Thus, rows may have different numbers of elements, e.g.:
kmer-length: k fraction: f | db-samples | s1 | s2 | ... | sn |
query-samples | total-kmers | |s1| | |s2| | ... | |sn| |
q1 | |q1| | i11: |q1 ∩ si11| | i12: |q1 ∩ si12| |
q2 | |q2| | i21: |q2 ∩ si21| | i22: |q2 ∩ si22| | i23: |q2 ∩ si23| |
q3 | |q3| |
... | ... | ... |
qm | |qm| | im1: |qm ∩ sim1| |
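A sparse row can be expanded back into explicit zeros once its (column_id: value) pairs are read. The following sketch assumes the cells of a row have already been split on commas and that column_id i refers to the i-th database sample. Both helper names are hypothetical.

```python
def parse_sparse_row(cells):
    """Turn the cells of one sparse row into (name, total_kmers, {column_id: count})."""
    name, total, *pairs = [c.strip() for c in cells]
    counts = {}
    for pair in pairs:
        if not pair:  # skip empty cells, e.g. after a trailing comma
            continue
        column_id, value = pair.split(":")
        counts[int(column_id)] = int(value)  # column_id is 1-based
    return name, int(total), counts


def densify(counts, n_columns):
    """Expand {1-based column_id: count} into a list with explicit zeros."""
    return [counts.get(i, 0) for i in range(1, n_columns + 1)]
```

A row without any pairs, such as a query sharing no k-mers with the database, yields an empty dictionary and an all-zero dense row.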
2.3. Calculating similarities or distances
kmer-db distance [<measures>] [-sparse [-above <a_th>] [-below <b_th>]] <common_table>
Parameters:
- common_table (input) - file containing table with numbers of common k-mers produced by all2all, new2all, or one2all mode (both dense and sparse matrices are supported).
- measures - names of the similarity/distance measures to be calculated, can be one or several of the following (if not specified, jaccard is used):
  - jaccard: $J(q,s) = |q \cap s| / |q \cup s|$,
  - min: $\min(q,s) = |q \cap s| / \min(|q|,|s|)$,
  - max: $\max(q,s) = |q \cap s| / \max(|q|,|s|)$,
  - cosine: $\cos(q,s) = |q \cap s| / \sqrt{|q| \cdot |s|}$,
  - mash (Mash distance): $\textrm{Mash}(q,s) = -\frac{1}{k} \ln \frac{2 \cdot J(q,s)}{1 + J(q,s)}$,
  - ani (average nucleotide identity): $\textrm{ANI}(q,s) = 1 - \textrm{Mash}(q,s)$.
- -phylip-out - store output distance matrix in a Phylip format,
- -sparse - outputs a sparse matrix (independently of the input matrix format),
- -above <a_th> - retains elements larger than <a_th>,
- -below <b_th> - retains elements smaller than <b_th>.
This mode generates a file with a similarity/distance table for each selected measure. The name of each output file is formed by appending an extension with the measure name to the input file name.
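The measures can be reproduced by hand from a single table entry. The Python sketch below evaluates the formulas listed above from |q ∩ s|, |q|, |s|, and k, using |q ∪ s| = |q| + |s| - |q ∩ s|. The function name `measures` is hypothetical, both samples are assumed to be non-empty, and returning an infinite Mash distance when no k-mers are shared is an assumption of this sketch, not documented Kmer-db behavior.

```python
import math


def measures(common, q_size, s_size, k):
    """Compute the listed measures from k-mer counts of two non-empty samples."""
    union = q_size + s_size - common  # |q ∪ s| by inclusion-exclusion
    jaccard = common / union
    # ln(0) is undefined, so this sketch reports an infinite distance
    # when the samples share no k-mers (an assumption).
    if jaccard > 0:
        mash = -math.log(2 * jaccard / (1 + jaccard)) / k
    else:
        mash = math.inf
    return {
        "jaccard": jaccard,
        "min": common / min(q_size, s_size),
        "max": common / max(q_size, s_size),
        "cosine": common / math.sqrt(q_size * s_size),
        "mash": mash,
        "ani": 1 - mash,
    }
```

For example, two samples of 100 k-mers each that share 50 of them have J = 50 / 150 = 1/3, so for k = 18 the Mash distance is ln(2) / 18, roughly 0.0385.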
2.4. Storing minhashed k-mers
This is an optional analysis step which stores minhashed k-mers on the hard disk to be later consumed by build, new2all, or one2all modes with the -from-minhash switch. It can be skipped if one wants to use all k-mers from samples for distance estimation or employs minhashing during database construction. Syntax:
kmer-db minhash [-k <kmer-length>] [-multisample-fasta] <fraction> <sample_list>
kmer-db minhash -from-kmers <fraction> <sample_list>
Parameters:
- fraction (input) - fraction of all k-mers to be accepted by the minhash filter.
- sample_list (input) - file containing list of samples in one of the supported formats (see build mode).
- -k <kmer-length> - length of k-mers (default: 18; maximum: 30); ignored when the -from-kmers switch is specified.
- -multisample-fasta / -from-kmers - see build mode for details.
For each sample from the list, a binary file with .minhash extension containing filtered k-mers is created.
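To illustrate what a fraction-based filter of this kind does, the sketch below keeps a k-mer when its hash falls into the lowest `fraction` of the 64-bit range. This is an assumption-laden illustration, not Kmer-db's implementation: the hash function and k-mer encoding used by the tool are not described here, so BLAKE2b from Python's `hashlib` is substituted, and strand canonicalization is ignored. All function names are hypothetical.

```python
import hashlib


def kmers(sequence, k=18):
    """Yield every k-mer of a sequence (no canonical-strand handling)."""
    for i in range(len(sequence) - k + 1):
        yield sequence[i:i + k]


def passes_filter(kmer, fraction):
    """Keep a k-mer when its 64-bit hash lies in the lowest `fraction` of the range."""
    # BLAKE2b stands in for the tool's own hash function (an assumption).
    digest = hashlib.blake2b(kmer.encode(), digest_size=8).digest()
    return int.from_bytes(digest, "big") < fraction * 2**64


def minhash_filter(sequence, k, fraction):
    """Return the set of k-mers of `sequence` accepted by the filter."""
    return {km for km in kmers(sequence, k) if passes_filter(km, fraction)}
```

Because the decision depends only on the k-mer itself, a given k-mer is either kept in every sample or dropped in every sample, which is what makes counts of common k-mers comparable between samples filtered with the same fraction.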
3. Datasets
List of the pathogens investigated in Kmer-db study can be found here