cgmlst-dists
🐻⇔🐨 Calculate distance matrix from ChewBBACA cgMLST allele call tables
Quick Start
% cat test/boring.tab
FILE G1 G2 G3 G4 G5 G6
S1 1 INF-2 3 2 1 5
S2 1 1 1 1 NIPH 5
S3 1 2 3 4 1 3
S4 1 LNF 2 4 1 3
S5 1 2 ASM 2 1 3
S6 2 INF-8 3 PLOT3 PLOT5 3
% cgmlst-dists test/boring.tab > distances.tab
This is cgmlst-dists 0.4.0
Loaded 6 samples x 6 allele calls
Calulating distances... 100.00%
Done.
% cat distances.tab
S1 S2 S3 S4 S5 S6
S1 0 3 2 3 1 3
S2 3 0 4 3 3 4
S3 2 4 0 1 1 2
S4 3 3 1 0 1 2
S5 1 3 1 1 0 2
S6 3 4 2 2 2 0
Any allele calls that are not positive integers are converted to zero. The distance is the Hamming distance, but positions where either call is zero are excluded.
It works by replacing any alphabetic characters, and the strings PLOT5 and PLOT3, with spaces. It then converts the remaining tab-separated values to integers, ignoring negative signs. Anything weird is set to zero.
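The rules above can be sketched in a few lines of Python. This is an illustration only, not the tool's C source: the function names parse_call and distance are made up for this example, and it assumes a locus is skipped whenever either call is zero, which is consistent with the Quick Start output. Edge cases may be handled differently by the real program.

```python
import re

def parse_call(cell):
    """Turn one allele call into a non-negative integer; 0 marks a call to skip."""
    # Blank out the PLOT3/PLOT5 tokens first, then any remaining letters.
    cleaned = re.sub(r"PLOT[35]", " ", cell)
    cleaned = re.sub(r"[A-Za-z]", " ", cleaned)
    # Ignore minus signs, so "INF-2" is read as 2.
    cleaned = cleaned.replace("-", " ").strip()
    # Anything that is not a plain run of digits counts as "weird" and becomes 0.
    return int(cleaned) if re.fullmatch(r"[0-9]+", cleaned) else 0

def distance(a, b):
    """Hamming distance over loci where neither call is zero."""
    return sum(1 for x, y in zip(a, b) if x and y and x != y)

# Rows S1, S2 and S6 from test/boring.tab (allele columns only).
s1 = [parse_call(c) for c in "1 INF-2 3 2 1 5".split()]
s2 = [parse_call(c) for c in "1 1 1 1 NIPH 5".split()]
s6 = [parse_call(c) for c in "2 INF-8 3 PLOT3 PLOT5 3".split()]

print(s6)                # [2, 8, 3, 0, 0, 3]
print(distance(s1, s2))  # 3, matching the S1/S2 cell in distances.tab
print(distance(s1, s6))  # 3, matching the S6/S1 cell in distances.tab
```

Note that G5 is not counted for S1 versus S2 because NIPH becomes zero, which is why the distance is 3 rather than 4.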
Installation
cgmlst-dists is written in C and has no other dependencies.
Homebrew
brew install brewsci/bio/cgmlst-dists # COMING IN NOV 2020
Bioconda
conda install -c bioconda cgmlst-dists
Source
git clone https://github.com/tseemann/cgmlst-dists.git
cd cgmlst-dists
make
# run tests
make check
# optionally install to a specific location (default: /usr/local)
make PREFIX=/usr/local install
Options
cgmlst-dists -h (help)
SYNOPSIS
Pairwise CG-MLST distance matrix from allele call tables
USAGE
cgmlst-dists [options] chewbbaca.tab > distances.tsv
OPTIONS
-h Show this help
-v Print version and exit
-q Quiet mode; do not print progress information
-c Use comma instead of tab in output
-m N Output: 1=lower-tri 2=upper-tri 3=full [3]
-x N Stop calculating beyond this distance [9999]
URL
https://github.com/tseemann/cgmlst-dists
cgmlst-dists -v (version)
Prints the name and version separated by a space in standard Unix fashion.
cgmlst-dists 0.4.0
cgmlst-dists -q (quiet mode)
Don't print informational messages, only errors.
cgmlst-dists -c (CSV mode)
Use a comma instead of a tab in the output table.
cgmlst-dists -m N (output matrix format)
The output matrix is symmetric about the diagonal because dist(A,B)=dist(B,A). This means we only calculate half the matrix and mirror it. You can choose to output the lower triangle, upper triangle, or both:
- -m 1: lower triangle only
- -m 2: upper triangle only
- -m 3: both triangles / full matrix (default)
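As a rough Python illustration of the "calculate half and mirror" idea, the sketch below fills only the pairs below the diagonal and copies each value to its mirrored cell. The names hamming_nonzero, distance_matrix and select_triangle are hypothetical. How the real tool lays out the omitted half for -m 1 and -m 2 is not described here, so the sketch simply marks omitted cells with None and keeps the diagonal in both triangles; treat that as an assumption.

```python
def hamming_nonzero(a, b):
    """Count loci that differ, skipping any locus where either call is zero."""
    return sum(1 for x, y in zip(a, b) if x and y and x != y)

def distance_matrix(profiles, dist=hamming_nonzero):
    """Compute each pair once (j < i) and mirror it across the diagonal."""
    n = len(profiles)
    matrix = [[0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i):  # only pairs below the diagonal are computed
            matrix[i][j] = matrix[j][i] = dist(profiles[i], profiles[j])
    return matrix

def select_triangle(matrix, mode=3):
    """Keep the requested part: 1=lower, 2=upper, 3=full (None marks omitted cells)."""
    n = len(matrix)
    keep = {
        1: lambda i, j: j <= i,
        2: lambda i, j: j >= i,
        3: lambda i, j: True,
    }[mode]
    return [[matrix[i][j] if keep(i, j) else None for j in range(n)] for i in range(n)]

# Three toy profiles (already converted to integers, 0 = skipped call).
profiles = [[1, 2, 3], [1, 1, 3], [2, 2, 0]]
full = distance_matrix(profiles)
for row in select_triangle(full, mode=1):
    print(row)
```

With n samples this computes n(n-1)/2 distances instead of n squared, which is the saving the symmetry argument describes.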
cgmlst-dists -x N (short-circuit divergent pairs)
The slowest part of the algorithm is calculating the distance between two allele vectors. This option will stop comparing as soon as the distance (differences) exceeds -x, and return the distance as -x.
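A minimal Python sketch of this early exit follows. The function name capped_distance and the max_dist parameter are invented for illustration (max_dist plays the role of -x, with the documented default of 9999), and the zero-skipping rule is assumed to be the same as for the ordinary distance. It is not taken from the tool's C code.

```python
def capped_distance(a, b, max_dist=9999):
    """Count differing non-zero loci, giving up once the count exceeds max_dist."""
    diff = 0
    for x, y in zip(a, b):
        if x and y and x != y:
            diff += 1
            if diff > max_dist:
                # Stop scanning the remaining loci and report the cap itself.
                return max_dist
    return diff

# S1 and S2 from the Quick Start table, already converted to integers.
s1 = [1, 2, 3, 2, 1, 5]
s2 = [1, 1, 1, 1, 0, 5]

print(capped_distance(s1, s2))              # 3 (true distance, cap not reached)
print(capped_distance(s1, s2, max_dist=1))  # 1 (comparison stopped early)
```

A consequence of this design is that any value equal to the cap in the output means "at least this far apart", so the cap should sit above the largest distance you care about.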
Issues
Report bugs and make suggestions on the Issues page.