KmerStream
Streaming algorithm for computing kmer statistics for massive genomics datasets.
Installation
To compile, just type make.
Running
To see the usage, just type KmerStream:
KmerStream 1.1
Estimates occurrences of k-mers in fastq or fasta files and saves results
Usage: KmerStream [options] ... FASTQ files
-k, --kmer-size=INT Size of k-mers, either a single value or comma separated list
-q, --quality-cutoff=INT Comma separated list, keep k-mers with bases above quality threshold in PHRED (default 0)
-o, --output=STRING Filename for output
-e, --error-rate=FLOAT Error rate guaranteed (default value 0.01)
-t, --threads=INT Number of threads to use (default value 1)
-s, --seed=INT Seed value for the randomness (default value 0, use time based randomness)
-b, --bam Input is in BAM format (default false)
--binary Output is written in binary format (default false)
--tsv Output is written in TSV format (default false)
--verbose Print lots of messages during run
--online Prints out estimates every 100K reads
--q64 set if PHRED+64 scores are used (@...h) default used PHRED+33
Options:
-k: the k-mer size; this should be an integer or a comma-separated list of integers, e.g. -k 31 or -k 31,47,63. Odd values behave better than even values.
-q: optional quality cutoff values; all k-mers with bases under the q threshold are discarded.
-o: filename where the output should be written.
-e: guarantee on the error of the estimator used. The default value is 1%; lower values increase memory usage.
-t: number of threads to use.
-s: KmerStream uses random hash functions for computing the statistics. To fix the hash values for reproducibility, set the seed to a fixed value, e.g. -s 42.
-b: input is in BAM format.
--binary: write output in binary format. This includes the data necessary for running KmerStreamJoin. The output filename is used as a prefix, and the file containing the output is PREFIX+_Q_0_k_31.
--tsv: write output in TSV (tab-separated values) format for easier parsing.
--online: print estimates every 100K reads; see https://pmelsted.wordpress.com/2014/07/12/analyzing-data-while-downloading/ for example usage.
--q64: quality values are encoded in PHRED+64 format rather than the default PHRED+33; use this if your quality values range from @ to h rather than from ! to I.
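To make the interaction of -k, -q and --q64 concrete, here is a minimal Python sketch of which k-mers of a single read would survive a quality cutoff. It is an illustration only, not KmerStream's actual code. The function names are hypothetical, and it assumes a k-mer is kept when every one of its bases has a PHRED score greater than or equal to the cutoff (the description above does not say how a score exactly equal to the cutoff is treated).

```python
def phred_scores(quality, q64=False):
    """Decode a FASTQ quality string into integer PHRED scores.

    PHRED+33 maps '!' to 0 and 'I' to 40; PHRED+64 maps '@' to 0 and 'h' to 40.
    """
    offset = 64 if q64 else 33
    return [ord(ch) - offset for ch in quality]


def kmers_passing_cutoff(sequence, quality, k, cutoff=0, q64=False):
    """Return the k-mers of `sequence` whose bases all have score >= cutoff.

    Assumption: a score equal to the cutoff is kept.
    """
    if k < 1:
        raise ValueError("k must be a positive integer")
    if len(sequence) != len(quality):
        raise ValueError("sequence and quality string must have equal length")
    scores = phred_scores(quality, q64)
    kept = []
    for start in range(len(sequence) - k + 1):
        # Discard the k-mer if any base inside the window is below the cutoff.
        if min(scores[start:start + k]) >= cutoff:
            kept.append(sequence[start:start + k])
    return kept


if __name__ == "__main__":
    # The fifth base has quality '!' (PHRED 0), so every 3-mer covering it is dropped.
    print(kmers_passing_cutoff("ACGTAC", "IIII!I", k=3, cutoff=20))
```

With the default cutoff of 0, no k-mer is discarded in this sketch, which matches the documented default for -q.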
KmerStreamJoin
KmerStreamJoin 1.1
Creates union of many stream estimates
Usage: KmerStreamJoin -o output files ...
KmerStreamJoin merged-file
-o, --output=STRING Filename for output
--verbose Print output at the end
When run with the -o option, KmerStreamJoin takes a list of KmerStream binary output files (created with the --binary option to KmerStream) and creates a single binary output file that is equivalent to the result of a single KmerStream run on all of the input files. When the -o option is missing, it outputs the KmerStream result stored in the binary input file.
This utility is useful when the creation of the binary files is distributed, or when the estimates are computed incrementally.
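As a rough sketch of the distributed workflow described above, the following Python snippet builds the command lines for two separate KmerStream runs with --binary, followed by a KmerStreamJoin run that merges their outputs. It only prints the commands. The helper names, the input file names (partA.fastq, partB.fastq) and the output name merged are hypothetical; the binary file names follow the PREFIX+_Q_0_k_31 example given for -k 31 with the default quality cutoff.

```python
def kmerstream_binary_command(fastq, prefix, k=31):
    """Build the argument list for one KmerStream run that writes binary output."""
    return ["KmerStream", "-k", str(k), "--binary", "-o", prefix, fastq]


def join_command(output, binary_files):
    """Build the argument list for merging binary outputs with KmerStreamJoin."""
    if not binary_files:
        raise ValueError("at least one binary input file is required")
    return ["KmerStreamJoin", "-o", output, *binary_files]


if __name__ == "__main__":
    # Hypothetical inputs: each part could be processed on a different machine.
    parts = {"partA": "partA.fastq", "partB": "partB.fastq"}
    commands = [kmerstream_binary_command(fq, prefix) for prefix, fq in parts.items()]
    # With -k 31 and the default cutoff, the binary file is PREFIX + "_Q_0_k_31".
    binary_files = [prefix + "_Q_0_k_31" for prefix in parts]
    commands.append(join_command("merged", binary_files))
    for cmd in commands:
        print(" ".join(cmd))
```

Each list can be passed to subprocess.run(cmd, check=True) once the KmerStream and KmerStreamJoin binaries are on the PATH.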
KmerStreamEstimate.py
KmerStreamEstimate is a Python script that reads a TSV file as input (generated by KmerStream with the --tsv option) and estimates the genome size (G), error rate (e), and coverage (lambda).