pa-bench
Benchmarking pairwise aligners
Pairwise Alignment Benchmarks
This repository contains a few things:
- `pa-wrapper`: a wrapper library around pairwise aligners;
- `pa-bin`: a unified command line tool to call these aligners;
- `pa-bench`: a tool to benchmark aligners against each other;
- `evals/astarpa`: experiments and analysis for A*PA;
- `evals/astarpa2`: experiments and analysis for A*PA2.
pa-wrapper: Wrapper API
pa-wrapper contains a unified API to a number of aligners:
A*PA, A*PA2, Block Aligner, Edlib, Ksw2, Parasail, Triple Accel, [Bi]Wfa.
Create an AlignerParams object and call
build_aligner() on it to obtain an instance of an aligner, on which .align()
can be called repeatedly.
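The build-once, align-many pattern described above can be sketched as follows. This is a self-contained toy and not the actual `pa-wrapper` API: `ToyParams`, `ToyAligner`, and the signature of `align` are hypothetical stand-ins, and the only backend is a plain unit-cost edit distance.

```rust
/// Hypothetical stand-in for an aligner behind a uniform interface.
trait ToyAligner {
    /// Returns the alignment cost of `a` and `b`.
    fn align(&mut self, a: &[u8], b: &[u8]) -> u32;
}

/// Hypothetical parameter enum; each variant selects one backend.
enum ToyParams {
    EditDistance,
}

impl ToyParams {
    /// Builds a reusable aligner instance from the parameters.
    fn build_aligner(&self) -> Box<dyn ToyAligner> {
        match self {
            ToyParams::EditDistance => Box::new(EditDistance { row: Vec::new() }),
        }
    }
}

/// Unit-cost edit distance with a single DP row that is reused across calls.
struct EditDistance {
    row: Vec<u32>,
}

impl ToyAligner for EditDistance {
    fn align(&mut self, a: &[u8], b: &[u8]) -> u32 {
        // row[j] = cost of aligning the processed prefix of `a` with b[..j].
        self.row.clear();
        self.row.extend(0..=b.len() as u32);
        for (i, &ca) in a.iter().enumerate() {
            // `diag` holds the previous row's value at column j.
            let mut diag = self.row[0];
            self.row[0] = i as u32 + 1;
            for (j, &cb) in b.iter().enumerate() {
                let up = self.row[j + 1];
                let sub = diag + u32::from(ca != cb);
                self.row[j + 1] = sub.min(up + 1).min(self.row[j] + 1);
                diag = up;
            }
        }
        self.row[b.len()]
    }
}
```

With this toy, `ToyParams::EditDistance.build_aligner()` returns one instance whose `align` can be called on many sequence pairs, so any internal buffers are allocated once rather than per pair.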
Adding an aligner
To add an aligner, update `pa-wrapper/Cargo.toml` and `pa-wrapper/src/lib.rs`, and add a new file in `pa-wrapper/src/wrappers/`.
pa-bin: Unified binary
Use cargo run --bin pa-bin -- <arguments> input/file/or/dir.{txt,seq,fa} to run any of the supported aligners
on some input.
Succinct help of pa-bin (see --help for more):
CLI tool that wraps other aligners and runs them on the given input
Usage: pa-bin [OPTIONS] <--aligner <ALIGNER>|--params <PARAMS>|--params-file <PATH>|--print-params <ALIGNER>> [INPUT] [OUTPUT]
Arguments:
[INPUT] (Directory of) .seq, .txt, or Fasta files with sequence pairs to align
[OUTPUT] Write a .csv of `{cost},{cigar}` lines. Defaults to input file with .csv extension
Options:
--cost-only Return only cost (no traceback)
--silent Do not print anything to stderr
-h, --help Print help (see more with '--help')
Aligner:
--aligner <ALIGNER> The aligner to use with default parameters [possible values: astar-nw, astar-pa,
block-aligner, edlib, ksw2, triple-accel, wfa]
--params <PARAMS> Yaml/json string of aligner parameters
--params-file <PATH> File with aligner parameters
--print-params <ALIGNER> Print default parameters for the given aligner [possible values: astar-nw, astar-pa,
block-aligner, edlib, ksw2, triple-accel, wfa]
--json The parameters are json instead of yaml
Cost model:
--sub <COST> Substitution cost, (> 0) [default: 1]
--open <COST> Gap open cost (>= 0) [default: 0]
--extend <COST> Gap extend cost (> 0) [default: 1]
The aligner to run can be specified with --aligner <ALIGNER> for default
arguments, or --params[-file] to read a (yaml or json) string of parameters
(from a file). Use --print-params <ALIGNER> to get default parameters that can
be modified.
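To post-process the `.csv` output, each `{cost},{cigar}` line can be split at the first comma. The sketch below is a minimal parser under two assumptions that are not confirmed by the help text: the cost is an integer, and with `--cost-only` the cigar part is empty or absent. The name `parse_result_line` is hypothetical.

```rust
/// Parses one `{cost},{cigar}` line.
/// Assumption: with `--cost-only`, the cigar part is empty or missing.
fn parse_result_line(line: &str) -> Option<(i64, Option<&str>)> {
    let line = line.trim();
    let (cost, cigar) = match line.split_once(',') {
        // Treat an empty cigar as "no traceback available".
        Some((cost, cigar)) => (cost, Some(cigar).filter(|c| !c.is_empty())),
        None => (line, None),
    };
    // Returns `None` when the cost is not a valid integer.
    Some((cost.trim().parse().ok()?, cigar))
}
```

For example, `parse_result_line("3,4=1X2=")` yields `Some((3, Some("4=1X2=")))`, while a line containing only `7` yields `Some((7, None))`.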
pa-bench: Benchmarking
For benchmarking, see input format, usage, and quick start below.
Benchmarking is done using jobs. Each job consists of an input dataset (a
.seq file), a cost model, and a tool with parameters.
The pa-bench binary calls itself (recursively) for each job to measure time
and memory usage.
An experiment is specified by a yaml input configuration file that lists the
jobs to run.
Results are incrementally accumulated in a json results file.
Quick start
The easiest way to get started is probably to first clone (and fork) the repository. Then, you can copy either:
- The `evals/astarpa` directory with all experiments (`*.yaml`) and analysis/plots (`evals.ipynb`) used in the A*PA paper.
- The `evals/astarpa-next` directory that specifically tests new versions of A*PA on some datasets of ultra long ONT reads of human data. This contains the code to plot boxplots+swarmplots of the distribution of runtimes on a dataset.
- Or you can modify/add experiments to `evals/experiments/` and use `evals/evals.ipynb`.
If you think your experiments, analysis, and/or plots are generally useful and interesting, feel free to make a PR to add them here.
Benchmarking features
Main settings
- Time limit: Use `--time-limit 1h` to limit each run to `1` hour using `ulimit`.
- Memory: Use `--mem-limit 1GiB` to limit each run to `1GiB` of total memory using `ulimit`.
- Nice: Use `--nice=-20` to increase the priority of each runner job. This requires root. (See the end of this file.)
- Parallel running: Use `-j 10` to run `10` jobs in parallel. Each job is pinned to a different core.
- Pinning: By default, each job is fixed to run on a single core. This doesn't work on all OSes and can crash/`Panic` the runner. Use `--no-pin` to avoid this.
- Incremental running: By default, job results already present in the target `json` file are reused. With `--rerun-failed`, failed jobs are retried, and with `--rerun-all`, all jobs are rerun. `--clean` completely removes the cache.
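The incremental-running rules can be summarised as a small decision function. This is only a sketch of the behaviour described in the list, not the code used by `pa-bench`; the `Cached` enum and `must_run` function are hypothetical names.

```rust
/// Hypothetical outcome of a previously run job, as found in the results file or cache.
#[derive(Clone, Copy, Debug, PartialEq)]
enum Cached {
    Ok,
    Failed,
}

/// Decides whether a job has to be (re)run, given its cached outcome and the flags.
fn must_run(cached: Option<Cached>, rerun_failed: bool, rerun_all: bool) -> bool {
    match cached {
        // Never seen before: always run.
        None => true,
        // `--rerun-all` ignores any existing result.
        Some(_) if rerun_all => true,
        // Failed results are reused unless `--rerun-failed` is given.
        Some(Cached::Failed) => rerun_failed,
        // Successful results are reused.
        Some(Cached::Ok) => false,
    }
}
```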
Debugging failing jobs
- To see which jobs are being run, use `--verbose`.
- To see the error output of runners, use `--stderr`. This should be the first thing to do to figure out why jobs are failing.
Output
Output is written to a json file and to a cache that can be reused across
experiments. The following is recorded for each job:
- Runtime of processing input pairs, excluding startup and file io time.
- Maximum memory usage (max rss), excluding the memory usage of the input data.
- Start and end time of job, for logging purposes.
- CPU frequency at start and end of job, as a sanity check.
Other
- Skipping: When a job fails, all larger jobs (larger `n` or `e`) are automatically skipped.
- Interrupting: You can interrupt a run at any time with `ctrl-C`. This will stop ongoing jobs and write results so far to disk.
- Cigar checking: When traceback is enabled, all Cigar strings are checked to see whether they are valid and have the right cost.
- Cost checking: The cost returned by exact aligners is cross-validated. For inexact aligners, the fraction of correct results is computed.
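The kind of check meant by "Cigar checking" can be illustrated with a small validator. This is a sketch under stated assumptions, not the checker used by `pa-bench`: it assumes a cigar written as `<count><op>` pairs with operations `=` (match), `X` (mismatch), `I` (consumes only the second sequence) and `D` (consumes only the first), and it assumes that a gap of length `k` costs `open + k * extend`. The names `Costs` and `cigar_cost` are hypothetical.

```rust
/// Cost model with the three parameters `sub`, `open`, and `extend`.
struct Costs {
    sub: u32,
    open: u32,
    extend: u32,
}

/// Checks that `cigar` is a valid alignment of `a` and `b` and returns its cost.
fn cigar_cost(a: &[u8], b: &[u8], cigar: &str, c: &Costs) -> Result<u32, String> {
    let (mut i, mut j, mut cost) = (0usize, 0usize, 0u32);
    let mut count = 0usize;
    let mut prev_op = ' ';
    for ch in cigar.chars() {
        // Accumulate the decimal count that precedes each operation.
        if let Some(d) = ch.to_digit(10) {
            count = count * 10 + d as usize;
            continue;
        }
        if count == 0 {
            return Err(format!("operation '{ch}' has no positive count"));
        }
        match ch {
            '=' | 'X' => {
                if i + count > a.len() || j + count > b.len() {
                    return Err("cigar runs past the end of a sequence".into());
                }
                // Every `=` must be a match and every `X` a mismatch.
                for k in 0..count {
                    if (a[i + k] == b[j + k]) != (ch == '=') {
                        return Err(format!("wrong '{ch}' at position ({}, {})", i + k, j + k));
                    }
                }
                i += count;
                j += count;
                if ch == 'X' {
                    cost += c.sub * count as u32;
                }
            }
            'I' | 'D' => {
                if ch == 'I' { j += count } else { i += count }
                // Charge the open cost once per maximal gap.
                if prev_op != ch {
                    cost += c.open;
                }
                cost += c.extend * count as u32;
            }
            _ => return Err(format!("unknown operation '{ch}'")),
        }
        prev_op = ch;
        count = 0;
    }
    if count != 0 {
        return Err("trailing count without operation".into());
    }
    if i != a.len() || j != b.len() {
        return Err("cigar does not cover both sequences".into());
    }
    Ok(cost)
}
```

A reported result would then pass the check when `cigar_cost` returns `Ok(cost)` with `cost` equal to the cost reported by the aligner.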
Input format
The input is specified as a yaml file containing:
- datasets: file paths or settings to generate datasets;
- traces: whether each tool computes a path or only the edit distance;
- costs: the cost models to run all aligners on;
- algos: the algorithms (aligners with parameters) to use.
A job is created for each combination of the 4 lists.
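The job list is thus a Cartesian product. The sketch below enumerates it from the lengths of the four lists; the `Job` struct with index fields and the `all_jobs` function are hypothetical simplifications, and the real tool may expand a single dataset entry (such as a directory or a generated set) into several datasets first.

```rust
/// Hypothetical, simplified job description: indices into the four lists.
#[derive(Debug, PartialEq)]
struct Job {
    dataset: usize,
    trace: usize,
    cost: usize,
    algo: usize,
}

/// Enumerates one job per combination of the four lists, given their lengths.
fn all_jobs(datasets: usize, traces: usize, costs: usize, algos: usize) -> Vec<Job> {
    let mut jobs = Vec::with_capacity(datasets * traces * costs * algos);
    for dataset in 0..datasets {
        for trace in 0..traces {
            for cost in 0..costs {
                for algo in 0..algos {
                    jobs.push(Job { dataset, trace, cost, algo });
                }
            }
        }
    }
    jobs
}
```

For instance, 3 datasets, 2 trace settings, 2 cost models, and 7 algorithms give `3 * 2 * 2 * 7 = 84` jobs.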
Examples can be found in evals/experiments/. Here is one:
datasets:
  # Hardcoded data
  - !Data
    - - CGCTGGCTGCTGCCACTAACTCCGTATAGTCTCACCAAGT
      - CGCTGGCTCGCCTGCCACGTAACTCCGTATAGTCTCACCAACTGTCAGTT
    - - AACCAGGGTACACCGACTAATCCACGCACAAGTTGGGGTC
      - ACAGGTACACCACTATCACGACAAGTTGGGTC
  # Path to a single .seq file, relative to `evals/data`
  - !Path path/to/sequences.seq
  # Recursively finds all non-hidden .seq files in a directory, relative to `evals/data`
  - !Path path/to/directory
  # Download and extract a zip file containing .seq files to `evals/data/download/{dir}`
  - !Download
    url: https://github.com/pairwise-alignment/pa-bench/releases/download/datasets/ont-500k.zip
    dir: ont-500k
  # Generated data in `evals/data/generated/`
  - !Generated
    # Seed for the RNG.
    seed: 31415
    # The approximate total length of the input sequences.
    total_size: 100000
    # The error models to use. See pa-generate crate for more info:
    # https://github.com/pairwise-alignment/pa-generate
    error_models:
      # Uniform, NoisyInsert, NoisyDelete, NoisyMove, NoisyDuplicate, SymmetricRepeat
      - Uniform
    error_rates: [0.01, 0.05, 0.1, 0.1]
    lengths: [100, 1000, 10000, 100000]
# Run both with and without traces
traces: [false, true]
costs:
  # unit costs
  - { sub: 1, open: 0, extend: 1 }
  # affine costs
  - { sub: 1, open: 1, extend: 1 }
algos:
  - !BlockAligner
    size: !Size [32, 8192]
  - !ParasailStriped
  - !Edlib
  - !TripleAccel
  - !Wfa
    memory_model: !MemoryUltraLow
    heuristic: !None
  - !Ksw2
    method: !GlobalSuzukiSse
    band_doubling: false
  - !AstarPa
Usage
- Clone this repo and make sure you have Rust installed.
- Run `cargo run --release -- [--release] evals/experiments/test.yaml` from the root.
- In case of errors, add `--verbose` to see which jobs are being run, and/or `--stderr` to see the output of failing (`Result: Err(Panic)`) jobs. For non-Linux OSes, you may need to add `--no-pin` to disable pinning to specific cores.
First, this will generate/download required input data files in evals/data.
Results are written to evals/results/test.json and a cache of all (outdated)
jobs for the current experiment is stored in evals/results/test.cache.json or
at the provided --cache.
Succinct help of pa-bench (see --help for more):
Usage: pa-bench bench [OPTIONS] [EXPERIMENTS]...
Arguments:
[EXPERIMENTS]... Path to an experiment yaml file
Options:
-o, --output <OUTPUT> Path to the output json file. By default mirrors the `experiments` dir in `results`
--cache <CACHE> Shared cache of JobResults. Default: <experiment>.cache.json
--no-cache Completely disable using a cache
-j <NUM_JOBS> Number of parallel jobs to use [default: 5]
--rerun-all Ignore job cache, i.e. rerun jobs already present in the results file
--rerun-failed Rerun failed jobs that are otherwise reused
--release Shorthand for '-j1 --nice=-20'
-h, --help Print help (see more with '--help')
Limits:
-t, --time-limit <TIME_LIMIT> Time limit. Defaults to value in experiment yaml or 1m
-m, --mem-limit <MEM_LIMIT> Memory limit. Defaults to value in experiment yaml or 1GiB
--nice <NICE> Process niceness. '--nice=-20' for highest priority
--no-pin Disable pinning, which may not work on OSX
Output:
-v, --verbose Print jobs started and finished
--stderr Show stderr of runner process
Notes on benchmarking
Niceness.
Changing niceness to -20 (the highest priority) requires running pa-bench as root. Alternatively, you could add the following line to
/etc/security/limits.conf to allow your user to use lower niceness values:
<username> - nice -20
Pinning.
Pinning jobs to cores probably only works on Linux. On other systems, benchmarking
will crash and report Result: Err(Panic). Use --no-pin to avoid this.
CPU Settings. Make sure to
- fix the cpu frequency using `cpupower frequency-set -d 2.6GHz -u 2.6GHz -g powersave` (`powersave` can give more consistent results than `performance`),
- disable hyperthreading,
- disable turbo-boost,
- disable power saving,
- keep the laptop fully charged and connected to power.
Datasets
Datasets are available in the datasets release.
Code layout
From low-level to higher, the following crates are relevant:
- `pa-types`: Basic pairwise alignment types such as `Seq`, `Pos`, `Cost` and `Cigar`.
- `pa-generate`: A utility to generate sequence pairs with various kinds of error types.
- `pa-wrapper` contains an `AlignerTrait` and implements this uniform interface for all aligners. Each aligner is behind a feature flag. Parasailors is disabled by default to reduce the otherwise large build time.
- `pa-bin` is a thin binary/CLI around `pa-wrapper`.
- `pa-bench-types` contains the definition of an `Experiment`, `Dataset`, `Job`, `JobResult`, and the `AlgorithmParams` enum that selects the algorithm to run and its parameters. This causes `pa-bench-types` to have dependencies on crates that contain aligner-specific parameter types.
- `pa-bench` contains a binary that collects all jobs in an experiment and calls itself once per job.