hts-nim-tools icon indicating copy to clipboard operation
hts-nim-tools copied to clipboard

example tools

Open brentp opened this issue 7 years ago • 12 comments
trafficstars

It would be nice to have a suite of example tools to further show off the syntax and utility of hts-nim. Please leave a comment here with a description.

The strengths of hts-nim are speed, so things that are too slow in python would be good candidates. It has very good syntax for VCF and BAM.

brentp avatar May 01 '18 15:05 brentp

a VCF QC tool that takes a multi-sample VCF and finds big contiguous chunks where a sample has all missing genotypes could indicate that a processing chunk (e.g. from a single-sample GVCF) was lost. This could output a plotly plot showing the rate of missingness across each chrom for each sample.

brentp avatar May 01 '18 15:05 brentp

VCF-VCF comparison, to determine concordance (given two orthogonal VCFs) or sens/spec (given a truth set).

jdidion avatar May 02 '18 00:05 jdidion

A pileup engine! I'll admit that doesn't really count as a tool but would enable the implementation of many.

andreas-wilm avatar May 02 '18 05:05 andreas-wilm

These are good ideas--though maybe more than bite-sized. :) I'll be writing a pileup engine soonish. I like the idea of VCF-to-VCF comparison, but there are a lot of considerations for that -- does it do normalization, does it do each sample? etc? I'll have to thnk more about a common use-case

brentp avatar May 02 '18 13:05 brentp

Here's one that I ported over from a bcftools plugin I wrote a while ago. I'm sure there are a few things that I should re-optimize for nim.

tabulate.nim bins and tabulates integer data from the genotype column in a VCF, like DP, MQ, etc.

See also tabulate.c, the bcftools plugin.

dmckean avatar May 09 '18 22:05 dmckean

A sambamba slice, pretty please? :D

brainstorm avatar Jul 12 '18 02:07 brainstorm

@brainstorm what would it do that sambamba slice does not?

I guess there could be an API-ish way to access uncompressed blocks.

brentp avatar Jul 12 '18 15:07 brentp

@brentp Sambamba is in a delicate spot when it comes to release engineering/CI/CD and overall reliability right now. slice is on 0.6.7, unreleasable for several platforms automatically in bioconda at the moment of writing this because it has several underlying issues (TL;DR: rabbit hole of fixes/updates):

https://twitter.com/braincode/status/1017216491265986560

I just wanted to have something functional/deployable to drop into our pipes ;)

brainstorm avatar Jul 13 '18 08:07 brainstorm

I'll have a look. Not sure if htslib even exposes this--I think sambamba could do it because it has its own wrappers.

brentp avatar Jul 13 '18 14:07 brentp

@brainstorm I was crawling through source for bcftools concat --naive when I remembered this thread.

bcftools/vcfconcat.c::naive_concat checks each BGZF block header ( at least to get the block size) and copies each block w/out decompressing. EOF blocks for each file are omitted, a single EOF block is placed at close.

Bcftools' naive concatenation trusts that all input VCFs have compatible headers and writes out the first file's VCF header without checks. Naturally all input must be gzipped, but all input files must either be BCF or VCF.

@brentp I don't think that htslib provides an API for this beyond bgzf_raw_read, bgzf_raw_write, and other functions in htslib/bgzf.h and htslib/hfile.h, at least for the block IO. Do you thinks those two functions would be useful enough to be worth adding to hts_concat.h and bgzf.nim?

I'm not certain if htslib exposes block positions in the result from a hts_itr_query call, but that function calls a lot of helpers that may do most of the legwork for helping get a list of blocks for a queried region.

dmckean avatar Aug 08 '18 18:08 dmckean

yes, if those are in the exposed api (in the header) files, then they can be added to hts_concat.h, if you or someone else wants to write an example or an hts-nim api on how to use the raw_read/write to get uncompressed blocks from a given region then it would be a great addtion to bgzf.nim.

brentp avatar Aug 09 '18 02:08 brentp

Some other bigger ideas but I may try to tackle a few:

  1. Replacement for picard MarkDuplicates. This is a huge bottleneck for our pipeline. I may try to tackle.
  2. window - generates regions from a sam / vcf for piping parallelizing in programs (e.g. parallelizing variant calling by chrom or region). I'm planning on adding this to seq-collection. Very useful with GNU-parallel.
  3. Extracting reads that map to different contigs.
  4. Extract read pairs that overlapr; Could you use this to estimate errors/artifacts?

danielecook avatar Nov 14 '19 17:11 danielecook