vcfpp
vcfpp copied to clipboard
a C++ API of htslib to be easily integrated and safely used. More importantly, it can be callled seamlessly in R/Python/Julia etc.
#+TITLE: vcfpp: a single C++ file for manipulating VCF/BCF
[[https://github.com/Zilong-Li/vcfpp/actions/workflows/linux.yml/badge.svg]] [[https://github.com/Zilong-Li/vcfpp/actions/workflows/mac.yml/badge.svg]] [[https://zilongli.org/proj/vcfpp/index.html][https://img.shields.io/badge/Documentation-latest-blue.svg]] [[https://github.com/Zilong-Li/vcfpp/releases/latest][https://img.shields.io/github/v/release/Zilong-Li/vcfpp.svg]] [[https://img.shields.io/github/license/Zilong-Li/vcfpp?style=plastic.svg]]
This project introduces vcfpp (vcf plus plus), a single C++ file as interface to the basic =htslib=, which can be easily included in a C++ program for scripting high-performance genomic analyses.
Features:
- single file to be easily included and compiled
- easy and safe [[https://zilongli.org/proj/vcfpp/index.html][API]] to use.
- objects are RAII. no worry about allocate and free memory.
- has the full functionalities of the =htslib=, eg. supports of compressed VCF/BCF and URL link as filename.
- compatible with C++11 and later
- Table of Contents :TOC:QUOTE: #+BEGIN_QUOTE
- [[#installation][Installation]]
- [[#usage][Usage]]
- [[#reading-vcf][Reading VCF]]
- [[#genotypes-coding][Genotypes Coding]]
- [[#writing-vcf][Writing VCF]]
- [[#variants-operation][Variants Operation]]
- [[#header-operation][Header Operation]]
- [[#working-with-r][Working with R]]
- [[#working-with-python][Working with Python]]
- [[#command-line-tools][Command Line Tools]] #+END_QUOTE
- Installation
- install [[https://github.com/samtools/htslib][htslib]] on your system
- download the released [[https://github.com/Zilong-Li/vcfpp/releases/latest][vcfpp.h]]
- put =vcfpp.h= in the same folder as your cpp source file or a folder say =/my/writable/path/= or the system path
- Usage
The documentation of API is [[https://zilongli.org/proj/vcfpp/index.html][here.]]
** Reading VCF
In this example, we count the number of heterozygous genotypes for each sample in all records. You can paste the example code into a =example.cpp= file and compile it by =g++ example.cpp -std=c++11 -O3 -Wall -I. -lhts=. You can replace =-I.= with =-I/my/writable/path/= if you put =vcfpp.h= there.
#+begin_src C++
#include "vcfpp.h"
using namespace std;
using namespace vcfpp;
int main(int argc, char* argv[])
{
// read data from 1000 genomes server
BcfReader vcf("https://ftp.1000genomes.ebi.ac.uk/vol1/ftp/data_collections/1000G_2504_high_coverage/working/20220422_3202_phased_SNV_INDEL_SV/1kGP_high_coverage_Illumina.chr21.filtered.SNV_INDEL_SV_phased_panel.vcf.gz");
BcfRecord var(vcf.header); // construct a variant record
vector
** Genotypes Coding
There are 3 types used for genotypes, ie. =vector
*** Code genotypes with missing allele as heterozygous.
If you use =vector
*** Code missing allele as -9
If this default behavior for =vector
** Writing VCF
There are many ways in vcfpp for writing the VCF/BCF file.
*** Use an empty template
Here we construct an initial BCF with header using VCF4.3 specification. Next we add meta data in the header and write out variant record given a string.
#+begin_src C++ BcfWriter bw("out.bcf.gz", "VCF4.3"); bw.header.addFORMAT("GT", "1", "String", "Genotype"); bw.header.addINFO("AF", "A", "Float", "Estimated allele frequency in the range (0,1)"); bw.header.addContig("chr20"); // add chromosome for (auto& s : {"id01", "id02", "id03"}) bw.header.addSample(s); // add 3 samples bw.writeLine("chr20\t2006060\trs146931526\tG\tC\t100\tPASS\tAF=0.000998403\tGT\t1|0\t1|1\t0|0"); #+end_src
*** Use another VCF as template
In this example, we first read VCF file =test/test-vcf-read.vcf.gz=. Secondly, we construct an empty variant record and update the record with the input VCF. Thirdly, we construct a BcfWriter object using the meta data in the header of the input VCF, writing out the header and the modified variant record.
#+begin_src C++ BcfReader br("test/test-vcf-read.vcf.gz"); BcfRecord var(br.header); br.getNextVariant(var); BcfWriter bw("out.vcf.gz", br.header); bw.writeHeader(); var.setPOS(100001); // update the POS of the variant bw.writeRecord(var); #+end_src
** Variants Operation
All variants related API can be found [[https://zilongli.org/proj/vcfpp/classvcfpp_1_1_bcf_record][BcfRecord]]. The commonly used are listed below.
#+begin_src C++
BcfReader vcf("bcf.gz"); // construct a vcf reader
BcfRecord var(vcf.header); // construct an empty variant record associated with vcf header
vcf.getNextVariant(var) // get next variant
vector
** Header Operation
All variants related API can be found in [[https://zilongli.org/proj/vcfpp/classvcfpp_1_1_bcf_header][BcfHeader]].
- Working with R
Examples of vcfpp working with R are in folder [[Rcpp]] and https://github.com/Zilong-Li/vcfppR.
- Working with Python
Examples of vcfpp working with Python are in folder [[Pybind11]].
- Command Line Tools
Find more useful command line tools in folder [[tools]].