vcfpp icon indicating copy to clipboard operation
vcfpp copied to clipboard

a C++ API of htslib to be easily integrated and safely used. More importantly, it can be callled seamlessly in R/Python/Julia etc.

#+TITLE: vcfpp: a single C++ file for manipulating VCF/BCF

[[https://github.com/Zilong-Li/vcfpp/actions/workflows/linux.yml/badge.svg]] [[https://github.com/Zilong-Li/vcfpp/actions/workflows/mac.yml/badge.svg]] [[https://zilongli.org/proj/vcfpp/index.html][https://img.shields.io/badge/Documentation-latest-blue.svg]] [[https://github.com/Zilong-Li/vcfpp/releases/latest][https://img.shields.io/github/v/release/Zilong-Li/vcfpp.svg]] [[https://img.shields.io/github/license/Zilong-Li/vcfpp?style=plastic.svg]]

This project introduces vcfpp (vcf plus plus), a single C++ file as interface to the basic =htslib=, which can be easily included in a C++ program for scripting high-performance genomic analyses.

Features:

  • single file to be easily included and compiled
  • easy and safe [[https://zilongli.org/proj/vcfpp/index.html][API]] to use.
  • objects are RAII. no worry about allocate and free memory.
  • has the full functionalities of the =htslib=, eg. supports of compressed VCF/BCF and URL link as filename.
  • compatible with C++11 and later
  • Table of Contents :TOC:QUOTE: #+BEGIN_QUOTE
  • [[#installation][Installation]]
  • [[#usage][Usage]]
    • [[#reading-vcf][Reading VCF]]
    • [[#genotypes-coding][Genotypes Coding]]
    • [[#writing-vcf][Writing VCF]]
    • [[#variants-operation][Variants Operation]]
    • [[#header-operation][Header Operation]]
  • [[#working-with-r][Working with R]]
  • [[#working-with-python][Working with Python]]
  • [[#command-line-tools][Command Line Tools]] #+END_QUOTE
  • Installation
  1. install [[https://github.com/samtools/htslib][htslib]] on your system
  2. download the released [[https://github.com/Zilong-Li/vcfpp/releases/latest][vcfpp.h]]
  3. put =vcfpp.h= in the same folder as your cpp source file or a folder say =/my/writable/path/= or the system path
  • Usage

The documentation of API is [[https://zilongli.org/proj/vcfpp/index.html][here.]]

** Reading VCF

In this example, we count the number of heterozygous genotypes for each sample in all records. You can paste the example code into a =example.cpp= file and compile it by =g++ example.cpp -std=c++11 -O3 -Wall -I. -lhts=. You can replace =-I.= with =-I/my/writable/path/= if you put =vcfpp.h= there.

#+begin_src C++ #include "vcfpp.h" using namespace std; using namespace vcfpp; int main(int argc, char* argv[]) { // read data from 1000 genomes server BcfReader vcf("https://ftp.1000genomes.ebi.ac.uk/vol1/ftp/data_collections/1000G_2504_high_coverage/working/20220422_3202_phased_SNV_INDEL_SV/1kGP_high_coverage_Illumina.chr21.filtered.SNV_INDEL_SV_phased_panel.vcf.gz"); BcfRecord var(vcf.header); // construct a variant record vector gt; // genotype can be bool, char or int type vector hetsum(vcf.nsamples, 0); while (vcf.getNextVariant(var)) { var.getGenotypes(gt); if (!var.isSNP() || !var.isNoneMissing()) continue; assert(var.ploidy()==2); // make sure it is diploidy for(int i = 0; i < gt.size() / 2; i++) hetsum[i] += abs(gt[2 * i + 0] - gt[2 * i + 1]); } for (auto i : hetsum) { cout << i << endl; } return 0; } #+end_src

** Genotypes Coding

There are 3 types used for genotypes, ie. =vector=, =vector= and =vector=. One can use =vector= and =vector= for memory-effcient goal. The downside is that it only stores 0 and 1. And =vector= can store missing values and multialleles.

*** Code genotypes with missing allele as heterozygous.

If you use =vector= and =vector= to store the genotypes, then there is no way to represent missing values. Hence the returned genotypes always have 0s and 1s. And genotypes with missing allele (eg. =0/.=, =./0=, =1/.=, =./1=, =./.=) are codes as =1/0=. It's recommended to use =var.isNoneMissing()= to check if there is missing value.

*** Code missing allele as -9

If this default behavior for =vector= and =vector= is not what you want, you should use =vector= to store the genotypes, then any missing allele will be coded as =-9=. Note you should take the missing value =-9= into account for downstream analysis.

** Writing VCF

There are many ways in vcfpp for writing the VCF/BCF file.

*** Use an empty template

Here we construct an initial BCF with header using VCF4.3 specification. Next we add meta data in the header and write out variant record given a string.

#+begin_src C++ BcfWriter bw("out.bcf.gz", "VCF4.3"); bw.header.addFORMAT("GT", "1", "String", "Genotype"); bw.header.addINFO("AF", "A", "Float", "Estimated allele frequency in the range (0,1)"); bw.header.addContig("chr20"); // add chromosome for (auto& s : {"id01", "id02", "id03"}) bw.header.addSample(s); // add 3 samples bw.writeLine("chr20\t2006060\trs146931526\tG\tC\t100\tPASS\tAF=0.000998403\tGT\t1|0\t1|1\t0|0"); #+end_src

*** Use another VCF as template

In this example, we first read VCF file =test/test-vcf-read.vcf.gz=. Secondly, we construct an empty variant record and update the record with the input VCF. Thirdly, we construct a BcfWriter object using the meta data in the header of the input VCF, writing out the header and the modified variant record.

#+begin_src C++ BcfReader br("test/test-vcf-read.vcf.gz"); BcfRecord var(br.header); br.getNextVariant(var); BcfWriter bw("out.vcf.gz", br.header); bw.writeHeader(); var.setPOS(100001); // update the POS of the variant bw.writeRecord(var); #+end_src

** Variants Operation

All variants related API can be found [[https://zilongli.org/proj/vcfpp/classvcfpp_1_1_bcf_record][BcfRecord]]. The commonly used are listed below.

#+begin_src C++ BcfReader vcf("bcf.gz"); // construct a vcf reader BcfRecord var(vcf.header); // construct an empty variant record associated with vcf header vcf.getNextVariant(var) // get next variant vector gt; // genotype can be bool, char or int type var.getGenotypes(gt), var.setGenotypes(gt); // get or set genotypes for current variant var.isNoneMissing(); // check if there is missing value after getting genotypes vector gq; // genotype quality usually is of int type var.getFORMAT("GQ",gq), var.setFORMAT("GQ",gq); // get or set a vector of genotypes quality vector pl; // Phred-scaled genotype likelihoods usually is of int type var.getFORMAT("PL",pl); // get a vector of Phred-scaled genotype likelihoods float af; var.getINFO("AF", af), var.setINFO("AF", af); // get or set AF (allele frequency) value in INFO int mq; var.getINFO("MQ",mq) // get MQ (Average mapping quality) value from INFO vector dp4; // Number of high-quality ref-forward , ref-reverse, alt-forward and alt-reverse bases var.getINFO("DP4", dp4), var.setINFO("DP4", dp4); // get or set a vector of dp4 value from INFO var.isSNP(); // check if variant is SNP var.isSV(); // check if variant is SV var.isIndel(); // check if variant is indel var.isMultiAllelic(); // check if variant is MultiAllelic var.POS(), var.setPOS(); // get POS or modify POS #+end_src

** Header Operation

All variants related API can be found in [[https://zilongli.org/proj/vcfpp/classvcfpp_1_1_bcf_header][BcfHeader]].

  • Working with R

Examples of vcfpp working with R are in folder [[Rcpp]] and https://github.com/Zilong-Li/vcfppR.

  • Working with Python

Examples of vcfpp working with Python are in folder [[Pybind11]].

  • Command Line Tools

Find more useful command line tools in folder [[tools]].