GraphUnzip
Unzip assembly graphs with Hi-C data and/or long reads.
Unzips an assembly graph using Hi-C data, long reads and/or linked reads. This is the master branch; use the article branch to reproduce the results of the article.
Why use GraphUnzip?
GraphUnzip improves the contiguity of an assembly and duplicates collapsed homozygous contigs, aiming to reconstitute an assembly with haplotypes assembled separately. GraphUnzip untangles an uncollapsed assembly graph in GFA format. Its naive approach makes no assumption about the ploidy or the heterozygosity rate of the organism, so it can be used on highly heterozygous genomes or metagenomes. To find out whether GraphUnzip may be useful to you, see the section on when GraphUnzip is useful below.
Combined with a short read assembler, GraphUnzip makes a great hybrid (short/long read) assembler: go to the bottom of the page to see an example.
Installation
GraphUnzip requires python3 with numpy, scipy and zlib.
To read bam-format data (Hi-C or linked reads) you'll also need pysam.
There are 4 options available for installing GraphUnzip:
- Clone from this repository and install as a local Python package. This is the best way to ensure you have the latest development version.
git clone https://github.com/nadegeguiglielmoni/GraphUnzip.git && cd GraphUnzip && pip install -e .
- Pip install directly from this git repository.
pip install git+https://github.com/nadegeguiglielmoni/GraphUnzip.git
- Install from PyPi.
pip install graphunzip
- Install from Bioconda.
conda install -c bioconda graphunzip
Run graphunzip --help to verify installation.
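If graphunzip --help fails, it can help to check which of the Python modules listed above are importable. The following is a minimal sketch using only the standard library; the helper name check_modules is hypothetical and is not part of GraphUnzip.

```python
import importlib.util

# Modules named in the installation notes above.
# pysam is only needed to read BAM-format data.
REQUIRED = ("numpy", "scipy", "zlib")
OPTIONAL = ("pysam",)


def check_modules(names):
    """Return {module_name: bool} telling whether each module can be located."""
    return {name: importlib.util.find_spec(name) is not None for name in names}


if __name__ == "__main__":
    for name, found in check_modules(REQUIRED + OPTIONAL).items():
        kind = "optional" if name in OPTIONAL else "required"
        print(f"{name:8s} {kind:9s} {'ok' if found else 'MISSING'}")
```

This only checks that the modules can be found by the current interpreter; it does not check their versions.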
Usage
Input
GraphUnzip needs two things to work: an assembly graph in GFA 1.0 format and any combination of:
- Hi-C data: GraphUnzip needs either 1) the Hi-C reads mapped to the assembly in name-sorted BAM format or 2) a sparse contact matrix and a fragment list in the formats output by hicstuff. You can use hicstuff to obtain these files, preferably in iterative mode:
awk '/^S/{print ">"$2"\n"$3}' assembly.gfa > assembly.fasta #produce a fasta file from the gfa
hicstuff pipeline -t 8 --mapping=iterative -o mapping/ -g assembly.fasta -e DpnII HiC_reads_forward.fq HiC_reads_reverse.fq
and/or
- Long reads (mapped to the GFA in the GAF format of GraphAligner). It is best to use an old version of GraphAligner (commit 5217838b436fee4eda5824aabee99406db2a137b) with option --global-alignment; otherwise you can use a more recent version with option --multimap-score-fraction 1.
GraphAligner --global-alignment -x vg -f reads.fq -g graph.gfa -a longreads_aligned_on_gfa.gaf
and/or
- Barcoded linked reads mapped to the contigs of the assembly in SAM format. Barcodes need to be designated in the SAM by a BX:Z: tag (e.g. BX:Z:AACTTGTCGGTCAT-1) at the end of each line. A possible pipeline to get this file from barcoded reads using BWA would be:
awk '/^S/{print ">"$2"\n"$3}' assembly.gfa > assembly.fasta #produce a fasta file from the gfa
bwa index assembly.fasta #index the fasta file of the assembly
bwa mem -C assembly.fasta barcoded_reads.fastq > reads_aligned_on_assembly.sam #align the barcoded reads to the assembly: the -C option is very important here, to keep the barcodes in the SAM file
Note: Linked reads support is an experimental option we added on demand from some users. It has not been extensively tested. We also expect results to be poorer than what is obtained using Hi-C or long reads.
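Before running GraphUnzip on linked reads, it may be worth checking that the barcodes survived the alignment. The sketch below mirrors the awk one-liner above in Python and counts how many SAM records carry a BX:Z: tag. The function names gfa_to_fasta and count_barcoded are hypothetical helpers written for this illustration, and the toy records are made up.

```python
def gfa_to_fasta(gfa_lines):
    """Yield FASTA lines for every segment (S) record of a GFA 1.0 file."""
    for line in gfa_lines:
        fields = line.rstrip("\n").split("\t")
        if fields[0] == "S" and len(fields) >= 3:
            yield f">{fields[1]}"   # segment name
            yield fields[2]         # segment sequence


def count_barcoded(sam_lines):
    """Return (alignment_records, records_with_BX_tag) for a SAM stream."""
    total = tagged = 0
    for line in sam_lines:
        if line.startswith("@"):    # header line
            continue
        fields = line.rstrip("\n").split("\t")
        if len(fields) < 11:        # not a complete alignment record
            continue
        total += 1
        # Optional tags start at the 12th column of a SAM record.
        if any(field.startswith("BX:Z:") for field in fields[11:]):
            tagged += 1
    return total, tagged


# Toy data for illustration only.
toy_gfa = [
    "H\tVN:Z:1.0\n",
    "S\tctg1\tACGT\n",
    "L\tctg1\t+\tctg2\t-\t0M\n",
    "S\tctg2\tGGCC\n",
]
toy_sam = [
    "@SQ\tSN:ctg1\tLN:4\n",
    "read1\t0\tctg1\t1\t60\t4M\t*\t0\t0\tACGT\tFFFF\tNM:i:0\tBX:Z:AACTTGTCGGTCAT-1\n",
    "read2\t0\tctg2\t1\t60\t4M\t*\t0\t0\tGGCC\tFFFF\tNM:i:0\n",
]

print("\n".join(gfa_to_fasta(toy_gfa)))
total, tagged = count_barcoded(toy_sam)
print(f"{tagged} of {total} records carry a BX:Z: tag")
```

On real data, pass open file handles instead of the toy lists. If almost no record carries a BX:Z: tag, the -C option was probably omitted during alignment.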
Running GraphUnzip
To use GraphUnzip, you generally need to proceed in two steps:
- If using Hi-C or linked reads, build the interaction matrix(ces) (a matrix quantifying the pairwise interaction between all contigs): use the HiC-IM or linked-reads-IM command, depending on which type of data you have. You will have to specify the files to which these interaction matrices will be written.
#for Hi-C
graphunzip HiC-IM -m path/to/abs_fragments_contacts_weighted.txt -F path/to/fragments_list.txt -g assembly.gfa --HiC_IM hic_interactionmatrix.txt
#for linked reads
graphunzip linked-reads-IM --barcoded_SAM reads_aligned_on_assembly.sam -g assembly.gfa --linked_reads_IM linkedreads_interactionmatrix.txt
- Use the command unzip to unzip the graph using the interaction matrices built beforehand and/or the GAF file if using long reads.
#let's unzip our gfa using linked-reads, Hi-C and long reads :
graphunzip unzip -g assembly.gfa -i hic_interactionmatrix.txt -k linkedreads_interactionmatrix.txt -l longreads_aligned_on_gfa.gaf -o assembly_unzipped.gfa
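To make the notion of an interaction matrix more concrete, here is a minimal sketch of how fragment-level Hi-C contacts could be summed into contig-level counts. This is not GraphUnzip's actual implementation: the data layout (a list mapping fragment index to contig name, and (fragment_i, fragment_j, count) triples) and the function name contig_interactions are assumptions chosen for illustration.

```python
from collections import Counter


def contig_interactions(fragment_to_contig, contacts):
    """Sum fragment-level contact counts into contig-pair counts.

    fragment_to_contig: list where index i names the contig holding fragment i
    contacts: iterable of (fragment_i, fragment_j, count) triples
    """
    matrix = Counter()
    for frag_i, frag_j, count in contacts:
        contig_a = fragment_to_contig[frag_i]
        contig_b = fragment_to_contig[frag_j]
        if contig_a == contig_b:
            continue  # contacts within one contig say nothing about contig pairs
        # Store each unordered pair once, under a sorted key.
        matrix[tuple(sorted((contig_a, contig_b)))] += count
    return matrix


# Toy data for illustration only: four fragments spread over three contigs.
fragments = ["ctg1", "ctg1", "ctg2", "ctg3"]
contacts = [(0, 1, 10), (0, 2, 3), (1, 2, 2), (2, 3, 1), (3, 0, 4)]

for pair, count in sorted(contig_interactions(fragments, contacts).items()):
    print(pair, count)
```

The result is symmetric by construction, so only one entry per contig pair is kept. Contigs that interact strongly are more likely to belong to the same haplotype or chromosome, which is the kind of signal the unzip step relies on.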
Options
GraphUnzip has 5 sub-modules:
- unzip: untangle the GFA file
- purge: retain only haploid contigs
- extract: extract haploid assembly with a close reference genome
- HiC-IM: to prepare Hi-C data
- linked-reads-IM: to prepare linked reads data
graphunzip --help
usage: graphunzip [-h] command
positional arguments:
command Sub-command must be one of:
unzip (untangle the GFA file),
purge (retain only haploid contigs),
extract (extract haploid assembly with a close reference genome),
HiC-IM (to prepare Hi-C data) or
linked-reads-IM (to prepare linked reads data)
optional arguments:
-h, --help show this help message and exit
-v, --version show program's version number and exit
To run command unzip:
graphunzip unzip -h
usage: graphunzip [-h] -g GFA [-i HICINTERACTIONS] [-k LINKEDREADSINTERACTIONS] [-l LONGREADS] [-o OUTPUT]
[-f FASTA_OUTPUT] [-v] [-r] [--dont_merge] [-c] [-b]
optional arguments:
-h, --help show this help message and exit
Input of GraphUnzip:
-g GFA, --gfa GFA GFA file to phase
-i HICINTERACTIONS, --HiCinteractions HICINTERACTIONS
File containing the Hi-C interaction matrix from HiC-IM [optional]
-k LINKEDREADSINTERACTIONS, --linkedReadsInteractions LINKEDREADSINTERACTIONS
File containing the linked-reads interaction matrix from linked-reads-IM [optional]
-l LONGREADS, --longreads LONGREADS
Long reads mapped to the GFA with GraphAligner (GAF format) or SPAligner (TSV format) [optional]
Output of GraphUnzip:
-o OUTPUT, --output OUTPUT
Output GFA [default: output.gfa]
-f FASTA_OUTPUT, --fasta_output FASTA_OUTPUT
Optional fasta output [default: None]
Behavior of GraphUnzip:
-c, --conservative (Hi-C only) Output very robust contigs. Use this option if the coverage information of the graph is not reliable
-b, --bold (Hi-C only)[default] Proposes the best untangling it can get (can be misled by approximate coverage information). Use this option if the contig coverage information of the graph can be trusted
-e, --exhaustive (long reads only) All links not found in the .gaf will be removed
Other options:
-v, --verbose
-r, --dont_rename Use if you don't want to name the resulting supercontigs with short names but want to keep the names of the original contigs
--dont_merge If you don't want the output to have all possible contigs merged
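When the unzip step is called from a script, the command line can be assembled from whichever inputs are available. The sketch below only uses flags listed in the help text above (-g, -i, -k, -l, -o, -f, -c). The function build_unzip_command is a hypothetical helper, and the check that at least one data source is given is an assumption of this sketch, not documented behavior of graphunzip.

```python
import shlex


def build_unzip_command(gfa, output="output.gfa", hic=None, linked_reads=None,
                        long_reads=None, fasta=None, conservative=False):
    """Return the argument list for a `graphunzip unzip` call."""
    if hic is None and linked_reads is None and long_reads is None:
        # Assumption of this sketch: unzipping needs at least one data source.
        raise ValueError("provide at least one of hic, linked_reads, long_reads")
    cmd = ["graphunzip", "unzip", "-g", gfa]
    if hic:
        cmd += ["-i", hic]            # matrix written by HiC-IM
    if linked_reads:
        cmd += ["-k", linked_reads]   # matrix written by linked-reads-IM
    if long_reads:
        cmd += ["-l", long_reads]     # GAF (GraphAligner) or TSV (SPAligner)
    cmd += ["-o", output]
    if fasta:
        cmd += ["-f", fasta]
    if conservative:
        cmd.append("-c")              # described above as a Hi-C only option
    return cmd


cmd = build_unzip_command(
    "assembly.gfa",
    output="assembly_unzipped.gfa",
    hic="hic_interactionmatrix.txt",
    long_reads="longreads_aligned_on_gfa.gaf",
)
print(shlex.join(cmd))
# To actually run it: subprocess.run(cmd, check=True)
```

Building the command as a list rather than a string avoids quoting problems when file paths contain spaces.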
To run command HiC-IM:
graphunzip HiC-IM --help
usage: graphunzip [-h] -g GFA_GRAPH -m MATRIX -F FRAGMENTS [--HiC_IM HIC_IM]
optional arguments:
-h, --help show this help message and exit
-g GFA_GRAPH, --gfa_graph GFA_GRAPH
GFA file that will be untangled (required)
-m MATRIX, --matrix MATRIX
Sparse Hi-C contact map (required)
-F FRAGMENTS, --fragments FRAGMENTS
Fragments list (required)
--HiC_IM HIC_IM Output file for the Hi-C interaction matrix (required)
To run command linked-reads-IM:
graphunzip linked-reads-IM --help
usage: graphunzip [-h] -g GFA_GRAPH -p--linked_reads_IM P__LINKED_READS_IM
-b BARCODED_SAM
optional arguments:
-h, --help show this help message and exit
-g GFA_GRAPH, --gfa_graph GFA_GRAPH
GFA file that will be untangled (required)
-p--linked_reads_IM P__LINKED_READS_IM
Output file for the linked-read interaction matrix
(required)
-b BARCODED_SAM, --barcoded_SAM BARCODED_SAM
SAM file of the barcoded reads aligned to the
assembly. Barcodes must still be there (use option -C
if aligning with BWA) (required)
Hybrid assembly
Combined with a short read assembler, GraphUnzip makes a great hybrid (short reads - long reads) assembler. Here is a suggested pipeline.
Installation
You'll need a working python installation to run this pipeline.
If not already done, download GraphUnzip:
git clone https://github.com/nadegeguiglielmoni/GraphUnzip.git
Install SPAdes to have both a short read assembler and an aligner (SPAligner). You can use another assembler if you prefer, but the installation of SPAdes is still recommended to have access to SPAligner. On Linux, the commands are:
wget http://cab.spbu.ru/files/release3.15.3/SPAdes-3.15.3-Linux.tar.gz
tar -xzf SPAdes-3.15.3-Linux.tar.gz
Short read assembly
Run the short read assembler. If you are using SPAdes:
SPAdes-3.15.3-Linux/bin/spades.py -s short_reads.fastq -o short_read_assembly
This command is for unpaired short reads. If using another type of library, or if you want to tune other options, please refer to spades.py --help.
Read alignment
We will use SPAligner to align long reads to the assembly graph. If you want to tune the parameters, refer to the GitHub repository of SPAligner.
SPAdes-3.15.3-Linux/bin/spaligner SPAdes-3.15.3-Linux/share/spaligner/spaligner_config.yaml -d pacbio -g short_read_assembly/assembly_graph_with_scaffolds.gfa -k 127 -s long_reads.fastq.gz
Untangling the short-read assembly
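Now we use the unzip command of GraphUnzip (this assumes the graphunzip command is available, as described in the installation section at the top of the page):
graphunzip unzip -g short_read_assembly/assembly_graph_with_scaffolds.gfa -l spaligner_result/alignment.tsv -o assembly.gfa -f assembly.fasta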
The final assembly files are assembly.gfa (GFA format) and assembly.fasta (FASTA format).
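To see whether untangling improved contiguity, one option is to compare a summary statistic such as the N50 of the contigs before and after. N50 is not mentioned by GraphUnzip itself; it is used here as a common contiguity measure, and the helpers read_fasta_lengths and n50 are hypothetical names written for this sketch.

```python
def read_fasta_lengths(lines):
    """Return the sequence length of every record in a FASTA stream."""
    lengths = []
    current = None
    for line in lines:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if current is not None:
                lengths.append(current)
            current = 0              # start a new record
        elif current is not None:
            current += len(line)     # sequences may span several lines
    if current is not None:
        lengths.append(current)
    return lengths


def n50(lengths):
    """Length L such that contigs of length >= L cover at least half the total."""
    total = sum(lengths)
    running = 0
    for length in sorted(lengths, reverse=True):
        running += length
        if 2 * running >= total:
            return length
    return 0  # empty assembly


# Toy FASTA for illustration; on real data use open("assembly.fasta").
toy_fasta = [">contig_1", "ACGTAC", "GT", ">contig_2", "ACG", ">contig_3", "A"]
lengths = read_fasta_lengths(toy_fasta)
print(len(lengths), "contigs, total", sum(lengths), "bp, N50", n50(lengths))
```

Running this on the FASTA derived from the input graph and on assembly.fasta gives a rough before/after comparison. A higher N50 alone does not prove the untangling is correct.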
When is GraphUnzip useful?
It is tempting to try GraphUnzip on any assembly to improve its contiguity, and you can! Yet on some assemblies it will not improve the results at all. You can generally tell beforehand by looking at the assembly graph with the tool Bandage.
GraphUnzip untangles assembly graphs. Thus it likes having messy, tangled graphs as input. Here is an example of an assembly on which GraphUnzip will probably do well:
On the contrary, some assemblies are very fragmented. For those, GraphUnzip cannot do much, since it cannot reconstitute the missing sequence between two contigs. You might consider using a scaffolder instead. Here is an example of a very fragmented assembly, which cannot be untangled much more:
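Alongside a visual inspection, a few counts from the GFA file can give a first impression of whether a graph is tangled or fragmented. The sketch below is a rough heuristic of our own, not a criterion used by GraphUnzip; the function summarize_gfa is a hypothetical helper.

```python
def summarize_gfa(gfa_lines):
    """Count segments, links and segments that take part in no link."""
    segments = set()
    linked = set()
    n_links = 0
    for line in gfa_lines:
        fields = line.rstrip("\n").split("\t")
        if fields[0] == "S" and len(fields) >= 2:
            segments.add(fields[1])
        elif fields[0] == "L" and len(fields) >= 5:
            n_links += 1
            linked.update((fields[1], fields[3]))   # both ends of the link
    return {
        "segments": len(segments),
        "links": n_links,
        "isolated_segments": len(segments - linked),
    }


# Toy graph for illustration: ctg4 has no link to any other segment.
toy_gfa = [
    "S\tctg1\tACGT\n",
    "S\tctg2\tGGCC\n",
    "S\tctg3\tTTAA\n",
    "S\tctg4\tCCGG\n",
    "L\tctg1\t+\tctg2\t+\t0M\n",
    "L\tctg1\t+\tctg3\t+\t0M\n",
]
print(summarize_gfa(toy_gfa))
```

A graph in which most segments are isolated is fragmented, and there is little for GraphUnzip to untangle. A graph with many links per segment is a better candidate.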
Citation
Please cite GraphUnzip using the preprint:
Roland Faure, Nadège Guiglielmoni and Jean-François Flot. GraphUnzip: unzipping assembly graphs with long reads and Hi-C. bioRxiv (2021). doi: https://doi.org/10.1101/2021.01.29.428779