xg-old icon indicating copy to clipboard operation
xg-old copied to clipboard

succinct labeled graphs with collections and paths

xg

succinct labeled graphs with collections and paths

Build Status

what?

This library provides an interface to the construction and query of succinct/compressed labeled graphs, labeled collections of nodes and edges, and paths through the graph. The primary motivation is to index variation graphs of the type produced by vg, but in principle the library could be used for any labeled, directed graph.

Graphs indexed with xg can be loaded and queried using variety of high-performance interfaces backed by efficient, succinct data structures from sdsl-lite.

usage

First build:

make
make test

Now you can index a graph constructed with vg, then obtain a subgraph starting at a particular node and extending up to 5 steps away:

./xg -v test/data/z.vg -o z.xg
./xg -i z.xg -n 235 -c 5 | vg view -

vg view allows us to see the graph in GFA format.

See main.cpp for example usage.

challenge

vg's current graph memory model is weak and extremely bloated. It relies on fixed-width 64-bit integer ids and large hash tables mapping these to other entities. This makes it difficult to store in memory, and a general-purpose key-value store (rocksdb) is used to allow low-memory access to the entire graph. Although this design has some advantages, querying the graph requires costly IO operations, and thus use must be managed carefully when developing high-performance applications.

Fully-indexed graphs should be cheap to store and hold in memory, but it doesn't seem there is a standard approach that can be used just for high-performance access to the sequence and identifier space of the graph. Most work has gone into improving performance for querying the text of such a graph (GCSA) or generating one out of sequencing reads (assemblers such as SGA or fermi2).

The basic requirement is a system that a minimal amount of memory to store the sequence of the graph, its edges, and paths in the graph, but still allows constant-time access to the essential features of the graph. The system should support accessing:

  • the node's label (a DNA sequence, for instance, or URL)
  • the node's neighbors (inbound and outbound edges)
  • the node's region in the graph (ranges of node id space that are within some distance of the node)
  • node locations relative to stored paths in the graph
  • node and edge path membership

sketch

In theory we could construct a mutable system based on wavelet tries, but research in this area is very new, and I have not found readily-available code for working with these systems. It should be possible to construct mutable wavelet tries using sdsl-lite as a basis, but at present this may be too complex an objective. An immutable system seems like a straightforward thing to do.

First some definitions. We have a graph G = N, E, P with nodes N = n1, …, n|N|, directed edges E = e1, …, e|E|, and paths P = p1, …, p|P|. Nodes match labels lni to ranks i in the collection of node labels: ni = lni, i. Edges go from one node to another ej = nx, ny. Paths match labels lpk to sets of nodes and edges pk = lpk, {n1, e3, n4, e5, …}.

We first store the concatenated sequences of all elements, S = ln1ln2ln3ln|N|, in the graph in a compressed integer vector, Siv. A second compressed bitvector, Sbv : |Siv|=|Sbv|, flags node starts, providing a system of node identifiers. We can apply rank1(Sbv, x) to determine the node rank/id at a given position in Siv, and we can use select1(Sbv, x) to find the positions in Siv corresponding to node with rank/id x, thus allowing basic navigation of the nodes and their labels.

To store edges we keep compressed integer vectors of node ids for the forward Fiv and reverse Tiv link directions, where Fiv = f1, …, f|N| and fi = i, toi1, …, toi|toi|. Tiv inverts this relationship, providing Tiv = t1, …, t|N| and ti = i, fromi1, …, fromi|fromi|. Recall that i is the rank of the node. Using another bitvector Fbv : |Fbv|=|Fiv| and Tbv : |Tbv|=|Tiv| for we record the first position of each node's entries in Fiv and Tiv. This first position simply records the rank i in Siv. The rest of the positions in the node's range record the ranks/ids of the nodes on the other end of the edge--- on the "to" end in the Fiv and the "from" end in Tiv. If a node has no edges either coming from or going to it, it will only be represented by reference to its own rank in the correspending edge integer vector.

We can represent the path space Pi, …, Pn of the graph using a bitvector marking which entities in the edge-from integer vector Fiv lie in a path. For each traversed node or edge, we mark a 1 in a new bitvector Peibv : |Peibv|=|Fiv|, which is typically sparse and compressed. We mark contained entries with 1 and set the un-traversed nodes and edges to 0. Each path thus maps a label to a list of nodes and edges. To support cycles in which a single node may be traversed multiple times, we also store the path Pi as a vector of node ids Pidi, and use a wavelet tree to provide rank and select operations on this structure to find node ranks in the path. These node ranks map into another vector Poiiv which lists the offset of each node relative to the path. Finally, a bit vector of the length of each path Ppibv in which we store a 1 at each position in the path where a node starts allows us to find the node at a particular position in the path. Rank can be used to determine the node id at a given position in the path. In conjunction these structures allow us to store the paths and employ them as relativistic coordinate systems. Paths can overlap and serve as a form of annotation of features in the path space of the graph, in DNA for instance we might record a gene or exon as a path.