xg

succinct labeled graphs with collections and paths

what?

This library provides an interface to the construction and query of succinct/compressed labeled graphs, labeled collections of nodes and edges, and paths through the graph. The primary motivation is to index variation graphs of the type produced by vg, but in principle the library could be used for any labeled, directed graph.

Graphs indexed with xg can be loaded and queried using a variety of high-performance interfaces backed by efficient, succinct data structures from sdsl-lite.

usage

First build:

make
make test

Now you can index a graph constructed with vg, then obtain a subgraph starting at a particular node and extending up to 5 steps away:

./xg -v test/data/z.vg -o z.xg
./xg -i z.xg -n 235 -c 5 | vg view -

vg view allows us to see the graph in GFA format.

See main.cpp for example usage.

challenge

vg's current graph memory model is weak and extremely bloated. It relies on fixed-width 64-bit integer ids and large hash tables mapping these to other entities. This makes the graph difficult to hold in memory, so a general-purpose key-value store (rocksdb) is used to allow low-memory access to the entire graph. Although this design has some advantages, querying the graph requires costly IO operations, so its use must be managed carefully when developing high-performance applications.

Fully-indexed graphs should be cheap to store and hold in memory, but it doesn't seem there is a standard approach that can be used just for high-performance access to the sequence and identifier space of the graph. Most work has gone into improving performance for querying the text of such a graph (GCSA) or generating one out of sequencing reads (assemblers such as SGA or fermi2).

The basic requirement is a system that uses a minimal amount of memory to store the sequence of the graph, its edges, and paths in the graph, but still allows constant-time access to the essential features of the graph. The system should support accessing:

the node's label (a DNA sequence, for instance, or URL)
the node's neighbors (inbound and outbound edges)
the node's region in the graph (ranges of node id space that are within some distance of the node)
node locations relative to stored paths in the graph
node and edge path membership

sketch

In theory we could construct a mutable system based on wavelet tries, but research in this area is very new, and I have not found readily-available code for working with these systems. It should be possible to construct mutable wavelet tries using sdsl-lite as a basis, but at present this may be too complex an objective. An immutable system seems like a straightforward thing to do.

First some definitions. We have a graph G = (N, E, P) with nodes N = n₁, …, n_|N|, directed edges E = e₁, …, e_|E|, and paths P = p₁, …, p_|P|. Nodes match labels l_{n_i} to ranks i in the collection of node labels: n_i = (l_{n_i}, i). Edges go from one node to another: e_j = (n_x, n_y). Paths match labels l_{p_k} to sets of nodes and edges: p_k = (l_{p_k}, {n₁, e₃, n₄, e₅, …}).

We first store the concatenated sequences of all elements in the graph, S = l_n₁l_n₂l_n₃…l_{n_|N|}, in a compressed integer vector, S_iv. A second compressed bitvector, S_bv : |S_iv|=|S_bv|, flags node starts, providing a system of node identifiers. We can apply rank₁(S_bv, x) to determine the node rank/id at a given position in S_iv, and we can use select₁(S_bv, x) to find the position in S_iv at which the node with rank/id x starts, thus allowing basic navigation of the nodes and their labels.
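The rank/select navigation described above can be sketched with plain, uncompressed containers. This is only an illustration under stated assumptions, not the library's implementation: the struct `NodeLabelIndex` and its members are hypothetical names, rank and select are naive linear scans rather than the succinct structures sdsl-lite provides, node ids are taken to be 1-based, and rank is taken to be inclusive of the queried position.

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <vector>

// Toy, uncompressed stand-in for S_iv and S_bv (hypothetical names throughout).
struct NodeLabelIndex {
    std::string s_iv;        // concatenated node labels
    std::vector<bool> s_bv;  // 1 at the first character of each node label
    std::size_t node_count = 0;

    // Append a non-empty label and return the new node's 1-based rank/id.
    std::size_t add_node(const std::string& label) {
        assert(!label.empty());
        s_bv.push_back(true);
        s_bv.insert(s_bv.end(), label.size() - 1, false);
        s_iv += label;
        return ++node_count;
    }

    // rank1: number of 1s in s_bv[0..pos], i.e. the id of the node covering pos.
    std::size_t rank1(std::size_t pos) const {
        std::size_t ones = 0;
        for (std::size_t i = 0; i <= pos && i < s_bv.size(); ++i) ones += s_bv[i];
        return ones;
    }

    // select1: position of the k-th 1 (k is 1-based);
    // returns s_bv.size() when k is past the last node.
    std::size_t select1(std::size_t k) const {
        std::size_t ones = 0;
        for (std::size_t i = 0; i < s_bv.size(); ++i) {
            if (s_bv[i] && ++ones == k) return i;
        }
        return s_bv.size();
    }

    // Label of node id: the slice of s_iv from this node's start to the next node's start.
    std::string label(std::size_t id) const {
        std::size_t begin = select1(id);
        std::size_t end = select1(id + 1);
        return s_iv.substr(begin, end - begin);
    }
};
```

Adding the labels GATT, A and CA in that order yields the sequence GATTACA with node starts at positions 0, 4 and 5, so `rank1(4)` gives node 2 and `label(3)` gives CA.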

To store edges we keep compressed integer vectors of node ids for the forward F_iv and reverse T_iv link directions, where F_iv = f₁, …, f_|N| and f_i = i, to_i₁, …, to_{i_{|to_i|}}. T_iv inverts this relationship, providing T_iv = t₁, …, t_|N| and t_i = i, from_i₁, …, from_{i_{|from_i|}}. Recall that i is the rank of the node. Using two further bitvectors, F_bv : |F_bv|=|F_iv| and T_bv : |T_bv|=|T_iv|, we record the first position of each node's entries in F_iv and T_iv. This first position simply records the node's rank i in S_iv. The rest of the positions in the node's range record the ranks/ids of the nodes on the other end of each edge: the "to" end in F_iv and the "from" end in T_iv. If a node has no edges either coming from or going to it, it is represented only by its own rank in the corresponding edge integer vector.
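The edge layout above can be illustrated with a small sketch. It assumes 1-based node ranks, uses uncompressed `std::vector` containers in place of the compressed vectors, and implements select as a linear scan. The names `EdgeIndex`, `append_record`, `edges_from` and `edges_to` are hypothetical and are not taken from the xg source.

```cpp
#include <cassert>
#include <cstddef>
#include <utility>
#include <vector>

// Toy stand-in for F_iv/F_bv and T_iv/T_bv (hypothetical names throughout).
struct EdgeIndex {
    std::vector<std::size_t> f_iv, t_iv;  // per-node records: own rank, then neighbour ranks
    std::vector<bool> f_bv, t_bv;         // 1 marks the first entry of each node's record

    // Build both directions from a list of (from, to) pairs over nodes 1..n_nodes.
    EdgeIndex(std::size_t n_nodes,
              const std::vector<std::pair<std::size_t, std::size_t>>& edges) {
        std::vector<std::vector<std::size_t>> to(n_nodes + 1), from(n_nodes + 1);
        for (const auto& e : edges) {
            assert(e.first >= 1 && e.first <= n_nodes);
            assert(e.second >= 1 && e.second <= n_nodes);
            to[e.first].push_back(e.second);
            from[e.second].push_back(e.first);
        }
        for (std::size_t id = 1; id <= n_nodes; ++id) {
            append_record(f_iv, f_bv, id, to[id]);
            append_record(t_iv, t_bv, id, from[id]);
        }
    }

    // Write one record: the node's own rank (flagged with 1), then the other ends.
    static void append_record(std::vector<std::size_t>& iv, std::vector<bool>& bv,
                              std::size_t id, const std::vector<std::size_t>& others) {
        iv.push_back(id);
        bv.push_back(true);
        iv.insert(iv.end(), others.begin(), others.end());
        bv.insert(bv.end(), others.size(), false);
    }

    // Position of the k-th 1 (k is 1-based); bv.size() if there is no such bit.
    static std::size_t select1(const std::vector<bool>& bv, std::size_t k) {
        std::size_t ones = 0;
        for (std::size_t i = 0; i < bv.size(); ++i) {
            if (bv[i] && ++ones == k) return i;
        }
        return bv.size();
    }

    // Entries of a node's record after its own rank, i.e. the nodes at the other end.
    static std::vector<std::size_t> record(const std::vector<std::size_t>& iv,
                                           const std::vector<bool>& bv, std::size_t id) {
        std::size_t begin = select1(bv, id) + 1;  // skip the node's own rank
        std::size_t end = select1(bv, id + 1);    // start of the next record, or the end
        if (begin >= end) return {};
        return std::vector<std::size_t>(iv.begin() + begin, iv.begin() + end);
    }

    std::vector<std::size_t> edges_from(std::size_t id) const { return record(f_iv, f_bv, id); }
    std::vector<std::size_t> edges_to(std::size_t id) const { return record(t_iv, t_bv, id); }
};
```

In this layout each vector holds one entry per node plus one entry per edge, so a three-node graph with edges 1→2, 1→3 and 2→3 gives F_iv = 1, 2, 3, 2, 3, 3 with F_bv = 1, 0, 0, 1, 0, 1. Node 3 has no outgoing edges and appears only as its own rank.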

We can represent the path space P_1, …, P_n of the graph using a bitvector marking which entities in the edge-from integer vector F_iv lie in a path. For each path P_i we keep a new bitvector Pe_{i_bv} : |Pe_{i_bv}|=|F_iv|, which is typically sparse and compressed, in which traversed nodes and edges are set to 1 and un-traversed nodes and edges are set to 0. Each path thus maps a label to a list of nodes and edges. To support cycles in which a single node may be traversed multiple times, we also store the path P_i as a vector of node ids Pid_i, and use a wavelet tree to provide rank and select operations on this structure to find node ranks in the path. These node ranks map into another vector Po_{i_iv} which lists the offset of each node relative to the path. Finally, a bit vector Pp_{i_bv} of the length of each path, in which we store a 1 at each position in the path where a node starts, allows us to find the node at a particular position in the path. Rank can be used to determine the node id at a given position in the path. In conjunction these structures allow us to store the paths and employ them as relative coordinate systems. Paths can overlap and serve as a form of annotation of features in the path space of the graph. In DNA, for instance, we might record a gene or exon as a path.
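The positional part of the path scheme (Pid_i, Po_{i_iv} and Pp_{i_bv}) can be sketched as follows. This is a simplified illustration under stated assumptions: `PathIndex` and its members are hypothetical names, a linear scan over the id vector stands in for the wavelet tree, rank is a naive count, and the path membership bitvector Pe_{i_bv} is left out.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Toy stand-in for one path's Pid_i, Po_{i_iv} and Pp_{i_bv} (hypothetical names).
struct PathIndex {
    std::vector<std::size_t> ids;      // node ids in traversal order (Pid_i)
    std::vector<std::size_t> offsets;  // path offset at which each step starts (Po_{i_iv})
    std::vector<bool> starts;          // 1 at each path position where a node starts (Pp_{i_bv})

    // Extend the path by one traversal of a node whose label has label_length characters.
    void append(std::size_t node_id, std::size_t label_length) {
        assert(label_length > 0);
        ids.push_back(node_id);
        offsets.push_back(starts.size());
        starts.push_back(true);
        starts.insert(starts.end(), label_length - 1, false);
    }

    // Node id covering a path position: rank over starts picks the step, ids gives the node.
    std::size_t node_at(std::size_t pos) const {
        assert(pos < starts.size());
        std::size_t step = 0;
        for (std::size_t i = 0; i <= pos; ++i) step += starts[i];
        return ids[step - 1];
    }

    // Every path offset at which node_id starts; a cycle yields more than one entry.
    // A wavelet tree over ids would answer this with select instead of a scan.
    std::vector<std::size_t> positions_of(std::size_t node_id) const {
        std::vector<std::size_t> result;
        for (std::size_t k = 0; k < ids.size(); ++k) {
            if (ids[k] == node_id) result.push_back(offsets[k]);
        }
        return result;
    }
};
```

For a path that visits node 1 (length 4), node 2 (length 1), node 1 again and then node 3 (length 2), position 4 falls in node 2, positions 5 through 8 fall in the second traversal of node 1, and node 1 is reported at offsets 0 and 5.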
