gfabase icon indicating copy to clipboard operation
gfabase copied to clipboard

GFA insert into GenomicSQLite

gfabase

gfabase is a command-line tool for indexed storage of Graphical Fragment Assembly (GFA1) data. It imports a .gfa file into a compressed .gfab file, from which it can later access subgraphs quickly (reading only the necessary parts), producing .gfa or .gfab. Beyond ID lookups, .gfab indexes the graph by mappings onto reference genome coordinates, facilitating navigation within de novo assemblies and pangenome reference graphs.

Effectively, .gfab is a new GFA-superset format with built-in compression and indexing. It is in fact a SQLite (+ Genomics Extension) database populated with a GFA1-like schema, which programmers have the option to access directly, without requiring gfabase nor even a low-level parser for .gfa/.gfab.

Quick start

Each Release includes prebuilt gfabase executables for Linux and macOS x86-64 hosts. The executable provides subcommands:

  • gfabase load -o my.gfab [my.gfa]: create .gfab from a .gfa file (or pipe decompression through standard input)
  • gfabase view my.gfab: dump back to .gfa (if standard output is a terminal, automatically pipes to less -S)
  • gfabase sub my.gfab SEGMENT/PATH/RANGE... [--view]: query for a subgraph, producing either .gfa or .gfab
  • gfabase add-mappings my.gfab mappings.paf: add index of reference genome mappings for GFA segments

The following quick example accesses a scaffold by its Path name in a metaSPAdes assembly of simulated metagenomic reads from Ye et al. (2019); it also uses zstd for decompression.

curl -L "https://github.com/mlin/gfabase/blob/main/test/data/atcc_staggered.assembly_graph_with_scaffolds.gfa.zst?raw=true" \
    | zstd -dc \
    | ./gfabase load -o atcc_staggered.metaspades.gfab

# extract a scaffold from the metagenome assembly (by GFA Path name)
./gfabase sub atcc_staggered.metaspades.gfab -o a_scaffold.gfab --path NODE_2_length_747618_cov_15.708553_3
# view GFA:
./gfabase view a_scaffold.gfab
# or in one command:
./gfabase sub atcc_staggered.metaspades.gfab --view --path NODE_2_length_747618_cov_15.708553_3

The following in-depth notebooks demonstrate human genome uses, also integrating with Bandage for visualization:

  1. Navigating a human de novo assembly
  2. Slicing a pangenome reference graph
index

Segment mappings

Adding --range to gfabase sub means the other command-line arguments are linear sequence ranges (chr1:234-567) to be resolved to overlapping segments. This relies on mappings of each segment to its own linear coordinates, which gfabase load understands in two forms:

  1. The rGFA tags SN:Z and SO:i are present and the segment sequence length is known (from given sequence or LN:i)
  2. Segment tag rr:Z giving a browser-style range like rr:Z:chr1:2,345-6,789

Furthermore, gfabase add-mappings my.gfab mappings.paf adds mappings of segment sequences generated by minimap2 or a similar tool producing PAF format. The .gfab is updated in-place, so make a backup copy if needed.

Connected subgraphs

Adding --connected to gfabase sub expands the subgraph to include the complete connected component(s) associated with the specified segments.

That may be overkill, if we're only interested in the segments' immediate neighborhood. In that case, instead set --cutpoints 1 to extract the associated biconnected component(s), stopping the subgraph expansion at cutpoints (segments that any end-to-end walk of the chromosome must traverse). Setting --cutpoints 2 or higher expands to more-distant cutpoints. The expansion can be modified to disregard cutpoint segments less than L nucleotides long by adding --cutpoints-nt L.

The --connected and --cutpoints expansions treat the segment graph as undirected. Therefore the extracted subgraphs include, but are not limited to, directed "superbubbles."

Web access

gfabase view and gfabase sub can read .gfab http/https URLs directly. The web server must support HTTP GET range requests, and the content must be immutable. This is mainly useful to query for a small subgraph, especially with --no-sequences. On the other hand, a series of queries expected to traverse a large fraction of the graph will be better-served by downloading the whole file upfront.

Here's an example invocation to inspect the subgraph surrounding the HLA locus in a Shasta ONT assembly, remotely accessing a .gfab served by GitHub. (See the above-linked notebooks for details about the flags given.)

./gfabase sub \
    https://github.com/mlin/gfabase/releases/download/v0.5.0/shasta-HG002-Guppy-3.6.0-run4-UL.gfab \
    --view --cutpoints 2 --no-sequences --guess-ranges --range \
    chr6:29,700,000-29,950,000

To publish a .gfab on the web, it's helpful to first "defragment" the file using the genomicsqlite command-line tool made available by pip3 install genomicsqlite or conda install -c mlin genomicsqlite:

genomicsqlite my.gfab --compact --inner-page-KiB 64 --outer-page-KiB 2

...generating my.gfab.compact, a defragmented version that'll be more efficient to access. (mv my.gfab.compact my.gfab if so desired.)

Building from source

CI

Install rust toolchain and:

git clone https://github.com/mlin/gfabase.git
cd gfabase
./cargo build --release

Then find the executable target/release/gfabase.

./cargo is a wrapper for cargo that generates Cargo.toml from Cargo.toml.in, filling in the crate version based on the git tag.