
Dynamic update example

Open pirovc opened this issue 1 month ago • 2 comments

Hello! In the MetaGraph article there is a description of 3 ways to update indices, in summary:

  1. "two separated indexes can be queried simultaneously"
  2. "graph can be updated directly if it is represented using a dynamic data structure"
  3. "the existing index can be reconstructed entirely"

However, I could not find a way in the documentation to execute any of those actions. Could you please point me to how to update an existing metagraph index, and what the best practice for this would be?

Cheers!

pirovc, Dec 04 '25 15:12

Hi,

Here's a quick clarification of those three approaches.

1. “Two separate indexes can be queried simultaneously”

In this approach, we build a new index for the additional data, exactly as you did for the original one (metagraph build, metagraph annotate, etc.).

Then, query each index independently (metagraph query ...), and aggregate the results.

This also works in server mode: host each index (graph + annotation) with its own server on a separate port, e.g.

metagraph server_query \
    -i old_graph.dbg \
    -a old_annotation.column.annodbg \
    --port 9000 \
    --parallel 16

metagraph server_query \
    -i new_graph.dbg \
    -a new_annotation.column.annodbg \
    --port 9001 \
    --parallel 16

From your client (Python API or your own HTTP client), send the same query to both servers and merge the results on the client side. The Python API was designed to talk to one or multiple MetaGraph servers and aggregate results.

So strategy (1) basically means: treat the original index as immutable, build another one for the new data, and fan out queries to both.
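The client-side fan-out and merge could look roughly like the sketch below. It deliberately does not call the MetaGraph Python API: the `QueryFn` callables and the `(label, matched_kmer_count)` result shape are assumptions made for illustration, so you would wrap whatever client call you actually use (one per server, e.g. ports 9000 and 9001) in a function with that shape.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Callable, Dict, List, Tuple

# Hypothetical result shape: one (label, matched_kmer_count) pair per hit.
Hit = Tuple[str, int]
# Hypothetical backend: any callable that takes a sequence and returns hits.
QueryFn = Callable[[str], List[Hit]]


def fan_out(sequence: str, backends: Dict[str, QueryFn]) -> Dict[str, List[Hit]]:
    """Send the same query to every backend in parallel; return hits per backend."""
    with ThreadPoolExecutor(max_workers=max(1, len(backends))) as pool:
        futures = {name: pool.submit(fn, sequence) for name, fn in backends.items()}
        return {name: fut.result() for name, fut in futures.items()}


def merge_hits(per_backend: Dict[str, List[Hit]]) -> List[Tuple[str, int, str]]:
    """Flatten hits into (label, count, backend) rows, best matches first."""
    rows = [
        (label, count, name)
        for name, hits in per_backend.items()
        for label, count in hits
    ]
    # Sort by descending count, then by label and backend for a stable order.
    return sorted(rows, key=lambda r: (-r[1], r[0], r[2]))


if __name__ == "__main__":
    # Stub backends standing in for the two servers from the commands above.
    def old_index(seq: str) -> List[Hit]:
        return [("sampleA", 10), ("sampleB", 3)]

    def new_index(seq: str) -> List[Hit]:
        return [("sampleC", 7)]

    for row in merge_hits(fan_out("ACGT", {"old": old_index, "new": new_index})):
        print(row)
```

Keeping the backend name in each row avoids assuming anything about whether labels overlap between the old and the new index; if they can overlap, you would need to decide how to combine the counts.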

2. “Graph can be updated directly if it is represented using a dynamic data structure”

This refers to using a dynamic de Bruijn graph implementation (e.g. metagraph build ... --graph succinct --state dynamic, --graph hashstr, or --graph hash), as discussed in the paper. Such graphs can be updated with the metagraph extend command, e.g., metagraph extend -i ... new_data.fa.

However, this approach is the least scalable in practice. Hence, we currently don't provide end-user documentation for it as we do for build / annotate / transform. I recommend trying to solve the problem via approaches 1 and 3.

3. “The existing index can be reconstructed entirely”

This is effectively “rebuild the index, but reuse contigs/contig buckets instead of raw reads”.

In this approach, we extract contigs from the current graph (non-redundant representation of all indexed k-mers):

metagraph transform \
    -v --to-fasta \
    -o old_contigs \
    -p 16 \
    old_graph.dbg

Then we combine those contigs with contigs or sequences from the new data. For example, assemble+clean the new reads into contigs (as in the recommended workflow).

Then build a joint graph and annotation from the combined contigs:

metagraph build \
    -v -p 16 -k 31 \
    -o updated_graph \
    old_contigs.fasta.gz \
    new_contigs.fasta.gz

This corresponds to strategy (3) from the article: decompose to contigs/“contig buckets”, augment them with new sequences, then construct a new MetaGraph index from those augmented contigs, avoiding a full re-processing of all original raw reads.
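If you want to script these two steps, a minimal driver might look like the sketch below. It only reassembles the two command lines shown above; the function names, the default arguments, and the dry-run switch are illustrative assumptions, and it assumes `metagraph` is on your PATH and that the rebuilt graph uses the same k as the original one. The annotation step is not included.

```python
import subprocess
from typing import List, Sequence


def rebuild_commands(
    old_graph: str,
    new_inputs: Sequence[str],
    out_graph: str = "updated_graph",
    contigs_prefix: str = "old_contigs",
    k: int = 31,
    threads: int = 16,
) -> List[List[str]]:
    """Return the two command lines from the thread as argv lists."""
    # Step 1: dump the contigs of the existing graph (as in the transform command above).
    extract = [
        "metagraph", "transform", "-v", "--to-fasta",
        "-o", contigs_prefix, "-p", str(threads), old_graph,
    ]
    # Step 2: build a joint graph from the old contigs plus the new sequences.
    build = [
        "metagraph", "build", "-v", "-p", str(threads), "-k", str(k),
        "-o", out_graph, f"{contigs_prefix}.fasta.gz", *new_inputs,
    ]
    return [extract, build]


def run_rebuild(commands: List[List[str]], dry_run: bool = True) -> None:
    """Print each command; execute it only when dry_run is False."""
    for argv in commands:
        print(" ".join(argv))
        if not dry_run:
            subprocess.run(argv, check=True)  # stop at the first failure


if __name__ == "__main__":
    # Dry run: prints the two commands without executing them.
    run_rebuild(rebuild_commands("old_graph.dbg", ["new_contigs.fasta.gz"]))
```

Passing `dry_run=False` executes the commands in order and raises on the first non-zero exit code, so the build never starts from a failed contig extraction.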

What’s the “best practice” today?

In short, in most cases, especially with large-scale data and updates, maintaining dynamic representations costs too much. It's more efficient either to rebuild the index entirely or to build a separate index for the delta, query it independently of the first large index, and then aggregate the results (e.g., in a simple Python script).

Generally, I would first suggest trying approach 1: build an extra index for each batch and query the indexes in parallel, aggregating the results client-side.

Otherwise, for large updates or when you want a single unified index, use approach 3: rebuild from contigs using the “contig reconstruction” pattern above. You can reuse contigs extracted from the existing index, or the preprocessed contigs (contig buckets) that were used to construct the original graph in the first place, so you never have to go back to the original reads.

karasikov, Dec 05 '25 16:12

Thank you very much for the detailed answer. I will try out the approaches suggested and come back here if something goes wrong. Feel free to close the issue. Cheers.

pirovc, Dec 08 '25 08:12