make_prg
make_prg copied to clipboard
Ambiguous path production
I produced nested PRGs (--min_match_length=7, --max_nesting=5) from multiple sequence alignments of plasmodium genes and found that make_prg can produce ambiguous PRGs.
Here's an example I drew out by hand to understand

_o and _c mean site opening and closing nodes, with a numbered ID before. 7T means 7 consecutive Ts.
The path CA9T can be produced going through either sites 33/34/35 or 33/36. This can be horrible for genotyping as if the dataset has that sequence then you have to either say you can't genotype because of ambiguity, or randomly choose one of the two possibilities. The latter could cause comparison issues across datasets. For now in gramtools I've decided to bail out in these cases and null genotype, with a FILTER flag to AMBIG to signal this occurred.
Is it possible to define a total order on alternate paths that spell the same sequence? God help us
Here's another example which occurs when building PRG from m.tuberculosis, sequences from 1,017 samples, variation called with cortex in genomic positions: NC_000962.3:336600-337630 and this region starts at POS 337368
The following sequence: GCTCCGCCGGTCCCGCCGGTCC can be spelled by following two paths:
- 157/1, 160/1, 161/0
- 158/0, 159/0
At some level the problem seems to boil to excessive clustering, ie putting sequences that are very similar in different clusters.
Another example where this has clearly occurred: 209.pdf From Pfalciparum gene DBLMSP, 2,500 input sequences
This should potentially be a single variant site enumerating all the similar sequences.
These must always be horizontally offset, right? Is it possible for this to happen with just one leve of nesting? I guess the answer is yes, you can delete an A and then add two As, or add an A and then add an A
is it possible without indels? i guess not?
The kmer counting underlying the clustering ignores alignment gaps- if that's what you're referring to?
Ie, gaps don't change anything: AAA-AAA and AA-AAAA are the same for clustering (at the moment)
Doh, of course
Rachel says couyld you please put the MSAs that generated these examples in?
Sure, i will need to extract them, and reproduce the graphs- the images are subgraphs from larger PRGs.
I will do this later this week, and also think we can add tests and some work to start alleviating this, if we can clearly define what's desirable and undesirable in these situations
I will propose that the only thing you need to avoid is having two different source/sink paths which spell the same sequence.
I took a closer look at the first example, and I think this is hard to solve for all cases. We might make some improvements here and there, based on some edge cases we find by hand, and it might work for these 3 examples, but this is just a very small number of cases, we won't be sure if it improved overall. I think this is due to a misclustering of sequences, but this is unavoidable as the k-means clustering process is very heuristic, and it might undercluster or overcluster (it is really hard to set the correct inertia parameter for every cluster, if there is one). So, I think that PRGs having ambiguous paths will unavoidably happen. Having this heuristic component also makes it hard to have any algorithm that will work on all situations (I think that whatever algorithm we think about, we can't have a proof of correctness, and we will always be able to develop an edge case in such a way that a misclustering of sequence happens, and this induces ambiguous paths in the PRG).
So, we could think of post-processing algorithms to remove some edges/nodes from the PRG in order to keep a sort of path-spelling uniqueness property. That is not possible neither, I am afraid. In the first graph, it is clear we can remove the CA9T node from the PRG, as it is a redundant node: if we remove it, we can still generate all sequences that we could generate before, and without duplication. So we lose no information, and now we have path-spelling uniqueness. However, it is easy to modify this example in such a way that the CA9T node can't be removed without changing the set of paths that the PRG spells (i.e. add a sequence that can only be generated if going through the CA9T node), and this is a counter-example showing that if a PRG has ambiguous paths, we can't remove nodes or edges in order to break this ambiguity (I don't think breaking ambiguity on the cost of losing paths is worth it).
It seems to me that in some edge cases, we can have path ambiguity in the PRG, and it will be responsibility of the downstream algorithms to be aware and take care of this.
Of course I could be wrong on any of the statements above, and as in bioinformatics, maybe we don't need a correct algorithm, but one that works on our cases (and hope they work on other)...
PS: I am not advocating we replace k-means clustering. I think it makes total sense in make_prg, and we should use it as we are using now
Thanks for pitching in @leoisl , I have two thoughts: i) whatever the upstream construction algorithm, downstream application needs some kind of awareness of path ambiguity if we know it can happen. in gramtools if i detect it at genotyping, I null all the calls in the path and set a "AMBIG" filter ii) i'm not sure we can rule out being able to guarantee that all paths in produced graph spell different sequences. even if we cannot make that guarantee, i think the current clustering can be improved so as to reduce the average similarity of sequences in different branches
i think it's an interaction of
- the MSA (can we rely on the MSA algorithm to align the same sequence the same way on two different lines?)
- the kmer-size for collapse k1
- the kmer size for clustering k2
- indels and local small scale indels shorter than k2.
- nesting So i wonder if a) it might be possible to detect that there is >=1 indel and local sequence identity and decide to stop nesting further. b) i think i need to see a bunch of examples for your initial example @bricoletc i think there must be some error. you say
The path CA9T can be produced going through either sites 33/34/35 or 33/36.
I don't see how you can get CAT from 33/34/35, you can get CATTTC
i) whatever the upstream construction algorithm, downstream application needs some kind of awareness of path ambiguity if we know it can happen. in gramtools if i detect it at genotyping, I null all the calls in the path and set a "AMBIG" filter
That is interesting, we have no such things for pandora;
ii) i'm not sure we can rule out being able to guarantee that all paths in produced graph spell different sequences. even if we cannot make that guarantee, i think the current clustering can be improved so as to reduce the average similarity of sequences in different branches
Yeah, I think if I introduce some more formalisation, I can express better what I meant before. Consider the following problem:
Given a MSA M, a max nesting n, a min_match_length l, build a PRG that contains no ambiguous path
I have a feeling that this problem is NP-hard. But feeling does not mean much, a proof would be better... It might help to think about a related, but way simpler decision problem:
Given a labelled DAG G, does G have ambiguous paths?
I also have the feeling this previous problem is hard (i.e. NP-complete), so it is very unlikely we have a polynomial-time algorithm for it. But still no proof...
So I don't have much :p, we need to invest more time to come up with a proof if we want...
But in the end, we can look more at the practical side instead of the theoretical one... unless we have an idea for a polynomial-time algorithm, then the theoretical side is important. We have a trivial algorithm for the last problem if we go exponential time (i.e. list all paths, and check if we have duplicated paths), but I don't think it is a good idea... It will probably be very slow for gramtools PRGs (because they are whole genome, and we will have an exponentially large number of paths), and pandora has also some complex local graphs...
I guess the best direction would be the one you just mentioned, where we can't make the guarantee, but we make a best effort to reduce the average similarity of sequences in different branches... Zam's last reply also lean towards a more heuristic/data-oriented solution
i actually think you might fix it by using smaller kmers in the clustering
The kmer size on clustering is 7, by default... Reducing might solve some cases, but then it might bug some other cases that were not bugged, I guess...
yes, all changes may cause bugs. but smaller kmers for clustering would detect small repeats.
but i think we need a list of 5 or 10 examples of where it happens, with their MSAs
yeah, smaller kmers also cause a lot more repeated kmers (e.g. we have 16384 7-mers, but 256 4-mers)... but the k-means clustering algorithm do take the k-mer multiplicity into account, but I have no idea what is the impact of reducing/increase k-mer size (although it seems to me k=7 is already very small, IDK anything using such small k, but for this particular problem, it can't also be large)... We could try variable k-mer size...
As we are at it, we could also contemplate other ways to measure distance between sequences, or to obtain a description of sequences. Things like doing all pairs edit distance, or some clustering based on edit distance/sequence similarity, so that we are not so sensitive to indels/SNPs (which kmers are a lot), but this might change a lot the PRGs...
We need examples before we speculate - intuition from data and then move to possible solutions. one issue is that we drop a vertical slice down a matching kmer, marked KKKKK, but because of the msa and indels, we might have
record 1: aaa KKKK bbbb record 2. aa KKKK abbbb
then clustering on either side of the K's is missing the issue that the same sequence is split on either side of the K's.
Interesting- though i think the problem of creating same seq in two paths that cross multiple var sites is different from path repeats (lets call them something? like path repeats) occurring within a single (nested) var site. The latter would seem more workable by modifying the clustering only
You are right, but at the level of make-prg there is no concept of sites, apart from where you drop vertical slices of matches, separating regions within which there may be multiple sites.
Here's another example which occurs when building PRG from m.tuberculosis, sequences from 1,017 samples, variation called with
cortexin genomic positions:NC_000962.3:336600-337630and this region starts at POS337368The following sequence:
GCTCCGCCGGTCCCGCCGGTCCcan be spelled by following two paths:
- 157/1, 160/1, 161/0
- 158/0, 159/0
Here's the MSA causing this path repeat containing graph (as per @rmcolq request):
(PS: this thread is becoming really long and unreadable, dont know if there's a solution to this)
>ref
gctccgccggtcccgccggtcc
>sample_48
gctccgccgggcccgccggtcc
>sample_84
tctccgccggtcccgccggtcc
>sample_331
gctcagccggtcccgccggtcc
>sample_347
gctccgccggtcccaccggtcc
>sample_385
gctccgccggtaccgccggtcc
>sample_450
gctccgctggtcccgccggtcc
>sample_577
gctccgccggtcccgctggtcc
>sample_796
gctccgccggtcccgccggtct
>sample_870
gctccgccggtcccgcctgtcc
>sample_952
gctccgccggtcctgccggtcc
@leoisl @iqbal-lab can we establish what we would like this to be constructed as? No indels, kmer count matrix is i)Sparse (lots of 0 counts) ii)pretty 'uniform' (sequences are really quite similar, looks like on average one SNP apart)
Intuitively I might say we should not cluster and have one allele per sequence.
Er..wow. @bricoletc what k size did you use for collapsing?
The default of 7; currently min_match_length is also the kmer size used for clustering. Could be using a kmer a bit smaller than that param would be advantageous :shrug:
I agree about not clustering here "ideally". Still trying to understand. On phone while bloody laptop installs updates
:laughing: I think the key point here, is that the current algorithm will always find a grouping of sequences even if all sequences are extremely similar and there is no strong basis for it. This is because it only requires the inertia (sum of squared distances of data points to their centroids, ie assigned cluster centres) be halved; there is always a way of doing this and i think it mostly does not reach putting each seq in own cluster.
Here's a barebones example i've added as a failing test:
sequences = ["CATATAAAATA", "CATATAACATA", "CATATAAGATA", "CATATAATATA"]
seq_size = len(sequences[0])
alignment = make_alignment(sequences)
expected_clustering = [[record.id] for record in alignment]
for kmer_size in range(4, 7):
result = kmeans_cluster_seqs_in_interval(
[0, seq_size - 1], alignment, kmer_size
)
self.assertEqual(expected_clustering, result)
The sequences all differ at one SNP, we probably shouldnt cluster anything right? The clustering currently returns three groups, one of which has two sequences.
i think you're 100% right.
in case helpful, this is the data @bricoletc posted above here: https://github.com/rmcolq/make_prg/issues/15#issuecomment-714718997
