
Normalized Hamming Distance

Open · surajg4 opened this issue on Mar 25, 2021 · 4 comments

It would be better to have a normalization option in the distance metrics for BCR support. It is also still not clear to me how to use the abstract class DistanceCalculator (a tutorial could help).

surajg4 · Mar 25 '21

Pinging @ktpolanski, who added the Hamming distance feature: what do you think about the normalization?


Regarding the DistanceCalculator: What's your question in particular? I think it should be feasible to implement a custom distance calculator that inherits from the abstract base class by looking at the examples and docstrings in metrics.py.
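
For illustration, here is a rough sketch of what such a subclass could look like. The module path and the exact interface are assumptions based on metrics.py, and a normalized (fractional) distance has to be squeezed into scirpy's integer distance encoding, so treat this as a starting point rather than a working implementation:

```python
import scipy.sparse as sp

# module path may differ between scirpy versions; see metrics.py
from scirpy.ir_dist.metrics import DistanceCalculator


class NormalizedHammingDistanceCalculator(DistanceCalculator):
    """Hamming distance divided by sequence length, stored as a percentage.

    scirpy stores distances as small integers where 0 means "no connection"
    and 1 means "identical", so the normalized distance is rounded to a
    percentage and offset by 1.
    """

    def __init__(self, cutoff: int = 20):
        # cutoff is a percentage of mismatched positions
        super().__init__(cutoff)

    def calc_dist_mat(self, seqs, seqs2=None):
        seqs2 = seqs if seqs2 is None else seqs2
        rows, cols, data = [], [], []
        for i, s1 in enumerate(seqs):
            for j, s2 in enumerate(seqs2):
                if len(s1) != len(s2):
                    continue  # plain Hamming is undefined for unequal lengths
                pct = round(100 * sum(a != b for a, b in zip(s1, s2)) / len(s1))
                if pct <= self.cutoff:
                    rows.append(i)
                    cols.append(j)
                    data.append(pct + 1)  # offset: stored 0 = not connected
        return sp.csr_matrix((data, (rows, cols)), shape=(len(seqs), len(seqs2)))
```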

grst · Mar 26 '21

Does normalisation imply dividing by the length of the sequence? If so, then that puts us pretty close to what Dandelion does, no?

ktpolanski · Mar 26 '21

> Does normalisation imply dividing by the length of the sequence? If so, then that puts us pretty close to what Dandelion does, no?

This would put it more in line with what immcantation's SHazaM does. Both scirpy and dandelion group sequences of identical length together first before performing the calculations, hence they don't require length normalization.

The normalization in SHazaM is done as follows:

```r
if (normalize == "len") {
    dist_mat <- dist_mat / seq_length
}
```

For dist_mat, they use a sliding-window approach to bin each sequence into 5-mers (replacing gaps with Ns, and padding the first and last 5-mers with Ns to keep them the same size), and the Hamming distance is calculated for each pair of 5-mers in order. The total distance is then summed and subsequently divided by the sequence length, as in the function above.
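
Stripped of the 5-mer machinery, the core divide-by-length idea looks something like this minimal Python sketch. It uses naive end-padding with Ns and counts N positions as mismatches, which is a simplification of what SHazaM actually does:

```python
def normalized_hamming(seq1: str, seq2: str) -> float:
    """Hamming distance divided by sequence length.

    The shorter sequence is padded with 'N' so that substitutions remain
    the only operation; N positions count as mismatches in this sketch.
    """
    length = max(len(seq1), len(seq2))
    s1, s2 = seq1.ljust(length, "N"), seq2.ljust(length, "N")
    mismatches = sum(a != b for a, b in zip(s1, s2))
    return mismatches / length


print(normalized_hamming("CARDYW", "CARDGYW"))  # 3/7 ≈ 0.43 with naive padding
```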

This should allow BCRs with different lengths to be grouped into a clonotype if they pass the similarity cut-off, while keeping it a substitution-only context. immcantation recommends a model-based approach for choosing the cut-off, looking for a bimodal pattern in the distribution of normalized Hamming distances, but also leaves it up to the user to define a manual cut-off based on visual inspection of the histogram.
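
To illustrate the model-based cut-off, a small sketch that mirrors the idea behind SHazaM's findThreshold (but is not its actual implementation): fit a density to the nearest-neighbour distance distribution and take the valley between the two modes.

```python
import numpy as np
from scipy.signal import argrelmin
from scipy.stats import gaussian_kde


def density_threshold(distances: np.ndarray) -> float:
    """Return the deepest valley between the modes of a bimodal density."""
    kde = gaussian_kde(distances)
    grid = np.linspace(distances.min(), distances.max(), 512)
    density = kde(grid)
    minima = argrelmin(density)[0]
    if minima.size == 0:
        raise ValueError("distribution does not look bimodal; set a cut-off manually")
    return float(grid[minima[np.argmin(density[minima])]])


# `distances` would hold each sequence's normalized Hamming distance
# to its nearest (non-identical) neighbour:
# threshold = density_threshold(distances)
```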

It's frequently used, and I can understand its appeal, as it allows for more relaxed/unbiased grouping and discovery of potentially related BCR patterns that use the same V- and J-genes. It does "violate" the same-length requirement for BCR somatic hypermutation that textbooks teach us, but you could argue the length differences are due to technical issues like sequencing errors.

An easy way to do all this is to export to the AIRR format and use the immcantation tools directly, or convert to dandelion, where wrappers for these immcantation tools are available, and then read the results back into scirpy. Alternatively, you could try to implement it in some fashion here, but that would require a new class to handle it.
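
A rough sketch of that round trip, assuming `adata` holds your AnnData with receptor data and that scirpy's AIRR I/O helpers (write_airr/read_airr in scirpy.io) are available in your version; the dandelion converters (to_dandelion/from_dandelion) are an alternative route if present:

```python
import scirpy as ir

# adata: your AnnData object with IR data already loaded

# export the receptor data to the standard AIRR rearrangement format
ir.io.write_airr(adata, "rearrangements.tsv")

# ... run the immcantation tools on the TSV in R,
# e.g. shazam::distToNearest and shazam::findThreshold ...

# read the annotated rearrangements back into scirpy
adata_back = ir.io.read_airr("rearrangements_annotated.tsv")
```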

zktuong · Mar 30 '21

I think we're all talking about slightly different things. If we take the Hamming distance and divide it by the sequence length, we obtain a "percent of mismatches" measure, which is what dandelion does. For example, 3 mismatches over a junction of length 12 give a normalized distance of 0.25. That is, assuming this is what the OP meant by "normalised Hamming".

ktpolanski · Mar 30 '21