hts-specs icon indicating copy to clipboard operation
hts-specs copied to clipboard

MM tag and hard clipped alignments

Open jts opened this issue 2 years ago • 30 comments

The MM tag describes the modification string relative to the sequencing read as generated by the instrument. For instance, consider this read, where only the 3rd C is modified:

read_a    ACCTCGCCA

If the read is aligned end-to-end, the CIGAR, SEQ and modification tag would be 9M, ACCTCGCCA and MM:Z:C+m,2. If the mapper has hard clipped the first two bases of the read however then the modification tag is the same but the CIGAR and SEQ will be 2H7M, CTCGCCA. A naive interpretation of the MM tag will indicate that the wrong base is modified (in this case, the 4th C from the original read).

This situation is not currently handled by the MM tag specification. I would suggest discouraging the use of hard clips (e.g. suggest the -Y flag is used for minimap2) and saying something like "MM tags are invalid when the alignment contains hard clips".

jts avatar May 18 '22 16:05 jts

The topic of hard clipping came up during the original PR discussion, in https://github.com/samtools/hts-specs/pull/418#issuecomment-726143844 and the following comments. I believe we concluded that using MM:Z:N+m,4 to represent this modification of the 3rd C would be immune to hard clipping by a non-MM-aware mapper (from https://github.com/samtools/hts-specs/pull/418#issuecomment-742569660, which I think is the last clipping related comment):

I think I'm satisfied that using the N-based notation works around [the hard-clipping problem]. So all we would need to do is add a sentence recommending that only N be used if the data is to be processed and potentially hard-clipped by a non-MM-aware tool.

However I don't think we ever added such a sentence.

That would be a more generous alternative to your suggested sentence disallowing MM+hard clipping entirely.

jmarshall avatar May 18 '22 18:05 jmarshall

Thanks @jmarshall, I missed that when reviewing the original discussion before posting this issue. I agree using N works around this issue.

Is the mapper allowed to modify/fix MM tag when hard-clipping the reads? If so a downstream consumer of mod bams will need a way to determine whether the non-N MM tags for hard-clipped reads are reliable or not.

jts avatar May 18 '22 19:05 jts

You're right that — either way — this needs more details in the spec.

The N notation enables non-MM-aware mappers to hard clip reads and leave MM as is, such that downstream tools that do understand MM can parse MM-with-Ns correctly (on the understanding that the N counts include the bases within 2H in their counting).

What MM-aware mappers should do is another question. For either exact-base-MM or MM-with-Ns, they would be capable of adjusting the counts, so that downstream tools could parse either correctly by considering the 2H bases not to be included in the counts.

These two are inconsistent… What to do?

  • Disallow the combination of hard clipping and MM
  • Specify that H bases are never included in the MM counts, and accept that non-MM-aware mappers will corrupt MM on reads that they hard-clip (currently the spec is not explicit whether hard-clipped bases are included in the counting)
  • Specify that for MM-with-Ns, H bases are counted, but for MM-with-exact-bases, H bases are not counted (the inconsistency is not attractive!)
  • Add another optional flag alongside . and ? that says “the counting starts after any hard-clipped bases”
  • Something else…

jmarshall avatar May 18 '22 19:05 jmarshall

I've been mulling this over for a few days and leaning towards adding another flag. Without a flag the counts can be relative to the original (possibly unknown) sequence, and when the flag is set the counts are relative to SEQ instead.

jts avatar May 20 '22 19:05 jts

This is one of those things which just becomes akward once hard clipping is used. We could state that if hard clipping is to be used then the N notation must be used, but it's punting the problem elsewhere and you'd need to set a flag in the methylation caller (which may not even exist) before hand, meaning the person doing the caller must know of the aligner options a downstream user may use. That may not be possible if your sequencing is split up into a data provider (eg a sequencing centre) and a data consumer (some research group). It's just messy!

I think it's best to recommend that hard-clipping isn't used. We can and should also be explicit with how to interpret the data when something wrong happens. So either something like this:

If an MM tag exists on a record with hard clips in the CIGAR field, the MM tag must be ignored.

Or if we wish to permit the possibility of working around with the N code:

Base numbering in the MM tag refers to the original sequence produced by the base caller. If the sequence has been hard-clipped using the H CIGAR code, then the MM tag is only valid if it uses the N fundamental base. In this case the size of the soft-clip must be used when computing the position along the SEQ field.

I prefer the former as it's simple, and the N trick feels a bit like a hack (similar to how someone would have to handle e.g. U2/E2, although AFAIK they're never used). Frankly hard-clips only tend to occur with secondary alignments, but I guess there may still be some desire to know base modification statuses for multiple alignments?

This does however mean there's no way for an MM aware aligner to work properly, which I assume is the reason behind @jmarshall 's comment above about an additional flag. If we add this, then we'd need a similar language to above to describe the flag-less state (ie it's invalid so ignore), as well as a description of the flag so we can document how to do things properly. So I could get behind this additional flag proposal too.

Care to suggest any recommended code? It could be something inline like ? and ., or it could be a separate general purpose aux tag to indicate the starting position for the MM tag. Eg for where only part of a sequence has been analysed, but also for where the sequence is part of a larger unrecorded fragment.

Edit: or (ghastly!) actually the only additional information we need when hard clipping is how many of each base type were clipped. I don't like where that's going though, and just starting so-many bases into the original unclipped fragment is a cleaner approach. Either way the presence of such data, as is the presence of an additional character flag, is what distinguishs the hard-clipped record with MM tag from being valid or invalid.

jkbonfield avatar May 23 '22 11:05 jkbonfield