tskit IBD segment definition and memory storage issues

I have spent some time recently with the new IBD segment functionality in the recent 4.1 version. The tutorial is here: https://tskit.dev/tskit/docs/stable/ibd.html. I find that the current IBD segment definition challenges the use of min_span filter parameter to control the memory use when using the store_segments = True option. Namely, tskit currently may view these as distinct IBD segments:

IdentitySegment(left = 2.0, right = 5.0, node = 42)
IdentitySegment(left = 5.0, right = 8.0, node = 42)

If I then use min_span = 4.0, I lose both of these IBD segments, when in reality I would like to stitch together these segments since the left and right endpoints match and the ancestral node is shared. Below is an example Python function to stitch together such IdentitySegment objects in an IdentitySegmentList.

def stitch_ibd(x):
    '''Stitch together IdentitySegment class objects
    
    Parameters
    ----------
    x : IdentitySegmentList
        See tskit documentation
    
    Returns
    -------
    list of IdentitySegment class objects
    '''
    # list and sort
    x = [i for i in x]
    def right(l): # sorting utility
        return l.right
    x.sort(reverse = False, key = right)
    
    # stitch together
    y = []
    i = x[0]
    for j in x[1:]:
        if (i.node == j.node) and (i.right == j.left):
            i = tskit.IdentitySegment(i.left, j.right, i.node)
        else:
            y.append(i)
            i = j
    y.append(i)
    
    return y

Unfortunately, stitching together these small IBD segments results in large memory use because you have to bring in many tiny segments and then stitch them together. I find in simulations that after stitching together tskit IBD segments these segments and the distribution of them match those of IBD detection software like hap-ibd and GERMLINE. I am wondering if there is a better fix to this new utility such that the min_span filter option can be better employed. In practice, many IBD detection software programs will only detect segments that exceed a given cM length threshold.

Feb 24 '22 18:02 sdtemple

Nice observation - I have a question: what about situations like

IdentitySegment(left = 2.0, right = 5.0, node = 42)
IdentitySegment(left = 5.0, right = 8.0, node = 43)

(ie. adjacent and pretty-long segments, but with different ancestral nodes)? If nodes 42 and 43 were equally recent then I would think this situation would affect hap-ibd/GERMLINE just as much. And, if so, you could perhaps put a shorter (but not super short) length filter along with a time filter, and merge the segments? However, it'd be a very interesting empirical observation if it's mostly just situations like this with the same ancestral node that are contributing to differences.

(A small note here: the function you have is very similar to EdgeTable.squash; however, this might not be a useful observation.)

Feb 24 '22 23:02 petrelharp

Thank you for the suggestion to use the EdgeTable.squash method. This may be helpful, and I can look into it more.

For the example you provided, we do not stitch together these IdentitySegment objects because they have different ancestral nodes. An identity by descent segment derives from a most recent common ancestor.

The issue I am bringing to attention I believe is alluded to in the Identity by descent tskit tutorial (images below). This may have to do with the way ts.ibd_segments(store_segments = True, ...) derives the ibd segments by processing in a tree sequence way. The output ibd segments belong to a given tree sequence. Ideally, as we move along the tree sequence, we want to stitch together ibd segments if ancestral nodes match and right endpoint equals left endpoint, versus my function stitch_ibd that stitches together the segments post-hoc. The post-hoc approach is memory intensive because we require the memory for all the marginal ibd segments when calling ts.ibd_segments(store_segments = True, ...). As well, if this stitching is performed within the ts.ibd_segments(store_segments = True, ...) call, we could then use the filtering options max_time and/or min_span. Most ibd detection software cannot reliably infer segments less than 2 cM and/or older than 300 generations..

Feb 25 '22 18:02 sdtemple

For the example you provided, we do not stitch together these IdentitySegment objects because they have different ancestral nodes. An identity by descent segment derives from a most recent common ancestor.

Yes, that's right, but methods like GERMLINE or hap-IBD don't know what the ancestor is, and it's common in post-processing of IBD segments to merge adjacent segments (in which case segments are either IBD or not; with no labeling by common ancestor). That's why I'm wondering if you've tried merging adjacent segments with different nodes as well.

Ideally, as we move along the tree sequence, we want to stitch together ibd segments if ancestral nodes match and right endpoint equals left endpoint,

I believe the current behavior is intentional, actually - the definition that we're using is that segments that are inherited from the same common ancestor but along different paths through the pedigree are different segments. (So, I think that in your example the distinct segments are inherited along different paths). However, I agree that it would be nice to have an option that merges these together, as you're suggesting.

Feb 25 '22 18:02 petrelharp

@petrelharp is right in that the behaviour is intentional and reflects the definition that we're using. We can definitely explore having an option to change that, although it's not immediately obvious to me how easy that would be. If you're interested, the test implementation is the best place to start. The simple definition and examples in the test suite might also be helpful to have a look at.

Feb 28 '22 11:02 jeromekelleher

I have not yet tried to merge adjacent segments with different ancestral nodes. It would be interesting to see the extent to which such a modified procedure impacts the resulting IBD segment statistics.

Thank you both for the suggestions and links to relevant tskit functions and code. I can follow-up in the coming weeks/months as I explore these IBD nuances more.

Feb 28 '22 17:02 sdtemple

tskit tskit copied to clipboard

IBD segment definition and memory storage issues

tskit
tskit copied to clipboard