tsinfer icon indicating copy to clipboard operation
tsinfer copied to clipboard

Library errors encountered when using make_ancestors_ts()

Open szhan opened this issue 3 years ago • 6 comments

I'm trying to make an ancestor tree sequence out of a standard tree sequence generated using msprime using the unexposed function make_ancestors_ts(), which is part of eval_util.py.

I'm trying out the function using the following block of code:

import tsinfer
from tsinfer import make_ancestors_ts
import tskit
import msprime

print(f"tsinfer {tsinfer.__version__}")
print(f"tskit {tskit.__version__}")
print(f"msprime {msprime.__version__}")

rate_map = msprime.RateMap.uniform(
    sequence_length = 1_000_000,
    rate = 1e-8
)

ts = msprime.sim_ancestry(
    samples = 1_000,
    population_size = 10_000,
    ploidy = 2,
    model = "hudson",
    recombination_rate = rate_map
)

mts = msprime.sim_mutations(
    ts,
    rate = 1e-8
)

ts_anc = make_ancestors_ts(samples = mts.samples(), ts = mts, remove_leaves = True)

I've run the block of code multiple times. Sometimes it works, but often it leads to two errors (usually the first):

  1. LibraryError: Specified parent mutation is at a different site.
  2. LibraryError: Mutation out of bounds.

szhan avatar Mar 09 '22 16:03 szhan

Huh - I thought the problem was that we weren't updating mutation parents, but it looks like we are:

https://github.com/tskit-dev/tsinfer/blob/6ed592f62d3c92c00eb918ef2bcb50c8c8171bad/tsinfer/eval_util.py#L481]

@szhan can you make a reproducible example here please by setting the random seeds, and pasting in the stack trace?

jeromekelleher avatar Mar 09 '22 16:03 jeromekelleher

This pair of random seeds led to the stack trace below: (1) sim_ancestry - 354053995 (2) sim_mutations - 4054152783

---------------------------------------------------------------------------
LibraryError                              Traceback (most recent call last)
Input In [22], in <module>
----> 1 ts_anc = make_ancestors_ts(samples = mts.samples(), ts = mts, remove_leaves = True)

File ~/opt/anaconda3/envs/tskit-play/lib/python3.8/site-packages/tsinfer/eval_util.py:458, in make_ancestors_ts(samples, ts, remove_leaves)
    456     if np.sum(var.genotypes) > 1:
    457         position.append(var.site.position)
--> 458 reduced = subset_sites(ts, position)
    459 minimised = inference.minimise(reduced)
    461 tables = minimised.dump_tables()

File ~/opt/anaconda3/envs/tskit-play/lib/python3.8/site-packages/tsinfer/eval_util.py:440, in subset_sites(ts, position)
    432         for mutation in site.mutations:
    433             tables.mutations.add_row(
    434                 site_id,
    435                 node=mutation.node,
   (...)
    438                 metadata=mutation.metadata,
    439             )
--> 440 return tables.tree_sequence()

File ~/opt/anaconda3/envs/tskit-play/lib/python3.8/site-packages/tskit/tables.py:3258, in TableCollection.tree_sequence(self)
   3256 if not self.has_index():
   3257     self.build_index()
-> 3258 return tskit.TreeSequence.load_tables(self)

File ~/opt/anaconda3/envs/tskit-play/lib/python3.8/site-packages/tskit/trees.py:3723, in TreeSequence.load_tables(cls, tables, build_indexes)
   3720 @classmethod
   3721 def load_tables(cls, tables, *, build_indexes=False):
   3722     ts = _tskit.TreeSequence()
-> 3723     ts.load_tables(tables._ll_tables, build_indexes=build_indexes)
   3724     return TreeSequence(ts)

LibraryError: Specified parent mutation is at a different site

szhan avatar Mar 09 '22 17:03 szhan

The short version of avoiding this is to make sure you have only two alleles @szhan. It's probably simplest to stick with old style infinite sites simulations for your work, for now. I think you just need to use discrete_genome=False when doing the mutation simulations.

jeromekelleher avatar Mar 09 '22 20:03 jeromekelleher

Thanks, @jeromekelleher! By setting discrete_genome to False, I no longer get that error.

for _ in range(1_000):
    rate_map = msprime.RateMap.uniform(
        sequence_length = 1_000_000,
        rate = 1e-8
    )
    
    ts = msprime.sim_ancestry(
        samples = 1_000,
        population_size = 10_000,
        ploidy = 2,
        model = "hudson",
        recombination_rate = rate_map,
        discrete_genome = False
    )
    
    mts = msprime.sim_mutations(
        ts,
        rate = 1e-8,
        discrete_genome = False
    )
    
    ts_anc = make_ancestors_ts(samples = mts.samples(),
                               ts = mts,
                               remove_leaves = True)

I guess that when I convert the continuous coordinates to discrete coordinates to print to VCF, I need ensure that no site positions are duplicated.

szhan avatar Mar 10 '22 12:03 szhan

I'm going to leave this open as there are issues to be resolved (but good you have a temporary workaround @szhan)

jeromekelleher avatar Mar 10 '22 14:03 jeromekelleher

Re VCFs, there's a position transform option which gives you control over what happens to the coordinates on outout to vcf.

jeromekelleher avatar Mar 10 '22 14:03 jeromekelleher