tsinfer
tsinfer copied to clipboard
Library errors encountered when using make_ancestors_ts()
I'm trying to make an ancestor tree sequence out of a standard tree sequence generated using msprime using the unexposed function make_ancestors_ts(), which is part of eval_util.py.
I'm trying out the function using the following block of code:
import tsinfer
from tsinfer import make_ancestors_ts
import tskit
import msprime
print(f"tsinfer {tsinfer.__version__}")
print(f"tskit {tskit.__version__}")
print(f"msprime {msprime.__version__}")
rate_map = msprime.RateMap.uniform(
sequence_length = 1_000_000,
rate = 1e-8
)
ts = msprime.sim_ancestry(
samples = 1_000,
population_size = 10_000,
ploidy = 2,
model = "hudson",
recombination_rate = rate_map
)
mts = msprime.sim_mutations(
ts,
rate = 1e-8
)
ts_anc = make_ancestors_ts(samples = mts.samples(), ts = mts, remove_leaves = True)
I've run the block of code multiple times. Sometimes it works, but often it leads to two errors (usually the first):
- LibraryError: Specified parent mutation is at a different site.
- LibraryError: Mutation out of bounds.
Huh - I thought the problem was that we weren't updating mutation parents, but it looks like we are:
https://github.com/tskit-dev/tsinfer/blob/6ed592f62d3c92c00eb918ef2bcb50c8c8171bad/tsinfer/eval_util.py#L481]
@szhan can you make a reproducible example here please by setting the random seeds, and pasting in the stack trace?
This pair of random seeds led to the stack trace below: (1) sim_ancestry - 354053995 (2) sim_mutations - 4054152783
---------------------------------------------------------------------------
LibraryError Traceback (most recent call last)
Input In [22], in <module>
----> 1 ts_anc = make_ancestors_ts(samples = mts.samples(), ts = mts, remove_leaves = True)
File ~/opt/anaconda3/envs/tskit-play/lib/python3.8/site-packages/tsinfer/eval_util.py:458, in make_ancestors_ts(samples, ts, remove_leaves)
456 if np.sum(var.genotypes) > 1:
457 position.append(var.site.position)
--> 458 reduced = subset_sites(ts, position)
459 minimised = inference.minimise(reduced)
461 tables = minimised.dump_tables()
File ~/opt/anaconda3/envs/tskit-play/lib/python3.8/site-packages/tsinfer/eval_util.py:440, in subset_sites(ts, position)
432 for mutation in site.mutations:
433 tables.mutations.add_row(
434 site_id,
435 node=mutation.node,
(...)
438 metadata=mutation.metadata,
439 )
--> 440 return tables.tree_sequence()
File ~/opt/anaconda3/envs/tskit-play/lib/python3.8/site-packages/tskit/tables.py:3258, in TableCollection.tree_sequence(self)
3256 if not self.has_index():
3257 self.build_index()
-> 3258 return tskit.TreeSequence.load_tables(self)
File ~/opt/anaconda3/envs/tskit-play/lib/python3.8/site-packages/tskit/trees.py:3723, in TreeSequence.load_tables(cls, tables, build_indexes)
3720 @classmethod
3721 def load_tables(cls, tables, *, build_indexes=False):
3722 ts = _tskit.TreeSequence()
-> 3723 ts.load_tables(tables._ll_tables, build_indexes=build_indexes)
3724 return TreeSequence(ts)
LibraryError: Specified parent mutation is at a different site
The short version of avoiding this is to make sure you have only two alleles @szhan. It's probably simplest to stick with old style infinite sites simulations for your work, for now. I think you just need to use discrete_genome=False when doing the mutation simulations.
Thanks, @jeromekelleher! By setting discrete_genome to False, I no longer get that error.
for _ in range(1_000):
rate_map = msprime.RateMap.uniform(
sequence_length = 1_000_000,
rate = 1e-8
)
ts = msprime.sim_ancestry(
samples = 1_000,
population_size = 10_000,
ploidy = 2,
model = "hudson",
recombination_rate = rate_map,
discrete_genome = False
)
mts = msprime.sim_mutations(
ts,
rate = 1e-8,
discrete_genome = False
)
ts_anc = make_ancestors_ts(samples = mts.samples(),
ts = mts,
remove_leaves = True)
I guess that when I convert the continuous coordinates to discrete coordinates to print to VCF, I need ensure that no site positions are duplicated.
I'm going to leave this open as there are issues to be resolved (but good you have a temporary workaround @szhan)
Re VCFs, there's a position transform option which gives you control over what happens to the coordinates on outout to vcf.