pyslim
a tree-sequence-validation feature inside pyslim would be useful
The issue https://github.com/MesserLab/SLiM/issues/71 made me think of this. It would be cool if pyslim could perform a validation of a tree sequence, including both the tskit information and the SLiM metadata, to catch a wide variety of problems. SLiM's crosscheck can catch some problems, as we have seen with that issue, but I'm sure there are all kinds of problems that SLiM is not equipped to catch, and in any case it would be good to have a separate validation codebase that doesn't depend on SLiM. I'm thinking of things like:
- inconsistencies in the references across tables
- table entries like sites or mutations that are not referenced at all but have not been stripped out
- as in the linked issue, SLiM metadata inconsistencies like derived states at different positions referring to the same mutation ID
I'm sure one could think of quite a few things to test, and who knows what bugs it might catch for us later; I'm a big believer in self-consistency checks, like SLiM's crosscheck. If it weren't too slow, it could run automatically whenever a .trees file is loaded; the best way to catch problems is to make the check part of the standard code path whenever possible.
Sounds like a good idea to me. Some quick thoughts:
- inconsistencies in the references across tables
These are checked by tskit when a tree sequence is loaded. If the tree sequence loads in Python, then the internal cross-references etc. are guaranteed to be good.
- table entries like sites or mutations that are not referenced at all but have not been stripped out
The simplest way to catch this is to run simplify. You could check whether the tables before simplify == the tables after simplify, depending on how expensive you want to make this.
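A cheaper, more targeted alternative to a full simplify-and-compare is to look directly for rows that nothing points at. The sketch below is only an illustration: `unreferenced_rows` is a hypothetical helper, not part of tskit or pyslim, and it works on plain sequences of integer IDs so that it can be tried without a tree sequence.

```python
from typing import Iterable, List

NULL = -1  # tskit uses -1 (tskit.NULL) to mean "no reference"


def unreferenced_rows(num_rows: int, references: Iterable[int]) -> List[int]:
    """Return the row IDs in range(num_rows) that never appear in `references`.

    Hypothetical helper for illustration; NULL entries are ignored.
    """
    seen = {int(r) for r in references if int(r) != NULL}
    return [row for row in range(num_rows) if row not in seen]


# Toy data: three sites, but the mutations only sit on sites 0 and 2.
orphan_sites = unreferenced_rows(3, [0, 2, 2])
print(orphan_sites)  # [1]
```

With real tables, the same call could presumably be fed `tables.sites.num_rows` and `tables.mutations.site` to find sites without mutations, or `tables.individuals.num_rows` and `tables.nodes.individual` to find individuals without nodes.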
- as in the linked issue, SLiM metadata inconsistencies like derived states at different positions referring to the same mutation ID
That's where pyslim takes over...
Agreed. This would be the first step in having some tools to take a non-SLiM tree sequence and make it ready to load into SLiM, or to modify a SLiM tree sequence in a way that it can still be reloaded. Right now this is still inscrutable to me.
Note: here's some code that checks for some of this:
# `both` is the tree sequence being checked
# check mutations are consistent
mut_info = {}
for m in both.mutations():
    for a, md in zip(m.derived_state.split(","), m.metadata['mutation_list']):
        if a in mut_info:
            assert mut_info[a] == md, f"Mismatch for ID {a}: {mut_info[a]} and {md} differ."
        else:
            mut_info[a] = md
# check individual IDs are unique
# and consistent with genome IDs
ind_ids = set()
for ind in both.individuals():
    pid = ind.metadata['pedigree_id']
    assert pid not in ind_ids, f"Duplicate individual ID {pid}"
    ind_ids.add(pid)
    node_ids = [both.node(n).metadata['slim_id'] for n in ind.nodes]
    assert set(node_ids) == {2 * pid, 2 * pid + 1}, f"Genome IDs {node_ids} do not match individual ID {pid}"
From #163: be sure we test for
- duplicate pedigree IDs (see e.g. https://github.com/MesserLab/SLiM/issues/178)
- duplicate mutation IDs
- mismatching mutation information
- mismatching genome + individual IDs
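One way the four checks in this list might be combined is a function that collects every problem instead of stopping at the first failed assert. This is a rough sketch, not pyslim code: `find_slim_metadata_problems` is a made-up name, the rule that genome IDs equal `2 * pedigree_id` and `2 * pedigree_id + 1` is taken from the snippet earlier in this thread and assumes two nodes per individual, and the `toy` object is a stand-in that mimics only the few attributes used, so the example runs without a real .trees file.

```python
from types import SimpleNamespace as NS


def find_slim_metadata_problems(ts):
    """Return a list of human-readable problems found in SLiM metadata."""
    problems = []

    # Mutation IDs: each ID should live at one position with one metadata record.
    seen = {}  # mutation ID -> (position, metadata)
    for site in ts.sites():
        for mut in site.mutations:
            ids = mut.derived_state.split(",")
            for mut_id, md in zip(ids, mut.metadata["mutation_list"]):
                pos, old_md = seen.setdefault(mut_id, (site.position, md))
                if pos != site.position:
                    problems.append(
                        f"mutation ID {mut_id} appears at positions "
                        f"{pos} and {site.position}"
                    )
                elif old_md != md:
                    problems.append(f"mutation ID {mut_id} has mismatching metadata")

    # Pedigree IDs: unique, and consistent with the genome (node) IDs.
    pedigree_ids = set()
    for ind in ts.individuals():
        pid = ind.metadata["pedigree_id"]
        if pid in pedigree_ids:
            problems.append(f"duplicate pedigree ID {pid}")
        pedigree_ids.add(pid)
        genome_ids = {ts.node(n).metadata["slim_id"] for n in ind.nodes}
        if genome_ids != {2 * pid, 2 * pid + 1}:
            problems.append(
                f"genome IDs {sorted(genome_ids)} do not match pedigree ID {pid}"
            )
    return problems


# Toy stand-in for a tree sequence (an assumption for illustration only).
md = {"selection_coeff": 0.0}
mut = NS(derived_state="7", metadata={"mutation_list": [md]})
nodes = [NS(metadata={"slim_id": i}) for i in range(4)]
toy = NS(
    sites=lambda: [NS(position=10.0, mutations=[mut]),
                   NS(position=20.0, mutations=[mut])],
    individuals=lambda: [NS(metadata={"pedigree_id": 0}, nodes=[0, 1]),
                         NS(metadata={"pedigree_id": 0}, nodes=[2, 3])],
    node=lambda n: nodes[n],
)
problems = find_slim_metadata_problems(toy)
for p in problems:
    print(p)
```

Returning a list rather than asserting makes it easier to report every inconsistency at once, which seems more useful if the check is ever run automatically on load.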