a tree-sequence-validation feature inside pyslim would be useful

The issue https://github.com/MesserLab/SLiM/issues/71 made me think of this. It would be cool if pyslim could perform a validation of a tree sequence, including both the tskit information and the SLiM metadata, to catch a wide variety of problems. SLiM's crosscheck can catch some problems, as we have seen with that issue, but I'm sure there are all kinds of problems that SLiM is not equipped to catch, and in any case it would be good to have a separate validation codebase that doesn't depend on SLiM. I'm thinking of things like:

inconsistencies in the references across tables
table entries like sites or mutations that are not referenced at all but have not been stripped out
as in the linked issue, SLiM metadata inconsistencies like derived states at different positions referring to the same mutation ID

I'm sure one could think of quite a few things to test, and who knows what bugs it might catch for us later; I'm a big believer in self-consistency checks, like SLiM's crosscheck. If it weren't too slow, it could run automatically whenever a .trees file is loaded; the best way to catch problems is to make the check part of the standard code path whenever possible.

Jan 31 '20 14:01 bhaller

Sounds like a good idea to me. Some quick thoughts:

inconsistencies in the references across tables

These are checked when a tree sequence is loaded in tskit. If the tree sequence loads up in Python, then internal cross references etc. are guaranteed to be good.

table entries like sites or mutations that are not referenced at all but have not been stripped out

The simplest way to catch this is to run simplify. You could check whether the tables before simplify == the tables after simplify, depending on how expensive you want to make this.
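A minimal sketch of that before/after comparison, assuming `ts` is a tskit tree sequence that has already been loaded. The helper name `unchanged_by_simplify` is made up for illustration and is not part of tskit or pyslim. Note that simplify can also renumber nodes, so a mismatch only says that something changed, not that unreferenced rows are the cause.

```python
def unchanged_by_simplify(ts):
    """Return True if simplifying `ts` leaves its tables unchanged.

    `ts` is assumed to be a tskit.TreeSequence. This helper is a
    hypothetical illustration, not an existing tskit or pyslim function.
    """
    simplified = ts.simplify()
    # simplify() adds a provenance record by default, so the provenance
    # table is left out of the comparison.
    return ts.tables.equals(simplified.tables, ignore_provenance=True)


# Possible usage (the file name is a placeholder):
#   import tskit
#   print(unchanged_by_simplify(tskit.load("example.trees")))
```

The cost is one full simplify plus a table comparison, which is why this probably belongs behind an option rather than on every load.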

as in the linked issue, SLiM metadata inconsistencies like derived states at different positions referring to the same mutation ID

That's where pyslim takes over...

Jan 31 '20 14:01 jeromekelleher

Agreed. This would be the first step in having some tools to take a non-SLiM tree sequence and make it ready to load into SLiM, or to modify a SLiM tree sequence so that it can still be reloaded. Right now this is still inscrutable to me.

Jan 31 '20 14:01 petrelharp

Note: here's some code that checks for some of this (`both` is a tree sequence that has already been loaded):

# check mutations are consistent
mut_info = {}
for m in both.mutations():
    for a, md in zip(m.derived_state.split(","), m.metadata['mutation_list']):
        if a in mut_info:
            assert mut_info[a] == md, f"Mismatch for ID {a}: {mut_info[a]} and {md} differ."
        else:
            mut_info[a] = md

# check individual IDs are unique
# and consistent with genome IDs
ind_ids = set()
for ind in both.individuals():
    pid = ind.metadata['pedigree_id']
    assert pid not in ind_ids, f"Duplicate individual ID {pid}"
    ind_ids.add(pid)
    # the two genomes of individual pid should have SLiM IDs 2*pid and 2*pid + 1
    node_ids = {both.node(n).metadata['slim_id'] for n in ind.nodes}
    assert node_ids == {2 * pid, 2 * pid + 1}, f"Genome IDs {node_ids} do not match individual ID {pid}"

May 01 '21 13:05 petrelharp

From #163: be sure we test for

duplicate pedigree IDs (see e.g. https://github.com/MesserLab/SLiM/issues/178)
duplicate mutation IDs
mismatching mutation information
mismatching genome + individual IDs
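For what it's worth, here is a rough sketch of how those four checks could be gathered into one function that reports every problem instead of stopping at the first failed assert. Everything about its shape is an assumption for illustration: the name `find_slim_metadata_problems` and the plain tuples and dict it takes as input are hypothetical, not a pyslim API, and the genome ID rule (2 * pid and 2 * pid + 1) is simply copied from the snippet earlier in this thread.

```python
def find_slim_metadata_problems(mutations, individuals, node_slim_ids):
    """Collect problems as strings rather than raising on the first one.

    Hypothetical input shapes (not a pyslim API):
      mutations: iterable of (position, derived_state, mutation_list), where
        derived_state is a comma-separated string of SLiM mutation IDs and
        mutation_list holds one metadata entry per ID.
      individuals: iterable of (pedigree_id, node_ids).
      node_slim_ids: dict mapping node ID to that node's SLiM genome ID.
    """
    problems = []

    # Mutation IDs: the same ID must not appear at two positions, and must
    # always carry the same metadata.
    first_seen = {}
    for position, derived_state, mutation_list in mutations:
        mut_ids = derived_state.split(",") if derived_state else []
        if len(mut_ids) != len(mutation_list):
            problems.append(
                f"position {position}: {len(mut_ids)} mutation IDs but "
                f"{len(mutation_list)} metadata entries"
            )
        for mut_id, md in zip(mut_ids, mutation_list):
            if mut_id not in first_seen:
                first_seen[mut_id] = (position, md)
                continue
            first_pos, first_md = first_seen[mut_id]
            if first_pos != position:
                problems.append(
                    f"mutation ID {mut_id} used at positions "
                    f"{first_pos} and {position}"
                )
            if first_md != md:
                problems.append(
                    f"mutation ID {mut_id}: {first_md} and {md} differ"
                )

    # Individuals: pedigree IDs must be unique, and genome IDs must match
    # the rule used in the snippet above (2 * pid and 2 * pid + 1).
    seen_pids = set()
    for pid, node_ids in individuals:
        if pid in seen_pids:
            problems.append(f"duplicate pedigree ID {pid}")
        seen_pids.add(pid)
        found = {node_slim_ids[n] for n in node_ids}
        expected = {2 * pid, 2 * pid + 1}
        if found != expected:
            problems.append(
                f"individual {pid}: genome IDs {sorted(found)}, "
                f"expected {sorted(expected)}"
            )

    return problems
```

Returning a list makes it easy to turn each item on the list above into a unit test: build a small bad input and check that the matching message comes back.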

Dec 12 '21 02:12 petrelharp
