
Root node in real-data inferred tree sequence has huge number of children

Open awohns opened this issue 5 years ago • 4 comments

The root node of the TGP tree sequence has hundreds of thousands of child edges, a large proportion of which (>50%) are edges to sample nodes. This can cause issues with tsdate. Would you say this is expected, @jeromekelleher? Perhaps sample nodes attach to the root when no inferred ancestor proves to be a good match?
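For anyone wanting to reproduce these numbers, here is a minimal sketch of one way to count them. It works on plain `(left, right, parent, child)` tuples rather than a real tree sequence, and the function name `child_edge_summary` is hypothetical, not part of tsinfer or tskit.

```python
from collections import Counter


def child_edge_summary(edges, samples):
    """Return (parent, n_child_edges, sample_fraction) for the parent
    node with the most child edges.

    edges   -- iterable of (left, right, parent, child) tuples
    samples -- set of sample node IDs
    """
    edge_counts = Counter()    # child edges per parent node
    sample_counts = Counter()  # child edges whose child is a sample
    for left, right, parent, child in edges:
        edge_counts[parent] += 1
        if child in samples:
            sample_counts[parent] += 1
    busiest, n_edges = edge_counts.most_common(1)[0]
    return busiest, n_edges, sample_counts[busiest] / n_edges


# Toy example: node 4 has three child edges, two of them to samples.
toy_edges = [(0, 10, 4, 0), (0, 10, 4, 1), (0, 10, 4, 3), (0, 10, 3, 2)]
toy_samples = {0, 1, 2}
parent, n_edges, fraction = child_edge_summary(toy_edges, toy_samples)
```

With a tskit tree sequence `ts`, the inputs could probably be built as `[(e.left, e.right, e.parent, e.child) for e in ts.edges()]` and `set(ts.samples().tolist())`. Note that this counts edges, not distinct children, so a child that re-attaches to the same parent over several intervals is counted more than once.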

awohns avatar May 10 '20 19:05 awohns

I should say that this node is not the root everywhere, but is the root for a substantial portion (perhaps the majority) of the chromosome.
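A rough way to put a number on "a substantial portion" is sketched below. It assumes the same plain `(left, right, parent, child)` edge tuples and treats a node as a root wherever it has at least one child edge but no parent edge. That definition and the helper names are assumptions for illustration, not tskit's own notion of a root.

```python
def merge(intervals):
    """Merge overlapping or touching (left, right) intervals."""
    merged = []
    for left, right in sorted(intervals):
        if merged and left <= merged[-1][1]:
            merged[-1][1] = max(merged[-1][1], right)
        else:
            merged.append([left, right])
    return merged


def overlap_length(a, b):
    """Total length shared by two sorted lists of disjoint intervals."""
    i = j = 0
    total = 0.0
    while i < len(a) and j < len(b):
        lo = max(a[i][0], b[j][0])
        hi = min(a[i][1], b[j][1])
        if hi > lo:
            total += hi - lo
        # Advance whichever interval finishes first.
        if a[i][1] < b[j][1]:
            i += 1
        else:
            j += 1
    return total


def root_span(edges, node):
    """Genomic length over which `node` has children but no parent."""
    below = merge((l, r) for l, r, p, c in edges if p == node)
    above = merge((l, r) for l, r, p, c in edges if c == node)
    with_children = sum(r - l for l, r in below)
    return with_children - overlap_length(below, above)


# Toy example: node 5 has children over [0, 10) but a parent only over [0, 4).
toy_edges = [(0, 10, 5, 0), (0, 10, 5, 1), (0, 4, 6, 5), (0, 4, 6, 2)]
span_5 = root_span(toy_edges, 5)
span_6 = root_span(toy_edges, 6)
```

Dividing the result by the sequence length gives the fraction of the chromosome over which the node is topmost.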

hyanwong avatar May 10 '20 20:05 hyanwong

I think it's an artefact of the current exact-matching-only approach - hopefully this will be reduced when we've tuned the new recombination/mutation rate parameters.

jeromekelleher avatar May 11 '20 08:05 jeromekelleher

I think this is fixed by https://github.com/tskit-dev/tsinfer/pull/687

hyanwong avatar Sep 07 '22 12:09 hyanwong

Can we check whether this is now fixed in tsinfer 0.3 and, if so, close this issue, @awohns? Perhaps @szhan would be able to help make a new TGP tree sequence using the pipeline from e.g. the unified genealogy paper, and compare the distribution of the number of children per node for the root nodes?

hyanwong avatar Oct 25 '22 21:10 hyanwong