tskit icon indicating copy to clipboard operation
tskit copied to clipboard

provide feedback on `max_iter` for `extend_haplotypes`

Open petrelharp opened this issue 1 year ago • 5 comments

After #2938 goes in, we need to do something so that the user knows whether the algorithm terminated because of running out of things to do or because it ran into max_iter. And, give guidance in the docs about what max_iter should be. I propose throwing a warning if max_iter is reached; we need to do some experiments to determine a good suggested value.

petrelharp avatar Sep 17 '24 23:09 petrelharp

Return a dataclass consisting of the tree sequence and algorithm convergence details? Relying on side channels for info like this is tricky, I'd suggest something like

@dataclasses.dataclass
class ExtendHaplotypesResult:
     tree_sequence: TreeSequence
     iterations: int

jeromekelleher avatar Sep 18 '24 09:09 jeromekelleher

@hfr1tz3 has done some experiments here:

  1. doing 1e4 samples on 50Mb was taking like 10min per iteration
  2. nearly all the edges are added in the first iteration - like 99% on first iteration; 1% on second iteration; just a handful (like 3) in remaining iterations
  3. it always terminates within 5 iterations
  4. this is confirmed on many more reps of fewer samples

Proposal is to (a) set default to 10; and say in docstring that if speed is a concern, setting to 1 or 2 should get nearly everything; and it is possible though unlikely that it hasn't converged by 10 - re-running will verify.

So: no need for a dataclass.

petrelharp avatar Oct 21 '24 20:10 petrelharp

Can you imagine this changing in future, e.g, using a slightly different algorithm? What if we returned those numbers of edges per iteration in the dataclass? It worry that people won't get these nuances unless the data is readily available in the result

jeromekelleher avatar Oct 22 '24 08:10 jeromekelleher

I can imagine it, but I don't think it's terribly likely? If we did want to do that, we'd probably give the method a new name and deprecate the old one? I'm proposing this because the numbers seem very clear that users can ignore this detail 98% of the time, and the note about speeding it up by setting max_iter=1 will get the remaining 2%. So, I'd rather keep this as a simple "ts tranformation" method, like the others, that returns a modified ts.

petrelharp avatar Oct 22 '24 16:10 petrelharp

Ah ok, I forgot this had been released and documented already. Ignore what I said then

jeromekelleher avatar Oct 22 '24 19:10 jeromekelleher

Closing for inactivity and labelling "future", please re-open if you plan to work on this.

benjeffery avatar Jun 15 '25 23:06 benjeffery