[Route] Precision Issues in the Connection Router
While working on the parallel connection router, I have found that there are situations where the precision of the cost terms is too low, and this is causing problems.
I found this problem while investigating a bug in the bitcoin_miner circuit from the Titan benchmark suite. Around iteration 68 of the router, the backward cost grows to around 3e-3 / 4e-3 for some connections. At that point the edge costs are so small that adding one to the backward cost has no effect: in single precision, the sum rounds back to the unchanged backward cost. This makes the circuit appear to consist of nothing but zero-cost edges toward the target node, which degrades the best-first search into essentially a heuristically-driven breadth-first search.
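To make the rounding behaviour concrete, here is a standalone sketch (not router code; the magnitudes are only representative of the values above):

```cpp
#include <cstdio>

int main() {
    // Representative magnitudes only: a backward cost around 4e-3 and an
    // edge cost far below float's ~1e-7 relative precision.
    float backward_cost = 4e-3f;
    float edge_cost     = 1e-10f;

    float expanded_cost = backward_cost + edge_cost;

    // In single precision the edge cost is completely absorbed, so the edge
    // looks like a zero-cost edge to the best-first search.
    std::printf("absorbed: %s\n", expanded_cost == backward_cost ? "yes" : "no");
    return 0;
}
```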
This is technically not a problem for the serial router; it may just cause it to spend more time converging. However, it is a problem for the parallel connection router.
I have not tested this, but I think the effect may be even worse when the flat router is enabled. With flat routing on, the router adds more (low-cost) edges at the end of the wave expansion, and those are particularly susceptible to this precision issue. @amin1377 I wonder if this could cause the flat router to perform worse?
I am also wondering if this is the cause of the issues @duck2 is facing with the coarse-grained parallel router. bitcoin_miner appears to be particularly problematic with these precision issues, and turning on flat routing may just make it much worse.
Proposed Solution
I propose that we change the backward and total cost terms stored in the router's data structures from float to double. This will give us more precision and should prevent this issue.
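As a rough illustration of why double should be enough here (same representative magnitudes as the sketch above, not actual router values):

```cpp
#include <cstdio>

int main() {
    // Same representative magnitudes as the float sketch above.
    double backward_cost = 4e-3;
    double edge_cost     = 1e-10;

    double expanded_cost = backward_cost + edge_cost;

    // With a 53-bit mantissa the tiny edge cost still changes the total,
    // so the cost ordering toward the target is preserved.
    std::printf("changed: %s\n", expanded_cost > backward_cost ? "yes" : "no");
    return 0;
}
```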
This is somewhat kicking the can down the road, since the same thing will happen again for even larger circuits, but I cannot think of any other way of preventing this issue. We need some way of adding a very small number to a very large number without losing precision, and I am not sure we can exploit logs for numerical stability here, since logs mainly help stabilize products rather than sums.
At least for the parallel connection router, I have found that this change fixes the issue I was facing, so it should likely work for circuits much larger than bitcoin_miner and may be sufficient for the foreseeable future.
@vaughnbetz What do you think about changing the cost terms in the router from float to double? Beyond the performance cost, are there any other reasons why the cost terms were float?
We just used float to keep the storage small, and hopefully gain some performance and reduce the memory footprint. There has never been any reason we need double precision -- the terms don't have to be that accurate.
Regarding the flat router, you are right that once the router gets into the cluster, the costs are very small. However, the router lookahead inside the cluster is fully aware of the connectivity within the cluster and prunes the connections that don't lead to the SINK. Thus, I don't think it is a huge problem for the flat router.
Additionally, we have counters that track the number of heap operations for intra-cluster and inter-cluster nodes separately. If you look at the percentages, the number of inter-cluster heap operations is only a small fraction of the intra-cluster ones. Thus, even if we gained something by moving to double, I don't think it would have a significant effect on flat router run time.
We talked about this in the group meeting. @amin1377 's recent change to cap pres_fac_mult will (I think) solve this problem by ensuring the congestion terms don't get so big that other terms get thrown away.
@vaughnbetz That is correct. I have verified that capping the pres_fac_mult to 1000 resolves the issue on the parallel router side, and does so simply. This is sufficient for us (especially because it matches our baseline). It also appears to make the QoR much more stable, at least in the experiments I am running on bitcoin_miner: for example, the CPD used to bounce between 8 and 10 ns, and now it consistently sits around 7.75 ns. Just an interesting side effect.
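For reference, the kind of cap being described is roughly the following (a minimal sketch with hypothetical names; `pres_fac`, `pres_fac_mult`, and `max_pres_fac` just stand in for the actual router parameters):

```cpp
#include <algorithm>

// Hypothetical helper: grow the present-congestion factor each routing
// iteration, but clamp it so the congestion terms can never become so large
// that the delay/base-cost terms round away in single precision.
float update_pres_fac(float pres_fac, float pres_fac_mult, float max_pres_fac) {
    return std::min(pres_fac * pres_fac_mult, max_pres_fac);
}
```

With a cap of 1000, the factor stops growing after enough iterations instead of increasing without bound, which keeps the cost terms at comparable magnitudes.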
For completeness, I also tested a different approach: splitting the backward cost term into separate congestion and delay components, maintaining both in the global structure, and handling pruning specially. This also resolved the issue; however, it is much more complex and harder to maintain. For now we will stick with capping the pres_fac.
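Roughly, the idea is something like the sketch below (illustrative names only, not the actual change):

```cpp
// Illustrative sketch: keep the congestion and delay parts of the backward
// cost separate so a tiny delay increment accumulates into a similarly small
// delay term instead of being absorbed by a much larger combined cost. The
// components are only summed when costs need to be compared.
struct BackwardCost {
    float congestion;  // accumulated congestion cost along the partial path
    float delay;       // accumulated delay cost along the partial path

    float total() const { return congestion + delay; }
};

// Pruning now has to reason about the combined value (and potentially each
// component) rather than a single stored backward cost, which is where most
// of the extra complexity and maintenance burden comes from.
```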
This issue may come up again in the future; maybe at that point people will want to swap the floats for quads. Closing the issue for now. Thanks @vaughnbetz @amin1377.
Great, thanks!