Possible bug in DCAwareRoundRobinPolicy
Code under question is in the DCAwareRoundRobinPolicy.distance:
https://github.com/scylladb/python-driver/blob/1c3cff88f4c0cf799efd2fbbd9443157dada09a9/cassandra/policies.py#L257-L264
This peace of the code creates a split in DCAwareRoundRobinPolicy.distance behavior for case when used_hosts_per_remote_dc is not empty:
- if node was added to the policy and remote dc bucket does not contain more than
used_hosts_per_remote_dcit will returnHostDistance.REMOTE - Otherwise it will return
HostDistance.IGNORE
Which means that if code uses .distance before on_add it will yield wrong result if this node fits into used_hosts_per_remote_dc.
Now you should always consider it when using lbp.distance.
Just one exampe that I managed to eyeball is in Cluster.on_add:
https://github.com/scylladb/python-driver/blob/1c3cff88f4c0cf799efd2fbbd9443157dada09a9/cassandra/cluster.py#L2067-L2089
So it won't finalize properly for node from remote dc and won't run self._prepare_all_queries(host) for it.
Just becase it is doing:
distance = self.profile_manager.distance(host)
if distance != HostDistance.IGNORED:
self._prepare_all_queries(host)
log.debug("Done preparing queries for new host %r", host)
self.profile_manager.on_add(host)
Instead of
self.profile_manager.on_add(host)
distance = self.profile_manager.distance(host)
if distance != HostDistance.IGNORED:
self._prepare_all_queries(host)
log.debug("Done preparing queries for new host %r", host)
As I understand it was done to signal upper logic to connect only to a certain number of remote dc nodes.
My proposal, instead of going around the code and fixing distance usage to make sure that everywhere it is used to decide whether node is worth of connecting to or not, to delegate this functionality to another method on lbp like should_connect(host):bool and use it there for that purpose, removing self._dc_live_hosts from distance calculation.
Hmm... by looking at the code it looks like you are right. But if that were the case, then the driver would never connect to nodes outside the current DC, right? I don't think that's the case, so maybe we are somehow misunderstanding the code - or it is the case and somehow no one noticed and no test checked this? We should check if that really is the case or if we misunderstood something before trying to fix it.
If we were to fix this: I'd be strongly against other method. There should be only one (or maybe a few, but a low number) paths were nodes are added to policy. Those should be fixed (which will improve the driver and possibly uncover other bugs by reading code) instead of complicating code.