GerryChain
GerryChain copied to clipboard
The same updater on Partitions with different units gives incorrect Elections results
Consider this setup, where we try to generate two GerryChain Partitions, one built from the North Carolina Census block dual graph, the other from the VTD dual graph.
from gerrychain import Graph, Partition, Election
graph_blocks = Graph.from_json("nc_bcks.json")
graph_vtds = Graph.from_json("NC_full_vote_vtd.json")
We want to track the outcome of any election for which we have data on both dual graphs. For example, we can use the 2020 Governor's election, where the Democratic and Republican candidates' vote totals are stored in the CooperD_20G_Governor
and ForestR_20G_Governor
columns, respectively. We can check that the vote totals for each candidate are the same in both dual graphs:
import math
for party_column in ["CooperD_20G_Governor", "ForestR_20G_Governor"]:
block_sum = sum([graph_blocks.nodes[n][party_column] for n in graph_blocks.nodes])
vtd_sum = sum([graph_vtds.nodes[n][party_column] for n in graph_vtds.nodes])
assert math.isclose(block_sum, vtd_sum)
But if we use an Elections updater to track this election in a GerryChain Partition, votes are lost in the VTD partition! The final assert
statement will return an error
updaters = {"GOV20": Election("GOV20", {"Dem":"CooperD_20G_Governor", "Rep":"ForestR_20G_Governor"})}
# this bug is independent of the partition assignment, so here we assign every node to district 0
partition_blocks = Partition(graph_blocks, {n:0 for n in graph_blocks.nodes}, updaters)
partition_vtds = Partition(graph_vtds, {n:0 for n in graph_vtds.nodes}, updaters)
assert math.isclose(partition_blocks["GOV20"].totals[0],partition_vtds["GOV20"].totals[0])
---------------------------------------------------------------------------
AssertionError Traceback (most recent call last)
/var/folders/nc/1qx970qn7p17yq9cmwrq9fmw0000gn/T/ipykernel_96944/2002769006.py in <module>
----> 1 assert math.isclose(partition_blocks["GOV20"].totals[0],partition_vtds["GOV20"].totals[0])
AssertionError:
This error is propagated out so it will be most visible to the user as:
partition_blocks["GOV20"].percent("Dem") == 0.5228991188220035
partition_blocks["GOV20"].percent("Rep") == 0.5408567558910189
...where only the blocks Partition is correct.
This bug disappears if you pass different (though identical) updaters to each partition:
updaters_blocks = {"GOV20": Election("GOV20", {"Dem":"CooperD_20G_Governor", "Rep":"ForestR_20G_Governor"})}
updaters_vtds = {"GOV20": Election("GOV20", {"Dem":"CooperD_20G_Governor", "Rep":"ForestR_20G_Governor"})}
partition_blocks = Partition(graph_blocks, {n:0 for n in graph_blocks.nodes}, updaters_blocks)
partition_vtds = Partition(graph_vtds, {n:0 for n in graph_vtds.nodes}, updaters_vtds)
assert math.isclose(partition_blocks["GOV20"].totals[0],partition_vtds["GOV20"].totals[0])
@apizzimenti can explain further why this is happening, and the fix is due to him!
This bug is a subtle one and should only crop up iff
- the vertices in each dual graph (in the above example corresponding to blocks and VTDs) are integer indexed such that the vertex indices of one dual graph are a subset of the other dual graph's vertex indices;
- the same dictionary of updaters is used for each
Partition
.
The Partition
class, when initialized, checks whether an argument has been passed to the parent
parameter. If not, it assumes it's the first Partition
in a series of Partition
s and calls the _first_time()
private method. Among other things, this method assigns the passed dictionary of user-defined updaters to the Partition
instance's Partition.updaters
property, but does so in a sneakily cheap way: while it .copy()
s the dictionary of default updaters before assignment, it does not .copy()
the dictionary of user-defined updaters. As a result, the dictionary of updaters is shared across whichever Partition
objects were instantiated with them.
However, unintentional (and undesirable) reference-sharing across Partition
instances is not, in itself, the cause of this bug. Partition
s maintain an internal assignment which maps dual graph vertex indices to district numbers; broadly, updaters (like Tally
s and Election
s) use this mapping to compute districts' statistics. To save time, DataTally
instances only iterate completely over this assignment once, then recalculate statistics based on which vertices have moved (rather than recompute entirely). The keys in this mapping are the identifiers specified when the dual graph was created – typically, these are integers or GEOID
s. In the above example, both dual graphs' vertices are integer-indexed, so the VTD indices are a subset of the block indices. When the Election
updaters are initialized, they internally create DataTally
objects to tally up votes for the specified parties based on the current Partition
's assignment. Because these Election
s are initialized only once and the Partition
is not iterated – i.e. the Partition
's assignment isn't modified by flipping vertices from one district to another – the Election
updaters are evaluated multiple times on different, similarly-indexed assignments. Consequently, we get conflicting/overlapping/missing district assignments, causing funky or impossible values like the ones above.
The most sensible solution is to deep-copy dictionaries of updaters when instantiating Partition
s.
We should throw a UserWarning
when this occurs, I think (should be easy to detect).
So throw a UserWarning
and do not deep-copy dictionaries? Or do both?
Both, I think.
Love it – I'll put a PR and some tests in over the coming week and close this issue when I'm done.
It looks like this issue has been resolved in #375, so I am going to close this issue. Please let me know if I am mistaken and we need to reopen it.