Determinism when resolving atom alternative locations
Problem description:
In many mmCIF files, atoms exist equally in two locations, each with an occupancy of 0.5. For example, this screenshot from PDB ID 1adl:
Currently, when calling get_structure() with altloc=occupancy, there's some instability present in which altloc is chosen for a given residue in such situations. Sometimes the A conformation is chosen; sometimes the B conformation. From what I can tell, the lack of determinism may arise from the use of a set to store altlocs within the filter_highest_occupancy_altloc function here.
Although either choice is theoretically correct, such instability makes results much more challenging to debug.
Proposed solution:
Change the implementation of filter_highest_occupancy_altloc to be deterministic when two altlocs have equal occupancy; e.g., use a sorted list() rather than a set. Given the length of typical altloc sets, there will be minimal to no performance impact.
Hi, thanks for pinpointing the problem. I think your analysis is correct. If I remember correctly, the set ensures that each altloc ID only appears once, but there is no reason to not enveloping the set with sorted().
Would you like to create a fix?
Hi,
Yes, we definitely need the set() since we're storing one altloc ID per atom, and thus have duplicates within a residue. Enveloping with sorted() should be a quick fix. I'll make a PR shortly! Thanks for your quick response!
Do you still plan to create a PR? Otherwise I would prepare the fix.