Add scorefile output convenience function for PyRosetta
At the recent Bootcamp, several people (namely @LouisaMe09 and @zyajahuggan) expressed interest in being able to create scorefile info from PyRosetta. This PR adds a convenience function, pyrosetta.poses_to_scorefile(), which allows easy creation of JD2-like scorefiles through the PyRosetta interface.
I've also added a function (pyrosetta.io.get_scorefile_info()) which gets what would be reported to the scorefile as a Python dictionary.
This PR also cleans up some of the scorefile-writing interface at the C++ level.
IIRC, there are similar functions in the pyrosetta.distributed module. May be nice to make sure they do the same thing/call the same code under the hood.
From what I can tell from a quick grep through the code, the use case for pyrosetta.distributed scorefile handling is quite a bit more complicated, and doesn't seem to be aimed at a simple scorefile output. I'm not sure how much overlap there is. (Thoughts, @klimaj ?)
Probably a bit outside the exact scope of this PR, but it feels like a good moment to bring up:
One thing I’ve repeatedly found myself doing over the years is parsing Rosetta score files into a more structured format—usually JSON. Maybe when we write a score file, we could also write a companion
Thoughts?
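For reference, the kind of parsing described above can be sketched in a few lines. This is only an illustration: it assumes the conventional whitespace-delimited layout in which every score line starts with `SCORE:` and the first such line holds the column names. The function name `parse_scorefile` and the sample data are made up for this example and are not part of this PR.

```python
import json


def parse_scorefile(text):
    """Parse whitespace-delimited SCORE: lines into a list of dicts."""
    header = None
    records = []
    for line in text.splitlines():
        if not line.startswith("SCORE:"):
            continue  # skip SEQUENCE: and any other non-score lines
        fields = line.split()[1:]  # drop the leading "SCORE:" token
        if header is None:
            header = fields  # first SCORE: line holds the column names
            continue
        if fields == header:
            continue  # ignore a repeated header line
        record = {}
        for name, value in zip(header, fields):
            try:
                record[name] = float(value)
            except ValueError:
                record[name] = value  # non-numeric, e.g. the description column
        records.append(record)
    return records


# Made-up sample data in the assumed layout.
example = """SEQUENCE:
SCORE: total_score fa_atr description
SCORE: -12.500 -30.250 pose_0001
SCORE: -10.000 -28.000 pose_0002
"""

records = parse_scorefile(example)
print(json.dumps(records, indent=2))
```

A tag that happens to look like a number would be converted to a float by this sketch, so a real parser would probably treat the description column specially.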
There are quite a few mechanisms to get a scores dictionary in the pyrosetta.distributed module, but I don't think they actually dump them to disk. That can easily be accomplished with the json module, but this PR also intends to support JD2-style scorefiles, so I think it's a bit different. Also note that the pyrosetta.distributed methods obtain scores from the (recently added) Pose.cache dictionary, whereas this PR obtains scores from:
rosetta.protocols.jd2.get_string_real_pairs_from_current_job()
rosetta.protocols.jd2.get_string_string_pairs_from_current_job()
- These are really good to see added, because existing scorefile machinery does not include this data, to which some protocols [if I'm not mistaken, like InterfaceAnalyzerMover and ShapeComplementarityFilter] write data
rosetta.core.io.raw_data.ScoreMap.add_arbitrary_score_data_from_pose()
rosetta.core.io.raw_data.ScoreMap.add_arbitrary_string_data_from_pose()
- I do not recommend using these methods, since the data is prone to being clobbered silently -- not only can data get clobbered within each method itself, but it looks like data can get clobbered in the scorefile_info() subfunction in this PR. For this reason, the Pose.cache dictionary has been developed to warn the user about any clobbered data. I've outlined the data override precedences here, here, and here.
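To make the clobbering concern concrete, here is a minimal pure-Python sketch of merging several score sources into one dictionary while warning whenever a later source overrides an earlier value. It is not the Pose.cache implementation: the function `merge_scores`, the source labels, and the precedence order (later wins) are all assumptions made up for illustration.

```python
import warnings


def merge_scores(sources):
    """Merge (label, dict) pairs in order; later sources take precedence."""
    merged = {}
    origin = {}  # remembers which source each key currently comes from
    for label, scores in sources:
        for key, value in scores.items():
            if key in merged and merged[key] != value:
                # Without this warning the override would be silent.
                warnings.warn(
                    f"'{key}' from {origin[key]} ({merged[key]!r}) "
                    f"overridden by {label} ({value!r})"
                )
            merged[key] = value
            origin[key] = label
    return merged


# Hypothetical sources that both define "total_score".
sources = [
    ("energies", {"total_score": -12.5, "fa_atr": -30.25}),
    ("extra_scores", {"total_score": -11.0, "sc_value": 0.71}),
]

with warnings.catch_warnings(record=True) as caught:
    warnings.simplefilter("always")
    merged = merge_scores(sources)

print(merged)
print([str(w.message) for w in caught])
```

A plain `dict.update()` over the same sources would produce the same merged result but drop the first `total_score` without any notice, which is the failure mode described above.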
I think this PR is similar to the idea of the PyJobDistributor's output_scorefile function; however, that uses another method that is prone to clobbering.
Finally, PyRosettaCluster has its own job distributor code, which dumps a scorefile after distributing tasks, and is somewhat dissimilar to the motive of this PR (and is more complex than the implementation in this PR).
As another quick comment, arbitrary Python types can now be serialized into strings using the Pose.cache dictionary (for example, pose.cache["foo"] = complex(1, 2)), but this PR (in its current form) does not deserialize them using the Pose.cache machinery or by implementing the PoseScoreSerializer.maybe_decode(value) method separately.
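The round trip in question looks roughly like the sketch below. This is a generic illustration of the encode/decode idea, not the actual PoseScoreSerializer code: the `PREFIX` tag, the pickle-plus-base64 encoding, and both helper functions are assumptions invented for this example.

```python
import base64
import pickle

PREFIX = "[ENCODED]"  # hypothetical tag; not the marker PyRosetta uses


def maybe_encode(value):
    """Leave plain strings and numbers alone; encode anything else as a tagged string."""
    if isinstance(value, (str, int, float)):
        return value
    payload = base64.b64encode(pickle.dumps(value)).decode("ascii")
    return PREFIX + payload


def maybe_decode(value):
    """Decode tagged strings back into objects; pass everything else through."""
    if isinstance(value, str) and value.startswith(PREFIX):
        # Only unpickle data from a trusted source.
        return pickle.loads(base64.b64decode(value[len(PREFIX):]))
    return value


stored = maybe_encode(complex(1, 2))  # what would end up in the string score data
restored = maybe_decode(stored)
print(type(stored).__name__, restored)
```

A reader that skips the decode step would report the encoded string itself in the scorefile rather than the original value, which is the gap pointed out above.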
@roccomoretti instead of retrieving data from rosetta.core.io.raw_data.ScoreMap.add_arbitrary_score_data_from_pose() and rosetta.core.io.raw_data.ScoreMap.add_arbitrary_string_data_from_pose(), you may want to consider retrieving all data from Pose.cache in a single call, which retrieves that data with appropriate clobber warnings and does data deserialization automatically. Also note that your implementation of rosetta.core.io.raw_data.ScoreMap.add_energies_data_from_scored_pose() would also not be necessary since the Pose.cache dictionary retrieves pose energy data from pose.energies().active_total_energies().items(), which I think is the same data (but you'd have to double check).
One thing I’ve repeatedly found myself doing over the years is parsing Rosetta score files into a more structured format—usually JSON.
Jared added the ability to make JSON-formatted scorefiles from command-line Rosetta, via the -out:file:scorefile_format json option.
This PR uses that framework to allow you to output either the conventional (default) or JSON formats (with use_json=True). There's also the convenience function to get the same data as a Python dict.