Store titer records in TiterRecord class and export in structured JSON format
The goals of this proposal are:
- remove need for an eval call when loading titers from JSON in augur’s process step
- add support for multiple user-defined titer attributes such as the passaging details available from the CDC titers
- improve documentation of the JSON format for titers by explicitly naming fields instead of relying on a slightly ambiguous dictionary format
- encapsulate logic about individual titer records along the lines of a SeqRecord from Biopython, such that each record knows how to export itself to JSON and report other details about itself
The current JSON format looks like this:
"titers": {
    "('A/Acores/11/2013', ('A/Alabama/5/2010', 'F27/10'))": [
        80.0
    ],
    "('A/Acores/11/2013', ('A/Athens/112/2012', 'F16/12'))": [
        640.0
    ]
}
In this format, each record is a key/value pair where the key is a nested tuple, the test strain followed by a (reference strain, serum id) pair, that has been converted to a string for JSON compatibility. The value of each pair is a list of floating point values corresponding to raw titer measurements.
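To make the cost of the string keys concrete, here is a minimal sketch of how such keys have to be turned back into tuples when the JSON is loaded. It uses ast.literal_eval from the standard library as a safer stand-in for eval. The helper name parse_legacy_titers is hypothetical and is not augur's actual loading code, and the example data is wrapped in outer braces only to make it valid JSON.

```python
import ast
import json

# Legacy export from the example above, wrapped in braces so it parses as JSON.
legacy_json = """
{
  "titers": {
    "('A/Acores/11/2013', ('A/Alabama/5/2010', 'F27/10'))": [80.0],
    "('A/Acores/11/2013', ('A/Athens/112/2012', 'F16/12'))": [640.0]
  }
}
"""


def parse_legacy_titers(titers):
    """Convert stringified keys back into (test, (reference, serum)) tuples.

    Hypothetical helper for illustration only.
    """
    parsed = {}
    for key, values in titers.items():
        # literal_eval only accepts Python literals, so unlike eval it
        # cannot execute arbitrary code embedded in the key.
        test_strain, (reference_strain, serum) = ast.literal_eval(key)
        parsed[(test_strain, (reference_strain, serum))] = values
    return parsed


titers = parse_legacy_titers(json.loads(legacy_json)["titers"])
print(titers[("A/Acores/11/2013", ("A/Alabama/5/2010", "F27/10"))])
```

Even with literal_eval, the structure of each key is only implied by its position in the tuple, which is the ambiguity the proposal aims to remove.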
The new format should be a list of dictionaries where each dictionary corresponds to a TiterRecord instance in JSON format. Each entry in the TiterRecord should be explicitly named to remove ambiguity about the data and to allow additional fields to be added in the future. For example, the following format can support inclusion of the optional “source” and “assay” fields that were originally omitted from each record.
"titers": [
    {
        "assay": "hi",
        "test_strain": "A/Acores/11/2013",
        "reference_strain": "A/Alabama/5/2010",
        "serum": "F27/10",
        "source": "NIMR_Sep2013_7-11.csv",
        "values": [
            80.0
        ]
    },
    {
        "assay": "hi",
        "test_strain": "A/Acores/11/2013",
        "reference_strain": "A/Athens/112/2012",
        "serum": "F16/12",
        "source": "NIMR_Sep2013_7-11.csv",
        "values": [
            640.0
        ]
    }
]
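As a rough illustration of what loading this format could look like, the sketch below reads the list with the json module alone and checks that each entry carries the four required fields. The function load_titer_dicts and the REQUIRED_FIELDS constant are hypothetical names introduced for this example, and the outer braces are added only so the fragment parses as a JSON document.

```python
import json

# The proposed format from the example above, wrapped in braces to be valid JSON.
new_json = """
{
  "titers": [
    {"assay": "hi", "test_strain": "A/Acores/11/2013",
     "reference_strain": "A/Alabama/5/2010", "serum": "F27/10",
     "source": "NIMR_Sep2013_7-11.csv", "values": [80.0]},
    {"assay": "hi", "test_strain": "A/Acores/11/2013",
     "reference_strain": "A/Athens/112/2012", "serum": "F16/12",
     "source": "NIMR_Sep2013_7-11.csv", "values": [640.0]}
  ]
}
"""

# Fields every record must name explicitly; everything else is optional.
REQUIRED_FIELDS = ("test_strain", "reference_strain", "serum", "values")


def load_titer_dicts(text):
    """Parse titer records from JSON text without any eval step.

    Hypothetical helper for illustration only.
    """
    records = json.loads(text)["titers"]
    for record in records:
        missing = [field for field in REQUIRED_FIELDS if field not in record]
        if missing:
            raise ValueError("titer record is missing fields: %s" % ", ".join(missing))
    return records


titer_dicts = load_titer_dicts(new_json)
print(titer_dicts[0]["reference_strain"], titer_dicts[0]["values"])
```

Because every field is named, optional attributes such as "source" can be present or absent without changing how the required fields are read.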
The records from this JSON format map directly to attributes of the TiterRecord Python class. In addition to these attributes, the TiterRecord class would expose the following methods.
class TiterRecord(object):
    def __init__(self, test_strain, reference_strain, serum, values, **kwargs):
        """Builds a new TiterRecord instance.

        Args:
            test_strain (str): name of the test strain
            reference_strain (str): name of the reference strain
            serum (str): name of the serum
            values (list): a list of raw floating point titer measurements
            kwargs (dict): additional attributes of the TiterRecord instance

        Returns:
            TiterRecord: an instance of the record class populated with the given strains, serum, and values

        >>> record = TiterRecord(test_strain="strain_a", reference_strain="strain_b", serum="serum_a", values=[80.0], assay="hi")
        >>> record.test_strain
        'strain_a'
        >>> record.values[0]
        80.0
        >>> hasattr(record, "assay")
        True
        >>> record.assay
        'hi'
        >>> hasattr(record, "source")
        False
        >>> record_dict = {"test_strain": "strain_a", "reference_strain": "strain_b", "serum": "serum_a", "values": [80.0], "assay": "hi"}
        >>> record = TiterRecord(**record_dict)
        >>> record.test_strain
        'strain_a'
        """
        pass

    def to_dict(self):
        """Returns the current instance as a dictionary.

        Returns:
            dict: attributes of the current instance as key/value pairs

        >>> record = TiterRecord(test_strain="strain_a", reference_strain="strain_b", serum="serum_a", values=[80.0], assay="hi")
        >>> sorted(record.to_dict().items())
        [('assay', 'hi'), ('reference_strain', 'strain_b'), ('serum', 'serum_a'), ('test_strain', 'strain_a'), ('values', [80.0])]
        """
        pass
The primary distinction between a TiterRecord and a dict is that the former class has required attributes. TiterRecord instances will not know how to export themselves into the dictionary format used by augur with tuple keys and list values; the TiterModel class should know how to convert a list of TiterRecord instances into that format. The TiterModel class should also know how to build a list of TiterRecord instances from a tab-delimited file of measurements.
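One possible way to fill in the stubs and the two conversions described above is sketched below. This is an assumption-laden illustration, not augur's implementation: the module-level functions records_to_legacy_dict and records_from_tsv are hypothetical stand-ins for what the TiterModel class would do, and the TSV column names (test_strain, reference_strain, serum, titer, assay) are assumed for the example rather than taken from an actual titer file.

```python
import csv
import io


class TiterRecord(object):
    """Minimal sketch of the proposed record class."""

    def __init__(self, test_strain, reference_strain, serum, values, **kwargs):
        # Required attributes are positional; omitting one raises TypeError.
        self.test_strain = test_strain
        self.reference_strain = reference_strain
        self.serum = serum
        self.values = values
        # Optional user-defined attributes such as "assay" or "source".
        for name, value in kwargs.items():
            setattr(self, name, value)

    def to_dict(self):
        # Copy the instance attributes so callers cannot mutate the record.
        return dict(vars(self))


def records_to_legacy_dict(records):
    """Hypothetical helper: build the tuple-keyed dict with list values."""
    legacy = {}
    for record in records:
        key = (record.test_strain, (record.reference_strain, record.serum))
        # Repeated measurements for the same key are collected in one list.
        legacy.setdefault(key, []).extend(record.values)
    return legacy


def records_from_tsv(handle):
    """Hypothetical helper: one record per row of a tab-delimited file.

    Assumes a header row with test_strain, reference_strain, serum, and
    titer columns; any other columns become optional attributes.
    """
    records = []
    for row in csv.DictReader(handle, delimiter="\t"):
        titer = float(row.pop("titer"))
        records.append(TiterRecord(values=[titer], **row))
    return records


# Small in-memory TSV with assumed column names.
tsv_text = "\n".join([
    "\t".join(["test_strain", "reference_strain", "serum", "titer", "assay"]),
    "\t".join(["A/Acores/11/2013", "A/Alabama/5/2010", "F27/10", "80.0", "hi"]),
    "\t".join(["A/Acores/11/2013", "A/Athens/112/2012", "F16/12", "640.0", "hi"]),
])

records = records_from_tsv(io.StringIO(tsv_text))
legacy = records_to_legacy_dict(records)
print(sorted(records[0].to_dict().items()))
print(legacy[("A/Acores/11/2013", ("A/Athens/112/2012", "F16/12"))])
```

Keeping the legacy conversion outside the record class matches the division of responsibility described above: a record only knows its own fields, while the model decides how records are keyed and grouped.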
@huddlej I'm reviewing open Augur issues to try to identify some Augur-related work I could do this year. Is this issue still relevant and desirable to do?
@genehack Thanks for checking! This issue is still relevant and desirable, although only for a small group of users (maybe only me), so it is still relatively low priority.
If I were to address this issue today, I would probably opt to load titer TSVs into a pandas data frame, which provides the basic record-oriented structure I was hoping for originally.
I would also put off this change until I had added better test coverage of the titer models code, to give me more confidence that refactoring the data representation didn't break anything.