augur
Standardize naming of strains/sequences/records
from #750
Maybe for later PRs: Standardize naming of "strains", "sequences", and "records"
This is something that's mildly bugged me while going through the codebase. Would love to standardize and this can probably be done without changing the experience for users, since it's mostly about internal variable names and documentation.
If I had to pick from the 3, my vote is for `record`, for the following reasons:
- I think we want this term to refer to a data point that is either metadata, sequence, or a combination of both.
- strain seems too specific to the virology domain, and within that domain there's already some nuance in strain/mutant/variant (?)
- sequence is already a term for the actual DNA/RNA sequence.
- Biopython has a `SeqRecord`, meant for sequence (+ optional metadata). This sounds close to our use case. In practice, `SeqRecord` instances are often variables named `record`.
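To make the proposal concrete, here is a minimal sketch of what a "record" in this sense could look like: a data point that carries metadata, a sequence, or both. The `Record` class, its fields, and the sample names are hypothetical and only for illustration; nothing here is existing augur or Biopython code.

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class Record:
    """Hypothetical container for one data point (not an augur class)."""
    name: str
    metadata: dict = field(default_factory=dict)  # empty when only a sequence is known
    sequence: Optional[str] = None                # None when only metadata is known

    def has_sequence(self) -> bool:
        return self.sequence is not None


records = [
    Record("sample_A", {"region": "Europe"}, "ACGTACGT"),  # metadata and sequence
    Record("sample_B", {"region": "Asia"}),                # metadata only
    Record("sample_C", sequence="ACGTTCGT"),               # sequence only
]

# Filtering on metadata and filtering on sequence presence both operate on records.
non_european = [r for r in records if r.metadata.get("region") != "Europe"]
with_sequence = [r for r in records if r.has_sequence()]
```

Under this naming, `sequence` stays reserved for the actual DNA/RNA string, while `record` covers the data point as a whole.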
I agree it would be nice to have some standardization there, but we may want to retain using a mix of the three as they can convey slightly different things. For me, intuitively, 'record' would likely imply metadata and not necessarily the sequence, whereas 'sequence' implies more that this is referencing the actual DNA sequence. For example, saying "we'll exclude all records with more than 10 mutations" would, to me, read very strangely. The reverse is a little less strict to my ear (eye?): "We'll exclude all sequences with region 'Europe'" would not raise my eyebrows. This may come from past experience - I'm not unused to working with data where I have more clinical records or diagnostic reports, etc, and only some of these have attached sequences. Given what I do, I generally am only concerned with the ones with sequences, though often might do basic summary/manipulation on the entire set to give an overview.
While I think it's probably not worth changing `strain` in general in our internal coding/columns (this would likely break a lot of things for a lot of people), I would be happy to avoid this somewhat in documentation to be more precise, since strain can also mean a distinct pathogen group ("a new strain of X").
Broadly agree with @emmahodcroft here.
`record` is also as generic as it gets, second only to `data`, so I'm reluctant to prefer that.
If we do choose to further converge on terms (whatever the terms are), I recommend we do it over time as we touch parts of the codebase for other work rather than try to make a sweeping change all at once. That is, make the preferred terms a part of our (informal) codebase "policy" for new/changed code. This avoids creating new work, is less effort since it's incremental, and is less likely to introduce accidental breakage since it's integrated into related work that would be getting tested/reviewed more closely than a big find/replace would.