ncov
Standardise parsing of missing / unknown metadata values
Context
Many scripts need to check if a value is "valid" before proceeding. These checks tend to be one-offs and therefore slightly different. This makes it hard to document / understand what to use for unknown / missing data. For instance, the construction of sequence recency uses conditionals:
    if 'date_submitted' in d and d['date_submitted'] and d['date_submitted'] != "undefined":
whereas our diagnostic script uses a try/except approach for the same field:
    try:
        return (datetime.strptime(x,"%Y-%m-%d") - timedelta(weeks=minus_weeks)).toordinal()
    except:
        return np.nan
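This results in different behavior if, for example, we had a date submission value of "?": the conditional accepts it as a valid value (it is non-empty and not "undefined"), whereas the try/except fails to parse it and treats it as missing by returning NaN.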
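To make the divergence concrete, here is a minimal sketch that wraps the two checks quoted above in standalone functions and runs both on "?". The function names `accepts_by_conditional` and `ordinal_or_nan` are hypothetical, `float("nan")` stands in for `np.nan` to keep the example dependency-free, and the bare `except` is narrowed to the exceptions `strptime` raises for bad input.

```python
from datetime import datetime, timedelta
import math


def accepts_by_conditional(d):
    """Mirror of the conditional check used for sequence recency."""
    return bool(
        'date_submitted' in d
        and d['date_submitted']
        and d['date_submitted'] != "undefined"
    )


def ordinal_or_nan(x, minus_weeks=0):
    """Mirror of the try/except check used in the diagnostic script."""
    try:
        return (datetime.strptime(x, "%Y-%m-%d") - timedelta(weeks=minus_weeks)).toordinal()
    except (ValueError, TypeError):
        return float("nan")


record = {"date_submitted": "?"}

# The conditional considers "?" a usable value...
print(accepts_by_conditional(record))

# ...while the try/except considers it missing.
print(math.isnan(ordinal_or_nan(record["date_submitted"])))
```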
Potential solutions
The first step would be to document the checks we use in our various scripts (including augur). Ideally there would be a reusable function which each script can use.
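As a starting point for discussion, a reusable helper might look something like the sketch below. Everything in it is an assumption rather than existing ncov or augur code: the names `MISSING_VALUES`, `is_missing` and `parse_date` are hypothetical, and the set of placeholder strings is illustrative only. The real list would come out of the documentation step described above.

```python
from datetime import datetime
from typing import Optional

# Hypothetical placeholder values, compared case-insensitively after stripping
# whitespace. The actual list should be agreed on after auditing the scripts.
MISSING_VALUES = frozenset({"", "?", "undefined", "unknown", "none", "nan", "n/a"})


def is_missing(value) -> bool:
    """Return True if value is None or one of the known placeholder strings."""
    if value is None:
        return True
    return str(value).strip().lower() in MISSING_VALUES


def parse_date(value, fmt: str = "%Y-%m-%d") -> Optional[datetime]:
    """Parse a metadata date, returning None for missing or unparseable values."""
    if is_missing(value):
        return None
    try:
        return datetime.strptime(str(value).strip(), fmt)
    except ValueError:
        # Covers values that are present but not a complete date, e.g. "2020-XX-XX".
        return None
```

With a helper like this, both call sites could share one definition of "missing":

```python
d = {"date_submitted": "?"}
if not is_missing(d.get("date_submitted")):
    submitted = parse_date(d["date_submitted"])
```

Returning `None` rather than NaN is a design choice made here for simplicity; scripts that need a numeric sentinel could convert at the call site.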