Incorporate sample size in logistic growth model fit
https://github.com/nextstrain/ncov/blob/24f5256be44b61eb4b0e6ca2789205c49010e2e1/scripts/calculate_delta_frequency.py#L142
The logistic growth model fit is based on KDE estimates of variant frequency. In time intervals with few samples, these estimates can be rather poor, distorting the trend. One way to compensate for this is a weighted linear regression, which penalizes errors at different data points by different amounts. This is a fast substitute for a proper Bayesian approach, using a Gaussian/linear approximation.
The loss function would be

    Sum_i (f(x_i) - y_i)^2 / var(y_i | x_i)

instead of the usual Sum_i (f(x_i) - y_i)^2, where x = week, y = logit(frequency of the variant in that week), and var(y|x) is the variance due to finite sample size, rescaled by the squared derivative of the logit. Weighted least squares is implemented in both statsmodels and sklearn.
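To make this concrete, here is a minimal sketch with made-up weekly counts (k variant samples out of n total per week), using statsmodels' weighted least squares and the variance formula derived below:

```python
import numpy as np
import statsmodels.api as sm

# made-up weekly counts: k variant samples out of n total
weeks = np.arange(10)
n = np.array([200, 180, 30, 25, 150, 220, 40, 210, 190, 205])
k = np.array([  2,   4,  2,  3,  30,  80, 20, 160, 170, 195])

freq = k / n
y = np.log(freq / (1 - freq))                                # logit of the weekly frequency
var_y = k * (n - k) / (n + 1) * (1 / k + 1 / (n - k)) ** 2   # var(y|x), derived below

X = sm.add_constant(weeks)                                   # intercept + slope in logit space
fit = sm.WLS(y, X, weights=1.0 / var_y).fit()                # weights = inverse variances
print(fit.params)                                            # [intercept, growth rate per week]
```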
Back-of-the-envelope calculation of weights:
Finite sample correction: if you observe k out of n, then the variance of k/n is
k(n-k)/(n^2 (n+1))
Jacobian of the logit: the logit maps a frequency x to log(x) - log(1-x), with derivative 1/x + 1/(1-x). The variance of the logit of the frequency is therefore the sampling variance rescaled by the square of this derivative. Evaluating the derivative at the observed frequency k/n gives n/k + n/(n-k).
Put this all together to get a variance in the logit observation of approximately
var(y|x) = k(n-k)/(n+1) * (1/k + 1/(n-k))^2
When k is small and n is large, this is about 1/k; when k = n/2, this is about 4/n.
The appropriate weights for the regression would be the reciprocals of these variances: an error at small k would be weighted by about k, while an error at k ≈ n/2 would be weighted by about n/4.
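As a quick numeric check of those limits (made-up counts):

```python
import numpy as np

k = np.array([5, 5000])        # variant counts (made-up)
n = np.array([10000, 10000])   # total sample counts
var = k * (n - k) / (n + 1) * (1 / k + 1 / (n - k)) ** 2
print(var)      # ~[0.2, 4e-4], i.e. roughly 1/k and 4/n
print(1 / var)  # regression weights: roughly k and n/4
```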
(If n is small, you may want to use a beta(1/2,1/2) prior, which amounts to adding 0.5 to k and 1 to n. This is what the outbreak.info team does for error bars on their frequency calculations.)
While this is derived for integer sample counts, one could just as well plug the KDE-smoothed values of k (number of variant samples) and n (total number of samples) into the formula to get weights for the linear regression you're already doing on smoothed data.
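A small helper along these lines might look as follows (a sketch; k and n can be the KDE-smoothed counts, and the pseudocounts implement the beta(1/2,1/2) adjustment mentioned above):

```python
import numpy as np

def logit_variance(k, n, pseudocounts=True):
    """Approximate var(logit(k/n)) due to finite sampling.

    k and n may be non-integer (e.g. KDE-smoothed counts). With
    pseudocounts=True, 0.5 is added to k and 1 to n (beta(1/2,1/2) prior),
    which also keeps the result finite when k == 0 or k == n.
    """
    k = np.asarray(k, dtype=float)
    n = np.asarray(n, dtype=float)
    if pseudocounts:
        k, n = k + 0.5, n + 1.0
    return k * (n - k) / (n + 1) * (1 / k + 1 / (n - k)) ** 2

# regression weights are then the reciprocal variances:
# weights = 1.0 / logit_variance(k_smoothed, n_smoothed)
```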
An alternative would be to properly maximize the likelihood. I played with this a few weeks ago, but I think it needs to be done numerically. Here `observations` are the time points (`datetime.toordinal()` for simplicity) at which the variant is observed, and `all_time_points` includes all observations.
```python
import numpy as np
from scipy.optimize import minimize
from scipy.stats import linregress

def cost(x, obs, t):
    """Negative log-likelihood of p(variant at t) = sigmoid(x[0]*(t - x[1]))."""
    st_obs = np.clip(x[0] * (obs - x[1]), -100, 100)
    st_all = np.clip(x[0] * (t - x[1]), -100, 100)
    return -np.sum(st_obs) + np.sum(np.log(1 + np.exp(st_all)))

def jac(x, obs, t):
    """Gradient of the cost with respect to the rate x[0] and the midpoint x[1]."""
    st_all = np.clip(x[0] * (t - x[1]), -100, 100)
    est = np.exp(st_all)
    denom = est / (1 + est)
    return np.array([-np.sum(obs - x[1]) + np.sum((t - x[1]) * denom),
                     x[0] * (len(obs) - np.sum(denom))])

# crude initial guess: linear regression on coarsely binned logit frequencies
y_obs, bins = np.histogram(observations, bins=4)
y_all, _ = np.histogram(all_time_points, bins=bins)
logit = np.log((y_obs + 1) / (y_all - y_obs + 1))
bc = 0.5 * (bins[1:] + bins[:-1])
res = linregress(bc, logit)

# clip the midpoint guess to a plausible range of ordinal dates
sol = minimize(cost, [res.slope, np.clip(-res.intercept / res.slope, 737000, 738500)],
               args=(observations, all_time_points), jac=jac,
               method="BFGS",  # any gradient-based scipy optimizer should do
               tol=1e-5)
```
Minimizing this takes a second or so...
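If I read the cost function right, the fitted model is p(t) = 1/(1 + exp(-x[0]*(t - x[1]))), so the solution can be read back directly as a per-day growth rate (in logit units) and the ordinal date at which the variant crosses 50% frequency:

```python
import datetime

rate, midpoint = sol.x
print(f"logistic growth rate: {rate:.3f} per day (logit units)")
print(f"50% crossing date: {datetime.date.fromordinal(int(round(midpoint)))}")
```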