covid19hub
covid19hub copied to clipboard
Package: fit discretised distributions to match expected distributional properties
Package: fit discretised distributions to match expected distributional properties
Description
Disclaimer: if this exists already, please feel free to point it out here.
What we need: fit a discretised distribution (distcrete
object) so that it matches user-specified properties e.g. quantiles (typically the median, and 25% / 75% quantiles)
The package should include documentation, simple reproducible examples, unit tests, and use good practices and standards recommended by RECON or similar.
Specs (option 1)
This version would be slightly easier to use, but more constrained as we explicitely force fitting using quantiles. The method for calculating distances from targets is likely pre-programmed in the function, so only a few options will be available (likely: least square, absolute values, etc).
-
inputs:
-
name
: a singlecharacter
indicating the family of a distribution; I would recommend usinggamma
andweibull
for testing purposes, as these are often the distributions we end up using; seename
in?distcrete::distcrete
-
target_quantiles
: anumeric
vector providing quantile used for fitting (between 0 and 1) -
target_values
: anumeric
vector providing quantiles values to be fitted to -
fit_method
: a singlecharacter
indicating which fitting method should be used (default to least squares) -
...
other arguments passed todistcrete::distcrete
-
-
outputs:
- a
distcrete
object with quantile properties matching the user-specified quantiles as closely as possible
- a
Specs (option 2)
This version would be more flexible targets and fitting are passed as user-specified functions.
-
inputs:
-
name
: a singlecharacter
indicating the family of a distribution; I would recommend usinggamma
andweibull
for testing purposes, as these are often the distributions we end up using; seename
in?distcrete::distcrete
-
summary
: afunction
with a single argumentx
calculating a distributional summary of its input -
distance
: afunction
with 2 argumentsx
andy
, calculating a dissimilarity betweenx
andy
, as a singlenumeric
, finite value -
...
other arguments passed todistcrete::distcrete
-
-
outputs:
- a
distcrete
object with quantile properties matching the user-specified quantiles as closely as possible
- a
Relevant related packages
Starting point
The following code generates a discretised weibull as a distcrete
object, generates a sample of 1e5 values, and calculates the sum of squared distances from reference quantiles:
## define targets
target_quantiles <- c(.25, .5, .75)
target_values <- c(3, 5, 7)
## make distcrete object
x <- distcrete::distcrete("weibull", shape = 2, scale = 4, w = 0.5, interval = 1)
## check quantiles
x_quantiles <- quantile(x$r(1e5), probs = target_quantiles)
sum_square <- sum((x_quantiles - target_values)^2)
Further tweaks will be putting this insided an optimiser. The interface should allow accessing all arguments of distcrete::distcrete
in addition to arguments used to specify targets and fitting
Impact
Many epidemiologically relevant delays are shared in the literature as means, medians, IQR, 95%CI etc. However, it can be very tricky to turn these into discretised distributions which we can then use for forecasting, estimating transmissibility etc. A simple utilitary function implementing this would be a great help for this.
Proposed timeline
- first viable product: as soon as possible; proposed: 25th April
- first stable version (CRAN release): less urgent, the main thing is to have this package, tested, documented, and publicly accessible
Focal point
@thibautjombart
I'm not sure that least squares is the best measure of goodness of fit. The documentation of fitdistrplus::fitdist() indicates that goodness of fit measures tend not to be appropriate for discrete distributions. Is it possible to write out a d, p and q function for the discredited gamma? If so, this could perhap fit into the fitdistrplus framework.
Note that I am talking of fitting to summaries of the distribution (e.g. the quantile), not the data, so the function to minimize will be a measure of distance between expected summaries and observed ones. Sorry if you got that, just making sure :)
Good point re fitdistrplus
. distcrete
objects are lists with $d
$r
$q
$p
components.
I'd second just adding some scaffolding around fitdistrplus... it's a pretty solidly designed package.
The rriskDistributions might also be related to this issue . This package proposes several build get."distr".par functions that will return the parameters of the "distr" distribution where the pth percentiles match with the quantiles q. https://cran.r-project.org/web/packages/rriskDistributions/index.html
Note that I am talking of fitting to summaries of the distribution (e.g. the quantile), not the data, so the function to minimize will be a measure of distance between expected summaries and observed ones. Sorry if you got that, just making sure :)
Do you have or could you get the raw data? There are a couple of approaches you could use if so. See https://cran.rstudio.com/web/packages/PDQutils/index.html and https://cran.r-project.org/web/packages/gld/.
Note that I am talking of fitting to summaries of the distribution (e.g. the quantile), not the data, so the function to minimize will be a measure of distance between expected summaries and observed ones. Sorry if you got that, just making sure :)
Do you have or could you get the raw data? There are a couple of approaches you could use if so. See https://cran.rstudio.com/web/packages/PDQutils/index.html and https://cran.r-project.org/web/packages/gld/.
Nope, this project really is for cases where we don't have data, only some distributional summaries.
@samclifford @johnurbanik I thought fitdistrplus
was only implementing procedures for fitting distributions to data, which would be a different use-case. I may not be up to date. Is there a specific function to create distributions matching distributional summaries you can point to?
The rriskDistributions might also be related to this issue . This package proposes several build get."distr".par functions that will return the parameters of the "distr" distribution where the pth percentiles match with the quantiles q. https://cran.r-project.org/web/packages/rriskDistributions/index.html
Yes, this is very close. I have just given it a try and it seems to work very nicely. Only caveat is the discretisation may actually change results a bit, so I am not sure if the optimal parametrisation to match the continuous distribution remains optimal to match against the discrete version. I'd guess it is not in the general case, but should be pretty close?
Example below for a GAmma with median 7 and IQR 3-11:
library(rriskDistributions)
library(distcrete)
## get parameters of a Gamma with median 7, IQR 3-11
pars <- get.gamma.par(p = c(.25, .5, .75), q = c(3, 7, 11), plot = FALSE)
#> The fitting procedure 'L-BFGS-B' was successful!
#> $par
#> [1] 1.3647791 0.1615613
#>
#> $value
#> [1] 0.0001973927
#>
#> $counts
#> function gradient
#> 42 42
#>
#> $convergence
#> [1] 0
#>
#> $message
#> [1] "CONVERGENCE: REL_REDUCTION_OF_F <= FACTR*EPSMCH"
pars
#> shape rate
#> 1.3647791 0.1615613
## check that it atcually works
samp_cont <- rgamma(1e5, shape = pars["shape"], rate = pars["rate"])
summary(samp_cont)
#> Min. 1st Qu. Median Mean 3rd Qu. Max.
#> 0.00099 3.16772 6.51986 8.46700 11.65060 76.12839
## discretise the distribution
x <- distcrete("gamma", shape = pars["shape"], rate = pars["rate"], interval = 1)
samp_disc <- x$r(1e5)
summary(samp_disc)
#> Min. 1st Qu. Median Mean 3rd Qu. Max.
#> 0.000 3.000 7.000 8.468 12.000 92.000
Created on 2020-04-14 by the reprex package (v0.3.0)
Still, given how far rriskDistributions gets us, I think this project will make a more meaningful contribution if we go for option 2, i.e. a more general-purpose package to create distcrete
objects matching any summary function.
This could be a starting point for option 2: https://github.com/thibautjombart/misc/blob/master/R/distcreate.R