p-value missing for `fit_power_law()`
What happens?
fit_power_law() no longer returns a KS.p component.
To Reproduce
igraph 1.6.0
options(conflicts.policy = list(warn = FALSE))
library(igraph)
g <- sample_pa(1000) # increase this number to have a better estimate
d <- degree(g, mode = "in")
fit_power_law(d + 1, 10)
#> $continuous
#> [1] FALSE
#>
#> $alpha
#> [1] 2.738748
#>
#> $xmin
#> [1] 10
#>
#> $logLik
#> [1] -62.25711
#>
#> $KS.stat
#> [1] 0.1037705
#>
#> $KS.p
#> [1] 0.986668
Created on 2024-01-23 with reprex v2.1.0
igraph 2.0.0
options(conflicts.policy = list(warn = FALSE))
library(igraph)
g <- sample_pa(1000) # increase this number to have a better estimate
d <- degree(g, mode = "in")
fit_power_law(d + 1, 10)
#> $continuous
#> [1] FALSE
#>
#> $alpha
#> [1] 3.226469
#>
#> $xmin
#> [1] 10
#>
#> $logLik
#> [1] -63.90509
#>
#> $KS.stat
#> [1] 0.07306791
Created on 2024-01-23 with reprex v2.1.0
Are there any news on this issue? I have seen that you declared this bug as "breaking change" of igraph 2.0.0 in your NEWS.md ("fit_power_law() no longer returns a KS.p component.")
However, without KS.p component, the function fit_power_law() is useless, at least, in the situations in which I need to use this function for power-law fitting. So, declaring this a "breaking change" is not a solution to the problem. Did you already investigate what had happened and why the component is missing now? Could you please fix this problem as soon as possible in your next release? I guess I am not the only user who needs the KS.p component of power-law fitting.
Thanks in advance!
The underlying C core function no longer returns this component. My attempts to use the new igraph_plfit_result_calculate_p_value() didn't give satisfactory results, so I gave up.
@jhollway: You mentioned in https://github.com/stocnet/migraph/issues/284 that you could compute the p-value yourself. Would you like to share a pointer?
@szhorvat @ntamas: Appreciate your support here too.
In theory it should be simple:
-
igraph_power_law_fit()gives you anigraph_plfit_result_tobject, which is the result of the fitting procedure. -
igraph_igraph_plfit_result_calculate_p_value()requires the result object as an input, along with a precision (say, 0.001), and then it gives you the p-value in an output argument.
I remember testing with the example given here, and the p-value came out as zero.
@Antonov548: can you please give it another try?
That is an entirely valid p-value, depending on the input.
Try data/discrete_data.txt from https://github.com/ntamas/plfit as an input. When using the exact p-value calculation method, the p-value should turn out to be around 0.63 (alpha = 2.5833, xmin = 2, L = -9155.6).
I ran with the example in the OP, there the p-value comes out as 0.98. Are you expecting consistency in the p-values between igraph 1.6.0 and igraph >= 2.0.0?
No, the calculation changed between igraph 0.9 and 0.10. The previous calculation was faster but poorer.
See https://github.com/igraph/igraph/blob/master/CHANGELOG.md :
igraph_power_law_fit() does not calculate the p-value automatically any more because the previous estimation method did not match the results from the original paper of Clauset, Shalizi and Newman (2009) and the implementation of the method outlined in the paper runs slower than the previous naive estimate. A separate function named igraph_plfit_result_calculate_p_value() is now provided for calculating the p-value. The automatic selection of the x_min cutoff also uses a different method than earlier versions. As a consequence, results might be slightly different if you used tests where the x_min cutoff was selected automatically. The new behaviour is now consistent with the defaults of the underlying plfit library.
I ran with the example in the OP, there the p-value comes out as 0.98
I can get 0.98 only with the current implementation of plfit if I use the older approximation method and also force the system to treat the incoming sample as continuous instead of discrete.
Discrete input, exact method
❯ src/plfit ../data/discrete_data.txt -p exact
../data/discrete_data.txt:
Discrete MLE
alpha = 2.58330
xmin = 2.00000
L = -9155.61701
D = 0.00433
p = 0.62280
Continuous input, exact method
❯ src/plfit ../data/discrete_data.txt -p exact -c
../data/discrete_data.txt:
Continuous MLE
alpha = 2.75401
xmin = 25.00000
L = -295.89683
D = 0.05713
p = 0.62200
Discrete input, old approximation
❯ src/plfit ../data/discrete_data.txt -p approximate
../data/discrete_data.txt:
Discrete MLE
alpha = 2.58330
xmin = 2.00000
L = -9155.61701
D = 0.00433
p = 0.99995 (approximation)
Continuous input, old approximation
❯ src/plfit ../data/discrete_data.txt -p approximate -c
../data/discrete_data.txt:
Continuous MLE
alpha = 2.75401
xmin = 25.00000
L = -295.89683
D = 0.05713
p = 0.97629 (approximation)
Which of these results (if any) are you getting @krlmlr ?
@krlmlr are the next steps
- your trying again
- and then... the change being documented?
I added a stub. To be continued:
- C: Implement
R_igraph_power_law_fit_new()ininterface-extra.c, forwarding toR_igraph_power_law_fit()and toigraph_igraph_plfit_result_calculate_p_value()in case the p-value is to be calculated - R: Add
p.valueargument topower.law.fit.new()
@Antonov548: can you please pick up my PR #1386?
@krlmlr Thanks for addressing this issue. Just a quick question: Do I understand it correctly that you aim at having a new implementation variant "plfit.p" or a new function power.law.fit.new() instead of fixing the previously existing implementation "plfit" in fit_power_law? If so, may I ask why you add a new function / a new parameter value instead of fixing the existing function and implementation variant "plfit"? If the interface / parameter value changes, that would imply breaking changes to all users, but I don't see any reason for breaking changes instead of just fixing the previous implementation. Or is this just for debugging / testing and it will eventually replace the existing functionality using the previous interface?
Good point. The reason for distinguishing this is that the computation of the p-value is expensive and optional. What should the default be in your opinion? Slow and batteries included, or faster with potential for confusion?
I guess this boils down to understanding how much the overhead really is for real-world scenarios. @ntamas: is that microseconds, milliseconds, or seconds, for common use cases?
Seconds or even more, compared to milliseconds with the previous version (which used an approximation that was not even scientifically valid).
Let's have the user opt in, then. It now remains a question whether to add a new argument or keep the proposal (and also document the reason for this being opt in). I'm fine either way.
Good point. The reason for distinguishing this is that the computation of the p-value is expensive and optional. What should the default be in your opinion? Slow and batteries included, or faster with potential for confusion?
Good question. If you ask me that without knowing about the previous interface, I would probably vote for the faster variant. And if it is even more than seconds to compute the p value, it would be reasonable that the faster variant should be the default. The other question is: Do you have any usage statistics on how users use this function? I cannot imagine using power-law fitting without obtaining a p-value. Without a p-value, the whole function is completely useless in my context. And there are also other users that missed the p-value already (see the other issue linked above). So, if almost everybody needs the p-value anyway (as in my case), I would vote for the p-value calculation being the default even if it takes longer, because then the fast variant would be inappropriate in almost all cases and who does not need the p-value should indicate this when using the function. So, I am torn between the two different arguments for the two different choices of a default value.
Let's have the user opt in, then. It now remains a question whether to add a new argument or keep the proposal (and also document the reason for this being opt in). I'm fine either way.
I would suggest to just keep "plfit" and remove "plfit.p" as I assume that both implementations lead to exactly the same result except for one not returning a p-value, right? So, as a user, it would be confusing for me why to differentiate between two implementations if they are actually the same. Instead, I would add an optional parameter "include.p.value" that indicates whether a p-value should be returned. This could also be used for other implementations than "plfit" - then it is to make sure that either always a p-value is returned or not, independent of which implementation method is chosen (if this makes sense - I actually don't know about other methods than "plfit"). And this is a breaking change, it would also sound more reasonable to me that a new optional parameter is added that allows additional configuration than changing the method that I have used for years... But, that's just my spontaneous thoughts on that. Not sure how others think about it.
I would add an optional parameter "include.p.value"
This makes sense to me as well.
@maelle: Can you please help with refining the interface as needed?
@maelle: To recap:
- We need an extension of the C core. I started from the R side in #1386, we need @Antonov548's support to provide the relevant C counterpart, as described in https://github.com/igraph/rigraph/issues/1158#issuecomment-2143521483, and also a bit more on the R side
- We also need tests
Thanks for addressing this issue in PR #1386 @krlmlr @maelle. However, I would like to re-open this issue, because it has not been fixed in the intended way:
In PR #1386, you have changed two R functions, namely fit_power_law and power.law.fit.new. power.law.fit.new has been extended by a new parameter p.value, but fit_power_law not. And fit_power_law is the function which the end users use. Consequently, the implementation in PR #1386 is not the proposed solution that @szhorvat has agreed on. See comments https://github.com/igraph/rigraph/issues/1158#issuecomment-2143791347 and https://github.com/igraph/rigraph/issues/1158#issuecomment-2143796048.
So, could you please remove plfit.p from the implementation parameter and instead add a include.p.value parameter, as reasoned in the discussion above? Thanks!
PR welcome :wink: Thanks in any case!
I'm seeing, with the current main branch:
options(conflicts.policy = list(warn = FALSE))
library(igraph)
g <- sample_pa(1000) # increase this number to have a better estimate
d <- degree(g, mode = "in")
fit_power_law(d + 1, 10, implementation = "plfit.p")
#> $continuous
#> [1] FALSE
#>
#> $alpha
#> [1] 2.750385
#>
#> $xmin
#> [1] 10
#>
#> $logLik
#> [1] -48.99338
#>
#> $KS.stat
#> [1] 0.09369558
#>
#> $KS.p
#> [1] 0
Created on 2024-09-05 with reprex v2.1.0
Is the returned p-value correct?
I'd like to see a test in our codebase with a "known good" example.
After establishing correctness, we can discuss the user interface.
@ntamas Can you see above? Isn't the method stochastic, meaning that a specific value shouldn't be tested for?
Are there any news on this? (Sorry for asking again.)
Unfortunately, I on myself cannot comment on the correctness of the p value.
However, I can provide a draft for the implementation of the discussed interface that gets include.p.value as a new parameter instead of using the cumbersome "plfit.p":
fit_power_law <- function(
x,
xmin = NULL,
start = 2,
force.continuous = FALSE,
implementation = c("plfit", "R.mle"),
include.p.value = FALSE, # default is FALSE, to prevent long computation times if p value is not needed
...) {
implementation <- igraph.match.arg(implementation)
if (implementation == "r.mle") {
power.law.fit.old(x, xmin, start, ...)
# TODO: not sure what to do with the p-value here. p-value should be removed if include.p.value is FALSE...
} else if (implementation == "plfit") {
if (is.null(xmin)) xmin <- -1
power.law.fit.new(
x,
xmin = xmin,
force.continuous = force.continuous,
p.value = include.p.value
)
}
}
Thanks. After establishing correctness, we can discuss the user interface. I'm still waiting for a good example I can include in the tests.
Tamás gave a few examples in this thread above, are those something you can use for tests?
@ntamas: Can you please help translate these examples into R? I see some plfit executable is used, I can't parse that.
I know that Tamás is overwhelmed now, and has deadlines coming up, including our new grant application.
The test input files are here: https://github.com/ntamas/plfit/tree/main/data You know best how to read these with R.
The output he quoted states the parameters you should use (names are same as in R), as well as the output you should expect. Since the method is stochastic, expect a variation of the output, and use comparison with tolerance. Setting a seed is insufficient, as the implementation also supports parallelization.
For some reason, I thought the input data must be a graph. I never realized the input is just a vector of numbers 🤦
Yeah, it just fits a Pareto distribution.