rigraph icon indicating copy to clipboard operation
rigraph copied to clipboard

p-value missing for `fit_power_law()`

Open krlmlr opened this issue 1 year ago • 19 comments

What happens?

fit_power_law() no longer returns a KS.p component.

To Reproduce

igraph 1.6.0

options(conflicts.policy = list(warn = FALSE))
library(igraph)
g <- sample_pa(1000) # increase this number to have a better estimate
d <- degree(g, mode = "in")
fit_power_law(d + 1, 10)
#> $continuous
#> [1] FALSE
#> 
#> $alpha
#> [1] 2.738748
#> 
#> $xmin
#> [1] 10
#> 
#> $logLik
#> [1] -62.25711
#> 
#> $KS.stat
#> [1] 0.1037705
#> 
#> $KS.p
#> [1] 0.986668

Created on 2024-01-23 with reprex v2.1.0

igraph 2.0.0

options(conflicts.policy = list(warn = FALSE))
library(igraph)
g <- sample_pa(1000) # increase this number to have a better estimate
d <- degree(g, mode = "in")
fit_power_law(d + 1, 10)
#> $continuous
#> [1] FALSE
#> 
#> $alpha
#> [1] 3.226469
#> 
#> $xmin
#> [1] 10
#> 
#> $logLik
#> [1] -63.90509
#> 
#> $KS.stat
#> [1] 0.07306791

Created on 2024-01-23 with reprex v2.1.0

krlmlr avatar Jan 23 '24 18:01 krlmlr

Are there any news on this issue? I have seen that you declared this bug as "breaking change" of igraph 2.0.0 in your NEWS.md ("fit_power_law() no longer returns a KS.p component.")

However, without KS.p component, the function fit_power_law() is useless, at least, in the situations in which I need to use this function for power-law fitting. So, declaring this a "breaking change" is not a solution to the problem. Did you already investigate what had happened and why the component is missing now? Could you please fix this problem as soon as possible in your next release? I guess I am not the only user who needs the KS.p component of power-law fitting. Thanks in advance!

bockthom avatar Apr 12 '24 12:04 bockthom

The underlying C core function no longer returns this component. My attempts to use the new igraph_plfit_result_calculate_p_value() didn't give satisfactory results, so I gave up.

@jhollway: You mentioned in https://github.com/stocnet/migraph/issues/284 that you could compute the p-value yourself. Would you like to share a pointer?

@szhorvat @ntamas: Appreciate your support here too.

krlmlr avatar Apr 13 '24 15:04 krlmlr

In theory it should be simple:

  • igraph_power_law_fit() gives you an igraph_plfit_result_t object, which is the result of the fitting procedure.
  • igraph_igraph_plfit_result_calculate_p_value() requires the result object as an input, along with a precision (say, 0.001), and then it gives you the p-value in an output argument.

ntamas avatar Apr 13 '24 15:04 ntamas

I remember testing with the example given here, and the p-value came out as zero.

@Antonov548: can you please give it another try?

krlmlr avatar Apr 13 '24 15:04 krlmlr

That is an entirely valid p-value, depending on the input.

Try data/discrete_data.txt from https://github.com/ntamas/plfit as an input. When using the exact p-value calculation method, the p-value should turn out to be around 0.63 (alpha = 2.5833, xmin = 2, L = -9155.6).

ntamas avatar Apr 13 '24 17:04 ntamas

I ran with the example in the OP, there the p-value comes out as 0.98. Are you expecting consistency in the p-values between igraph 1.6.0 and igraph >= 2.0.0?

krlmlr avatar Apr 14 '24 18:04 krlmlr

No, the calculation changed between igraph 0.9 and 0.10. The previous calculation was faster but poorer.

See https://github.com/igraph/igraph/blob/master/CHANGELOG.md :

igraph_power_law_fit() does not calculate the p-value automatically any more because the previous estimation method did not match the results from the original paper of Clauset, Shalizi and Newman (2009) and the implementation of the method outlined in the paper runs slower than the previous naive estimate. A separate function named igraph_plfit_result_calculate_p_value() is now provided for calculating the p-value. The automatic selection of the x_min cutoff also uses a different method than earlier versions. As a consequence, results might be slightly different if you used tests where the x_min cutoff was selected automatically. The new behaviour is now consistent with the defaults of the underlying plfit library.

szhorvat avatar Apr 14 '24 19:04 szhorvat

I ran with the example in the OP, there the p-value comes out as 0.98

I can get 0.98 only with the current implementation of plfit if I use the older approximation method and also force the system to treat the incoming sample as continuous instead of discrete.

Discrete input, exact method

❯ src/plfit ../data/discrete_data.txt -p exact
../data/discrete_data.txt:
	Discrete MLE
	alpha =      2.58330
	xmin  =      2.00000
	L     =  -9155.61701
	D     =      0.00433
	p     =      0.62280

Continuous input, exact method

❯ src/plfit ../data/discrete_data.txt -p exact -c
../data/discrete_data.txt:
	Continuous MLE
	alpha =      2.75401
	xmin  =     25.00000
	L     =   -295.89683
	D     =      0.05713
	p     =      0.62200

Discrete input, old approximation

❯ src/plfit ../data/discrete_data.txt -p approximate
../data/discrete_data.txt:
	Discrete MLE
	alpha =      2.58330
	xmin  =      2.00000
	L     =  -9155.61701
	D     =      0.00433
	p     =      0.99995 (approximation)

Continuous input, old approximation

❯ src/plfit ../data/discrete_data.txt -p approximate -c
../data/discrete_data.txt:
	Continuous MLE
	alpha =      2.75401
	xmin  =     25.00000
	L     =   -295.89683
	D     =      0.05713
	p     =      0.97629 (approximation)

Which of these results (if any) are you getting @krlmlr ?

ntamas avatar Apr 14 '24 20:04 ntamas

@krlmlr are the next steps

  • your trying again
  • and then... the change being documented?

maelle avatar May 23 '24 06:05 maelle

I added a stub. To be continued:

  • C: Implement R_igraph_power_law_fit_new() in interface-extra.c, forwarding to R_igraph_power_law_fit() and to igraph_igraph_plfit_result_calculate_p_value() in case the p-value is to be calculated
  • R: Add p.value argument to power.law.fit.new()

@Antonov548: can you please pick up my PR #1386?

krlmlr avatar Jun 01 '24 17:06 krlmlr

@krlmlr Thanks for addressing this issue. Just a quick question: Do I understand it correctly that you aim at having a new implementation variant "plfit.p" or a new function power.law.fit.new() instead of fixing the previously existing implementation "plfit" in fit_power_law? If so, may I ask why you add a new function / a new parameter value instead of fixing the existing function and implementation variant "plfit"? If the interface / parameter value changes, that would imply breaking changes to all users, but I don't see any reason for breaking changes instead of just fixing the previous implementation. Or is this just for debugging / testing and it will eventually replace the existing functionality using the previous interface?

bockthom avatar Jun 01 '24 17:06 bockthom

Good point. The reason for distinguishing this is that the computation of the p-value is expensive and optional. What should the default be in your opinion? Slow and batteries included, or faster with potential for confusion?

I guess this boils down to understanding how much the overhead really is for real-world scenarios. @ntamas: is that microseconds, milliseconds, or seconds, for common use cases?

krlmlr avatar Jun 02 '24 10:06 krlmlr

Seconds or even more, compared to milliseconds with the previous version (which used an approximation that was not even scientifically valid).

ntamas avatar Jun 02 '24 10:06 ntamas

Let's have the user opt in, then. It now remains a question whether to add a new argument or keep the proposal (and also document the reason for this being opt in). I'm fine either way.

krlmlr avatar Jun 02 '24 10:06 krlmlr

Good point. The reason for distinguishing this is that the computation of the p-value is expensive and optional. What should the default be in your opinion? Slow and batteries included, or faster with potential for confusion?

Good question. If you ask me that without knowing about the previous interface, I would probably vote for the faster variant. And if it is even more than seconds to compute the p value, it would be reasonable that the faster variant should be the default. The other question is: Do you have any usage statistics on how users use this function? I cannot imagine using power-law fitting without obtaining a p-value. Without a p-value, the whole function is completely useless in my context. And there are also other users that missed the p-value already (see the other issue linked above). So, if almost everybody needs the p-value anyway (as in my case), I would vote for the p-value calculation being the default even if it takes longer, because then the fast variant would be inappropriate in almost all cases and who does not need the p-value should indicate this when using the function. So, I am torn between the two different arguments for the two different choices of a default value.

bockthom avatar Jun 02 '24 10:06 bockthom

Let's have the user opt in, then. It now remains a question whether to add a new argument or keep the proposal (and also document the reason for this being opt in). I'm fine either way.

I would suggest to just keep "plfit" and remove "plfit.p" as I assume that both implementations lead to exactly the same result except for one not returning a p-value, right? So, as a user, it would be confusing for me why to differentiate between two implementations if they are actually the same. Instead, I would add an optional parameter "include.p.value" that indicates whether a p-value should be returned. This could also be used for other implementations than "plfit" - then it is to make sure that either always a p-value is returned or not, independent of which implementation method is chosen (if this makes sense - I actually don't know about other methods than "plfit"). And this is a breaking change, it would also sound more reasonable to me that a new optional parameter is added that allows additional configuration than changing the method that I have used for years... But, that's just my spontaneous thoughts on that. Not sure how others think about it.

bockthom avatar Jun 02 '24 10:06 bockthom

I would add an optional parameter "include.p.value"

This makes sense to me as well.

szhorvat avatar Jun 02 '24 10:06 szhorvat

@maelle: Can you please help with refining the interface as needed?

krlmlr avatar Jun 02 '24 16:06 krlmlr

@maelle: To recap:

  • We need an extension of the C core. I started from the R side in #1386, we need @Antonov548's support to provide the relevant C counterpart, as described in https://github.com/igraph/rigraph/issues/1158#issuecomment-2143521483, and also a bit more on the R side
  • We also need tests

krlmlr avatar Jul 04 '24 09:07 krlmlr

Thanks for addressing this issue in PR #1386 @krlmlr @maelle. However, I would like to re-open this issue, because it has not been fixed in the intended way:

In PR #1386, you have changed two R functions, namely fit_power_law and power.law.fit.new. power.law.fit.new has been extended by a new parameter p.value, but fit_power_law not. And fit_power_law is the function which the end users use. Consequently, the implementation in PR #1386 is not the proposed solution that @szhorvat has agreed on. See comments https://github.com/igraph/rigraph/issues/1158#issuecomment-2143791347 and https://github.com/igraph/rigraph/issues/1158#issuecomment-2143796048.

So, could you please remove plfit.p from the implementation parameter and instead add a include.p.value parameter, as reasoned in the discussion above? Thanks!

bockthom avatar Sep 02 '24 09:09 bockthom

PR welcome :wink: Thanks in any case!

maelle avatar Sep 02 '24 11:09 maelle

I'm seeing, with the current main branch:

options(conflicts.policy = list(warn = FALSE))
library(igraph)
g <- sample_pa(1000) # increase this number to have a better estimate
d <- degree(g, mode = "in")
fit_power_law(d + 1, 10, implementation = "plfit.p")
#> $continuous
#> [1] FALSE
#> 
#> $alpha
#> [1] 2.750385
#> 
#> $xmin
#> [1] 10
#> 
#> $logLik
#> [1] -48.99338
#> 
#> $KS.stat
#> [1] 0.09369558
#> 
#> $KS.p
#> [1] 0

Created on 2024-09-05 with reprex v2.1.0

Is the returned p-value correct?

I'd like to see a test in our codebase with a "known good" example.

After establishing correctness, we can discuss the user interface.

krlmlr avatar Sep 05 '24 09:09 krlmlr

@ntamas Can you see above? Isn't the method stochastic, meaning that a specific value shouldn't be tested for?

szhorvat avatar Sep 05 '24 09:09 szhorvat

Are there any news on this? (Sorry for asking again.)

Unfortunately, I on myself cannot comment on the correctness of the p value. However, I can provide a draft for the implementation of the discussed interface that gets include.p.value as a new parameter instead of using the cumbersome "plfit.p":

fit_power_law <- function(
    x,
    xmin = NULL,
    start = 2,
    force.continuous = FALSE,
    implementation = c("plfit", "R.mle"),
    include.p.value = FALSE, # default is FALSE, to prevent long computation times if p value is not needed
    ...) {

  implementation <- igraph.match.arg(implementation)

  if (implementation == "r.mle") {
    power.law.fit.old(x, xmin, start, ...)
    # TODO: not sure what to do with the p-value here. p-value should be removed if include.p.value is FALSE...

  } else if (implementation == "plfit") {
    if (is.null(xmin)) xmin <- -1
    power.law.fit.new(
      x,
      xmin = xmin,
      force.continuous = force.continuous,
      p.value = include.p.value
    )
  }
}

bockthom avatar Sep 23 '24 15:09 bockthom

Thanks. After establishing correctness, we can discuss the user interface. I'm still waiting for a good example I can include in the tests.

krlmlr avatar Sep 26 '24 08:09 krlmlr

Tamás gave a few examples in this thread above, are those something you can use for tests?

szhorvat avatar Sep 26 '24 08:09 szhorvat

@ntamas: Can you please help translate these examples into R? I see some plfit executable is used, I can't parse that.

krlmlr avatar Sep 26 '24 08:09 krlmlr

I know that Tamás is overwhelmed now, and has deadlines coming up, including our new grant application.

The test input files are here: https://github.com/ntamas/plfit/tree/main/data You know best how to read these with R.

The output he quoted states the parameters you should use (names are same as in R), as well as the output you should expect. Since the method is stochastic, expect a variation of the output, and use comparison with tolerance. Setting a seed is insufficient, as the implementation also supports parallelization.

szhorvat avatar Sep 26 '24 08:09 szhorvat

For some reason, I thought the input data must be a graph. I never realized the input is just a vector of numbers 🤦

krlmlr avatar Sep 26 '24 23:09 krlmlr

Yeah, it just fits a Pareto distribution.

szhorvat avatar Sep 27 '24 08:09 szhorvat