tales-science-data

Power law additions

Open martinapugliese opened this issue 8 years ago • 1 comments

Identify a PL graphically

Pareto quantile-quantile plots (Q-Q plots), or log-log plots.

Q-Q plot (make this a separate entry)

[from wikipedia]

A Q-Q plot shows the quantiles of two distributions plotted against each other. It is a parametric plot whose parameter is the number (index) of the quantile. If the two distributions are similar, the points lie on the diagonal. If they are linearly related, the points lie on a line (not necessarily the diagonal). Typically better than comparing histograms. A P-P plot is the same idea but plots the two cumulative distribution functions against each other.

For PLs (Pareto Q-Q plots), plot the quantiles of the log-transformed data against the quantiles of an exponential distribution with mean 1; points close to a straight line suggest a PL tail.
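A minimal sketch of the Pareto Q-Q construction, using only the standard library. The function names, the plotting position i / (n + 1), and the synthetic sample are assumptions made for illustration. The least-squares slope is used here only to summarise how straight the plot is, not as a recommended estimator of the exponent.

```python
import math
import random


def pareto_qq_points(data):
    """Return (exponential quantile, log of data quantile) pairs."""
    xs = sorted(data)
    n = len(xs)
    points = []
    for i, x in enumerate(xs, start=1):
        p = i / (n + 1)              # plotting position, strictly inside (0, 1)
        exp_q = -math.log(1.0 - p)   # quantile of an exponential with mean 1
        points.append((exp_q, math.log(x)))
    return points


def ols_slope(points):
    """Least-squares slope of y on x, used only to summarise the plot."""
    n = len(points)
    mean_x = sum(x for x, _ in points) / n
    mean_y = sum(y for _, y in points) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in points)
    sxx = sum((x - mean_x) ** 2 for x, _ in points)
    return sxy / sxx


random.seed(0)
tail_index = 2.0  # assumed value for the synthetic sample
sample = [random.paretovariate(tail_index) for _ in range(5000)]
points = pareto_qq_points(sample)
print("Q-Q slope:", ols_slope(points), "expected roughly", 1 / tail_index)
```

For Pareto data the log of the data is exponentially distributed, so the points should fall near a line whose slope is about one over the tail index.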

log-log plot

Check whether the points fall on a straight line; this is only reliable when there is a lot of data

Estimate the exponent

  • Fitting the logs with linear regression can lead to biased estimates (see powlaw-empirical-data.pdf & blog post)

MLE

see wikipedia
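As a sketch of what the MLE looks like in practice: for a continuous PL with density proportional to x^(-alpha) above a known x_min, the standard estimator is alpha = 1 + n / sum(ln(x_i / x_min)), with approximate standard error (alpha - 1) / sqrt(n). The function name and the synthetic sample below are assumptions, and x_min is taken as given rather than estimated.

```python
import math
import random


def mle_alpha(data, xmin):
    """MLE of alpha for a continuous power law p(x) ~ x**(-alpha), x >= xmin."""
    tail = [x for x in data if x >= xmin]
    n = len(tail)
    # log_sum is zero only if every point equals xmin; not handled in this sketch
    log_sum = sum(math.log(x / xmin) for x in tail)
    alpha = 1.0 + n / log_sum
    sigma = (alpha - 1.0) / math.sqrt(n)  # approximate standard error
    return alpha, sigma


random.seed(1)
true_alpha = 2.5  # assumed value for the synthetic sample
# random.paretovariate(a) has density exponent a + 1 and minimum value 1
sample = [random.paretovariate(true_alpha - 1.0) for _ in range(20000)]
alpha_hat, sigma = mle_alpha(sample, xmin=1.0)
print(f"alpha = {alpha_hat:.3f} +/- {sigma:.3f}")
```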

KS
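A possible way to compute the KS distance between the empirical CDF of the tail and a fitted PL, sketched with the standard library only. The function names are hypothetical, and the model CDF assumes a continuous PL with density exponent alpha above x_min.

```python
def power_law_cdf(x, alpha, xmin):
    """CDF of a continuous power law with density exponent alpha, for x >= xmin."""
    return 1.0 - (x / xmin) ** (-(alpha - 1.0))


def ks_distance(data, alpha, xmin):
    """Largest gap between the empirical CDF of the tail and the model CDF."""
    tail = sorted(x for x in data if x >= xmin)
    n = len(tail)
    d = 0.0
    for i, x in enumerate(tail, start=1):
        model = power_law_cdf(x, alpha, xmin)
        # the empirical CDF jumps from (i - 1) / n to i / n at x
        d = max(d, abs(model - i / n), abs(model - (i - 1) / n))
    return d


# toy values, chosen only to show the call
print(ks_distance([1.2, 1.5, 2.0, 3.5, 9.0], alpha=2.5, xmin=1.0))
```

A smaller distance means the fitted PL tracks the data more closely over the tail; scanning candidate x_min values and keeping the one with the smallest distance is one common way to choose the cut.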

PL and lognorm2

Plotting

log-log scale and logarithmic binning, see Newman review. Cumulative plotting (called frequency/rank), see Newman review. The cumulative plot is equivalent to the frequency/rank plot because:

  • Zipf and Pareto plots both show the cumulative distribution; one plots y vs x, the other swaps the axes (see Newman review).

  • finding the exponent: OLS does poorly; see the formulas in the Newman review (and the proof that the estimator is the MLE)

  • Few real-world distributions follow a PL over their whole range; most have a PL tail (see Newman's examples; try to find data and reproduce?). This makes finding the exponent harder because one needs to know where to cut

  • also see http://www.hpl.hp.com/research/idl/papers/ranking/ranking.html (cited by Newman)

  • maths of power laws, see Newman again; see particularly the discussion on the infinite mean and the distribution of wealth (80/20 rule, Lorenz curves)

  • Yule process for speciation, from Newman
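The link between rank/frequency and cumulative plots mentioned in this list can be checked on a toy sample: once values are sorted in decreasing order, the rank of a value is the number of observations at least as large, so rank / n is the empirical complementary cumulative distribution. The function names and data below are hypothetical, and the data are chosen without ties (with ties the two quantities can differ).

```python
def rank_frequency(data):
    """Return (value, rank / n) pairs, with rank 1 for the largest value."""
    xs = sorted(data, reverse=True)
    n = len(xs)
    return [(x, rank / n) for rank, x in enumerate(xs, start=1)]


def empirical_ccdf(data, x):
    """Fraction of observations greater than or equal to x."""
    return sum(1 for v in data if v >= x) / len(data)


data = [5.0, 1.0, 3.0, 2.0]  # toy values without ties
for value, scaled_rank in rank_frequency(data):
    # scaled rank and cumulative fraction coincide for every value
    assert scaled_rank == empirical_ccdf(data, value)
```

Plotting rank against value gives one of the two plots and swapping the axes gives the other.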

Generating power laws

[follows Newman]

  • combination of exponentials. See the example on word length and how Shannon formulated his theory [deserves a separate chapter]

  • rich get richer (aka preferential attachment), see Newman but expand?

  • Also see appendices in Newman

  • See and follow the recipe on page 3 of the powlaw-empirical Newman paper

  • see in particular the distribution fit comparisons in the powerlaw paper

  • from the same paper, note this: "the central limit theorem. When random variables are summed, the result is the normal distribution. However, when positive random variables are multiplied, the result is the lognormal distribution, which is quite heavy-tailed.", in generative mechanisms (read this in depth, though)

  • definition Newman cumulative and wikipedia definition: ???
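For experimenting with the mechanisms and recipes listed above it helps to have synthetic PL data. A minimal sketch using inverse transform sampling, assuming a continuous PL with density exponent alpha > 1 above x_min: a uniform number u in [0, 1) is mapped to x = x_min * (1 - u)^(-1 / (alpha - 1)). The function names are hypothetical.

```python
import random


def power_law_from_uniform(u, alpha, xmin):
    """Map a uniform number u in [0, 1) to a power-law distributed value."""
    return xmin * (1.0 - u) ** (-1.0 / (alpha - 1.0))


def sample_power_law(n, alpha, xmin, rng=random):
    """Draw n values by pushing uniform random numbers through the map above."""
    return [power_law_from_uniform(rng.random(), alpha, xmin) for _ in range(n)]


random.seed(2)
sample = sample_power_law(1000, alpha=2.5, xmin=1.0)
print(min(sample), max(sample))
```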

From plos python paper

  • "when presented on log axes should use logarithmic binning": refer to blog post on loglog

  • log binning means exponentially increasing bin widths

  • this is because, although linear binning has high resolution over the entire range, the small probability of observing high values prevents computing a reliable probability estimate there. Log binning increases the likelihood of observing values in each tail bin, which is compensated by normalising by the larger bin width. All of this deserves a blog post of its own

  • the powerlaw package uses log binning by default

  • the xmin discussion

  • the goodness of fit (in compare distributions)

  • the exponential being the minimum alternative for evaluating heavy-tailedness
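The log binning described in this list can be sketched by hand as follows. This is an illustration with hypothetical function names, not the powerlaw package's implementation: edges grow by a constant factor, and each count is divided by the number of points and by the bin width so that wide tail bins are not over-weighted.

```python
import bisect


def log_bin_edges(xmin, xmax, n_bins):
    """Bin edges whose widths grow by a constant factor."""
    ratio = (xmax / xmin) ** (1.0 / n_bins)
    edges = [xmin * ratio ** k for k in range(n_bins)]
    edges.append(xmax)  # pin the last edge to avoid rounding drift
    return edges


def log_binned_density(data, edges):
    """Counts per bin divided by (number of binned points * bin width)."""
    n_bins = len(edges) - 1
    counts = [0] * n_bins
    for x in data:
        if x < edges[0] or x > edges[-1]:
            continue  # outside the binned range
        # a point equal to the last edge goes into the last bin
        idx = min(bisect.bisect_right(edges, x) - 1, n_bins - 1)
        counts[idx] += 1
    total = sum(counts)
    if total == 0:
        return [0.0] * n_bins
    return [c / (total * (hi - lo)) for c, lo, hi in zip(counts, edges, edges[1:])]


edges = log_bin_edges(1.0, 1000.0, 3)
density = log_binned_density([1, 2, 3, 20, 500], edges)
print(edges, density)
```

Plotted on log-log axes, each density value would sit at a representative point of its bin, for example the geometric mean of its two edges.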

martinapugliese, Feb 11 '17 22:02

Also, for style:

  • [x] the plot title is not in LaTeX
  • [x] the references part: not all orange

martinapugliese, Feb 14 '17 09:02