
[DISCUSSION] Citation for Heap's law parameter values for Open/Closed pangenome

Open AshSudarshan opened this issue 6 months ago • 2 comments

Hi! Is there a reference behind the alpha value of 0.3 that the anvi'o documentation recommends for deciding whether a pangenome is relatively open or closed?

I am trying to use these thresholds in my work, and it would be helpful to have a reference I could look into.

Alternatively, are these thresholds ones that you (the anvi'o developers) have determined internally to define open/closed pangenomes? If so, is there a paper you would recommend that I cite for this cutoff?

I am specifically referring to the cutoffs mentioned here: https://merenlab.org/2016/11/08/pangenomics-v2/#calculating-rarefaction-curves-and-heaps-law-parameters

Thank you so much for your help!

AshSudarshan, Jun 17 '25 16:06

Hi @AshSudarshan. Thanks for bringing this up.

To be honest, there is no absolute threshold to cite here, and 0.3 is just another made-up cutoff, just like every other cutoff in biology :)

I'm sharing this from https://anvio.org/help/main/programs/anvi-compute-rarefaction-curves/ as a perspective on this topic:

On the utility of rarefaction curves and Heaps’ Law fit

Rarefaction curves are helpful in the analysis of pangenomes as they help visualize the discovery rate of new gene clusters as a function of increasing number of genomes. While a steep curve suggests that many new gene clusters are still being discovered, indicating incomplete coverage of the potential gene cluster space, a curve that reaches a plateau suggests sufficient sampling of gene cluster diversity.
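
To make the idea concrete, here is a minimal sketch of how such a curve can be computed. It is not how anvi'o itself does it. It assumes each genome is represented as a set of gene cluster names, and the function name `rarefaction_curve`, the toy genomes, and the choice of 100 random orderings are all hypothetical choices made for illustration. Genomes are added in random order and the cumulative number of distinct gene clusters is averaged over the orderings.

```python
import random


def rarefaction_curve(genome_to_clusters, iterations=100, seed=0):
    """Average cumulative gene cluster count over random genome orderings.

    genome_to_clusters: dict mapping a genome name to a set of gene cluster names.
    Returns a list whose i-th entry is the mean number of distinct gene
    clusters observed after adding i + 1 genomes.
    """
    rng = random.Random(seed)  # fixed seed so the sketch is reproducible
    genomes = list(genome_to_clusters)
    totals = [0.0] * len(genomes)

    for _ in range(iterations):
        rng.shuffle(genomes)
        seen = set()
        for i, genome in enumerate(genomes):
            seen |= genome_to_clusters[genome]  # add this genome's clusters
            totals[i] += len(seen)

    return [total / iterations for total in totals]


# Toy input (made up): three genomes sharing some gene clusters.
toy = {
    "genome_A": {"gc1", "gc2", "gc3"},
    "genome_B": {"gc1", "gc2", "gc4"},
    "genome_C": {"gc1", "gc5", "gc6"},
}

curve = rarefaction_curve(toy)
for n, value in enumerate(curve, start=1):
    print(f"{n} genome(s): {value:.2f} gene clusters on average")
```

A curve that keeps climbing as genomes are added corresponds to the "steep" case described above, while one that flattens corresponds to the plateau.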

However, rarefaction curves have inherent limitations. Because genome sampling is often biased and unlikely to fully capture the true genetic diversity of any taxon, rarefaction analysis provides only dataset-specific insights. Despite these limitations, rarefaction curves remain a popular tool for characterizing whether a pangenome is relatively ‘open’ (with continuous gene discovery) or ‘closed’ (where new genome additions contribute few or no new gene clusters). As long as you take such numerical summaries with a huge grain of salt, it is all fine.

Fitting Heaps’ Law to the rarefaction curve provides a quantitative measure of pangenome openness. The alpha value derived from Heaps’ Law (sometimes referred to as gamma in the literature) reflects how the number of new gene clusters scales with increasing genome sampling. There is no science to define an absolute threshold for an open or a closed pangenome. However, pangenomes with alpha values below 0.3 tend to be relatively closed, and those above 0.3 tend to be relatively open. Higher alpha values will indicate increasingly open pangenomes and lower values will identify progressively closed ones.
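
As a rough illustration of the fit described above (again, not anvi'o's own implementation), the sketch below assumes the common power-law form, gene clusters ≈ K * N^alpha, where N is the number of genomes, and estimates K and alpha by ordinary least squares in log-log space. The function names, the synthetic data, and the decision to treat a value exactly at the cutoff as "relatively closed" are assumptions made for this example.

```python
import math


def fit_heaps_law(num_genomes, num_gene_clusters):
    """Estimate K and alpha in gene_clusters ~ K * N**alpha.

    Uses a straight-line least-squares fit of log(gene_clusters) against
    log(N); the slope is alpha and the intercept is log(K).
    """
    xs = [math.log(n) for n in num_genomes]
    ys = [math.log(g) for g in num_gene_clusters]
    count = len(xs)
    mean_x = sum(xs) / count
    mean_y = sum(ys) / count
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    alpha = sxy / sxx
    k = math.exp(mean_y - alpha * mean_x)
    return k, alpha


def describe_openness(alpha, cutoff=0.3):
    """Label a pangenome using a cutoff that is a convention, not a law."""
    # Assumption: a value exactly at the cutoff is reported as closed.
    return "relatively open" if alpha > cutoff else "relatively closed"


# Synthetic rarefaction curve (made up) generated with K=2000 and alpha=0.45.
genomes = list(range(1, 51))
clusters = [2000.0 * n ** 0.45 for n in genomes]

k, alpha = fit_heaps_law(genomes, clusters)
print(f"K = {k:.1f}, alpha = {alpha:.3f}, {describe_openness(alpha)}")
```

A log-log linear fit is only one way to estimate these parameters; a nonlinear least-squares fit on the untransformed curve weights the points differently and can give somewhat different values, which is one more reason to treat any single cutoff with caution.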

IMO, you can simply justify the cutoff you are using in your work by saying something along the lines of "here we assumed X is a reasonable cutoff to distinguish open and closed pangenomes". If you really want to cite something and outsource the burden of justifying your choice of a number, there is a paper here,

https://bmcgenomics.biomedcentral.com/articles/10.1186/s12864-021-08223-8

which has been cited for the 0.3 cutoff at least once in the literature, in the following context,

(...) We have categorized the pangenomes into two categories as per Hyun et al. (2022), closed pangenome (λ < 0.3) and intermediate open (λ > 0.3).

even though the cited paper does not necessarily include a clear justification for 0.3. The paper that cites it this way is here:

https://www.sciencedirect.com/science/article/pii/S0740002023001211

I trust you will do what you must with this information :)

My 2 cents.

Best wishes, Meren

meren, Jun 17 '25 16:06

Thanks Meren! Yeah! This is really helpful!😃😃

Cheers, Ashwin

AshSudarshan, Jun 17 '25 17:06