Comparison of npstat vs Grenedalf: strong correlations but systematic biases
I have previously used npstat to calculate theta pi, theta Watterson, and Tajima’s D, but I am very interested in Grenedalf’s greater flexibility, so I ran some empirical comparisons between the two tools.
I performed these tests on 8 populations from 4 insect species (2 populations per species), with genome sizes ranging from 250Mb to 2.5Gb. These populations consist of pooled samples of 40 to 50 individuals, with an average sequencing depth of approximately 100X. My tests involved window-based analyses using windows of 100,000 bases.
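For context, here is roughly how I align the per-window outputs of the two tools and compute the correlations. The file and column names come from my own post-processing, not from the tools' native output formats, so they are only illustrative:

```python
# Rough sketch of the per-window comparison (column names reflect my own
# post-processing, not the tools' native outputs; adapt as needed).
import pandas as pd
from scipy.stats import pearsonr

# One table per tool, one row per 100 kb window:
# columns: chrom, start, theta_pi, theta_w, tajimas_d
npstat_df = pd.read_csv("npstat_windows.tsv", sep="\t")
grenedalf_df = pd.read_csv("grenedalf_windows.tsv", sep="\t")

# Align windows by chromosome and start position, then correlate each statistic.
merged = npstat_df.merge(
    grenedalf_df, on=["chrom", "start"], suffixes=("_npstat", "_grenedalf")
)
for stat in ["theta_pi", "theta_w", "tajimas_d"]:
    pair = merged[[f"{stat}_npstat", f"{stat}_grenedalf"]].dropna()
    r, _ = pearsonr(pair[f"{stat}_npstat"], pair[f"{stat}_grenedalf"])
    print(f"{stat}: r = {r:.3f} over {len(pair)} windows")
```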
The general trends I observe are as follows:
- A very high correlation between the two methods for all three statistics (generally r > 0.95), except for one of the four species (details below).
- A systematic bias: npstat tends to yield higher values for theta pi, whereas Grenedalf tends to yield higher values for theta Watterson; no clear bias is observed for Tajima’s D. This trend is consistent across the 8 populations from the 4 species.
Among the four species tested, I observed an exception with the species having the largest genome (2.5Gb, with 10 chromosomes). In this case, the correlations for theta pi and theta Watterson are much lower, between 0.6 and 0.7. However, the correlations for Tajima’s D remain very high (>0.97). Notably, for this species only, I also tested larger window sizes (500,000 and 1,000,000 bases), but no improvement was observed (the figures were very similar, and there was even a slight decrease in correlations).
Have you ever observed this kind of trend? Do you have any idea what might be causing the systematic biases in theta estimates? Overall, I am unsure which tool yields the most reliable results, and I would be very interested in any feedback or comparative evaluations.
Thanks in advance to the community!
Hi Eric,
thank you for raising these questions and for reporting your experience with the tools on your data! Those are good points worth investigating (for your sake, and for the whole community), and I'd be happy to assist as far as I can.
Have you ever observed this kind of trend?
No - outside of the comparisons performed for the grenedalf publication, I have not used npstat at all. I find its usage rather cumbersome (it only works on single populations and single chromosomes), and its code is a bit messy, so I saw no easy way to fix that.
Do you have any idea what might be causing the systematic biases in theta estimates?
I assume that you have made sure to set all parameters and filters to the same values? The tools have different defaults for settings such as minimum coverage and allele frequency thresholds, so those at least need to be provided explicitly for both tools to make sure they are working with the same filters.
Also, what other differences are there between your datasets, apart from genome size? Are read depths and pool sizes roughly the same across all of them? Did they use the same sequencing technology (which might otherwise cause differences due to sequencing errors)?
I've also just had another look at their paper, and could not find any glaring differences that would explain your observations. Most likely, the differences are due to subtle aspects of the code, such as window normalization, order of filters, or potentially slight differences in the actual pop gen equations.
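To make the point about the equations a bit more concrete: both tools start from the classic estimators below, but each applies its own Pool-seq corrections for pool size, read depth, and minimum allele count on top, and those corrections are where I would expect subtle discrepancies to creep in. The formulas here are the textbook versions, not the exact expressions implemented in either tool:

$$
\hat\theta_W = \frac{S}{\sum_{i=1}^{n-1} \tfrac{1}{i}},
\qquad
\hat\theta_\pi = \frac{n}{n-1} \sum_{\text{sites}} 2\,p\,(1-p),
\qquad
D = \frac{\hat\theta_\pi - \hat\theta_W}{\sqrt{\widehat{\operatorname{Var}}\!\left(\hat\theta_\pi - \hat\theta_W\right)}}
$$

where $S$ is the number of segregating sites in the window, $n$ the number of sampled haplotypes, and $p$ the sample allele frequency at a site. In the Pool-seq setting, $n$ is effectively replaced by combinations of pool size and read depth, which is exactly where the two implementations may diverge slightly.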
Overall, I am unsure which tool yields the most reliable results, and I would be very interested in any feedback or comparative evaluations.
Your empirical tests give a first hint. To get to the bottom of this though, I think it might be necessary to look at synthetic small test cases instead. For instance, when we were investigating the differences to PoPoolation, we looked at pre-defined inputs of exact data for which we could compute the ground truth by hand (or by a small function), for all permutations of a range of pool sizes and read depths. You can find these tests here: https://github.com/lczech/grenedalf/tree/master/test
These tests simply use the pool size and read depth, and pretend that those are exact, so that we can compare the output of the tools to the ground truth without any obscuring factors such as errors, uneven coverage, or missing data. My suggestion is to adapt this to also run npstat. The main test script execute_tests.py includes tests for PoPoolation, and can be expanded to do the same for npstat. The script create_files.py produces mpileups that can probably directly be used for that.
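As a starting point, a wrapper along these lines could be dropped next to the existing test scripts; it simply runs npstat on each mpileup over a grid of pool sizes, mirroring what the PoPoolation tests do. This is only a sketch: the npstat flag names should be double-checked against its documentation, the pool sizes and paths are placeholders, and the parsing of npstat's per-window output still needs to be added:

```python
# Hypothetical sketch of an npstat runner for the grenedalf test setup.
# Flag names and output handling should be verified against npstat's docs.
import subprocess
from pathlib import Path

def run_npstat(pileup: Path, pool_size: int, window: int = 1000,
               min_cov: int = 4, max_cov: int = 1000) -> subprocess.CompletedProcess:
    """Run npstat on a single mpileup file for one pool size."""
    cmd = [
        "npstat",
        "-n", str(pool_size),    # haploid pool size
        "-l", str(window),       # window length in bases
        "-mincov", str(min_cov),
        "-maxcov", str(max_cov),
        str(pileup),
    ]
    # npstat writes its per-window statistics to its own output; parse that
    # according to its documented format, and compare the theta and Tajima's D
    # columns against the ground-truth values from the test setup.
    return subprocess.run(cmd, capture_output=True, text=True)

if __name__ == "__main__":
    # Placeholder paths and pool sizes: adapt to the mpileups from create_files.py.
    for pileup in sorted(Path("mpileup").glob("*.pileup")):
        for pool_size in (40, 80, 100):
            result = run_npstat(pileup, pool_size)
            print(pileup.name, pool_size, result.returncode)
```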
Unfortunately, I'm a bit short on time at the moment due to grant writing deadlines, but might get back to this later this summer. But if you want, you can start experimenting with this - happy to help with any questions or issues, and we can keep this thread open here for that.
Lastly, I have long meant to implement the remaining estimators that npstat offers (Fu and Li’s D and F, Fay and Wu’s H, McDonald-Kreitman and HKA tests), but will unfortunately not find the time in the foreseeable future. But I hope to get around to this eventually!
Hope that helps, and curious to hear more!
Lucas
Hi Lucas,
Thanks for your response.
Yes, of course, I did my best to obtain comparable analyses from the two tools by carefully matching the various parameters and filters. Where I had doubts about whether certain filters were equivalent, I ran a few additional checks, which had a negligible impact on the results.
Regarding other differences/similarities between the datasets:
- Sequencing depths are fairly similar (~100X). Only one of the four species has a lower depth (~80X), but it’s not the one with the largest genome.
- The sequencing technologies are the same, except for one species, which happens to be one of those showing a good correlation.
- Genome quality varies, ranging from highly fragmented (~400 scaffolds for one species) to chromosome-level assemblies. The species with the largest genome has a well-assembled genome at the chromosome level.
- The number of individuals is 50 (100 haplotypes) for all populations, except for those of the species with the largest genome, where pools consist of 40 individuals (80 haplotypes). This is the only clear difference for those populations (which happen to yield the lowest correlations).
Thanks for the suggestion of a complementary analysis. I’ll try to find some time to explore that avenue and will definitely keep you updated.
Best,
Eric
Hi Eric,
thanks for the details! Nothing there jumps out at me as a potential cause for the discrepancies you are observing, so I guess this needs a bit of a deeper dive.
Let me know if I can help there - and looking forward to hearing about what you find!
Cheers and so long,
Lucas