cDNA_Cupcake
Some questions about call_variant in isoPhase
Hi Liz,
I'm using IsoPhase to assign haplotypes to isoforms, but I have some questions after reading your source code. Here https://github.com/Magdoll/cDNA_Cupcake/blob/master/phasing/io/MPileUpVariantCaller.py#L135, you use Fisher's exact test to calculate the p-value, but why do you then use pval *= self.number_of_tests to increase the p-value before testing for significance?
In my case with smaller read coverage (e.g. 11 reads: 6 for the reference allele, 5 for the SNP), even if the p-value from Fisher's exact test is significant (less than 0.05), the position is still filtered out because of the larger p-value after multiplying by number_of_tests. It seems IsoPhase works better with large read coverage (e.g. 40 reads) to avoid the case I met, but how can I handle the case with smaller read coverage? Thanks.
Best, Feng Xu
Hi @CrazyHsu ,
The pval *= self.number_of_tests
is adjusting for multiple testing. It is not related to the read coverage but to the number of Fisher exact tests that are being done.
-Liz
Hi @Magdoll ,
Yes, in my case 11 reads are mapped with 7 candidate allele positions; the reference allele is covered by 6 reads and the alternative allele by 5 reads.
In this case, the expression stats.fisher_exact([[count, r.clean_cov-count], [exp, r.clean_cov-exp]], alternative='greater')
for a single allele becomes stats.fisher_exact([[5, 11-5], [0.005*11, 11-0.005*11]], alternative='greater')
, and the p-value is 0.0227, which seems significant (<0.05).
But after adjusting for multiple testing with pval *= self.number_of_tests
, the p-value becomes 0.1589 (0.0227*7), which is not significant, so the position is filtered out.
When it comes to the situation with 41 reads, 21 supporting the reference allele and 20 the alternative allele, the p-value of the Fisher exact test is 5.7329e-08, and after multiplying by number_of_tests the p-value is still less than 0.05, i.e. significant. So, how can I handle this situation? Thanks.
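The two scenarios above can be reproduced with a minimal pure-Python sketch of the one-sided Fisher's exact test (using math.comb instead of scipy, and rounding the expected error count 0.005*11 down to 0, so the first p-value comes out slightly below the 0.0227 quoted above):

```python
from math import comb

def fisher_greater(a, b, c, d):
    """One-sided Fisher's exact test (alternative='greater') on [[a, b], [c, d]].
    p-value = P(X >= a) where X ~ Hypergeometric(n, a+c, a+b)."""
    n, row1, col1 = a + b + c + d, a + b, a + c
    denom = comb(n, row1)
    return sum(comb(col1, x) * comb(n - col1, row1 - x)
               for x in range(a, min(row1, col1) + 1)) / denom

n_tests = 7  # number of candidate positions being tested, as in the thread

# 11 reads: 5 alternative-allele reads vs ~0 expected sequencing errors.
p_low = fisher_greater(5, 6, 0, 11)
# p_low is significant on its own (< 0.05), but p_low * n_tests is not.

# 41 reads: 20 alternative-allele reads vs ~0 expected errors.
p_high = fisher_greater(20, 21, 0, 41)
# p_high is so small that it survives the Bonferroni multiplication easily.
```

This makes the coverage effect concrete: at low coverage the raw p-value sits just below 0.05, so multiplying by the number of tests pushes it over the cutoff, while at higher coverage the raw p-value has enough headroom to survive.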
Feng Xu
Hi @CrazyHsu ,
That is a very good point. Mechanistically (setting aside whether this is statistically a good idea), you can run run_phaser.py
with a different p-value cutoff using the parameter --pval_cutoff
.
The other option is that I can add a parameter to turn off the multiple-testing correction and have users use it at their own risk.
Do you have a preference? -Liz
Hi @Magdoll ,
I'm not very good at statistics, so I don't quite understand the statistical implications of the multiple-testing correction. As for me, I prefer the latter option, without the correction, but I don't know whether the results would still be statistically significant.
Best, Feng Xu
@CrazyHsu let me think about this and get back to you. I may implement other ways to correct for multiple testing that are not as strict as the Bonferroni correction.
@Magdoll Thank you for your prompt reply; I look forward to your new ideas.
Best, Feng Xu
Hi @CrazyHsu Please try out the experimental version of IsoPhase that allows using the Benjamini–Hochberg procedure instead of the Bonferroni correction, which should increase the sensitivity of identifying SNPs.
To do this you will need to switch to a different branch of Cupcake:
git clone https://github.com/Magdoll/cDNA_Cupcake.git
cd cDNA_Cupcake
git checkout magphase
python setup.py build
python setup.py install
Now there's a new parameter:
$ run_phaser.py --help
usage: run_phaser.py [-h] -o OUTPUT_PREFIX --strand {+,-} [--partial_ok]
                     [-p PVAL_CUTOFF] [--bhFDR BHFDR] [-n PLOIDY]
                     fastx_filename sam_filename mpileup_filename read_stat
                     mapping_filename

positional arguments:
  fastx_filename        Input FLNC fasta or fastq
  sam_filename
  mpileup_filename
  read_stat
  mapping_filename

optional arguments:
  -h, --help            show this help message and exit
  -o OUTPUT_PREFIX, --output_prefix OUTPUT_PREFIX
  --strand {+,-}
  --partial_ok
  -p PVAL_CUTOFF, --pval_cutoff PVAL_CUTOFF
  --bhFDR BHFDR         FDR to be used for the Benjamini–Hochberg correction.
                        Default: None (not used).
  -n PLOIDY, --ploidy PLOIDY
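With the new branch installed, a full invocation might look like the following (the filenames here are hypothetical placeholders for your own data):

```shell
run_phaser.py isoforms.flnc.fastq aligned.sam ccs.mpileup read_stat.txt mapping.txt \
    -o phased_test --strand + --bhFDR 0.01
```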
I recommend running run_phaser.py
with --bhFDR 0.01
as a first try, which sets a 1% FDR. If you want to increase sensitivity (but are likely to get more false calls), you can increase that number to, say, 0.02, 0.03, 0.05, etc.
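For intuition, the two corrections can be contrasted on a toy set of p-values (a minimal pure-Python sketch, not the Cupcake implementation):

```python
def bonferroni_calls(pvals, alpha=0.05):
    # Bonferroni: compare each raw p-value against alpha / m.
    m = len(pvals)
    return sum(p <= alpha / m for p in pvals)

def bh_calls(pvals, fdr=0.05):
    # Benjamini-Hochberg: find the largest rank k with p_(k) <= (k/m)*fdr;
    # every test ranked at or below k is called significant.
    m = len(pvals)
    k_max = 0
    for rank, p in enumerate(sorted(pvals), start=1):
        if p <= rank / m * fdr:
            k_max = rank
    return k_max

pvals = [0.001, 0.008, 0.012, 0.020, 0.3, 0.6, 0.9]
print(bonferroni_calls(pvals))  # 1
print(bh_calls(pvals))          # 4
```

On the same seven tests, Bonferroni calls only the single smallest p-value, while BH calls four, which is why switching the correction should recover more SNPs at low coverage.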
Please let me know how this goes. -Liz
Hi @Magdoll , Thanks for your new version, I will try it out and give you feedback.
Hi @Magdoll ,
I have tried the new version, and I found some issues.
As shown in the IGV screenshot below, you can clearly see at least 8 bases showing an allelic imbalance. Here is the ccs.mpileup file I used.
But after trying the new version: when bhFDR was set to 0.01, the result was no SNP_FOUND; when bhFDR was set to 0.05, only 6 SNPs were found. Neither result is consistent with the observation.
Then I tried to figure out the reason for the difference. I found that many sequencing errors were called by positions_to_call (this can also be clearly observed in the picture above), which enlarged the self.number_of_tests value. Accordingly, the value of bh_val is scaled down, leading to many allelic SNPs being filtered out. It seems sequencing errors can cause false negatives.
In my next attempt, I made some modifications to MPileUpVariantCaller.py, starting at line 166:

odds, pval = stats.fisher_exact([[count, r.clean_cov-count], [exp, r.clean_cov-exp]], alternative='greater')
if pval <= vc.pval_cutoff:  # with this filter, sequencing-error positions are not stored in pval_dict
    pval_dict[(pos, base)] = BHtuple(pval=pval, record=r)

bh_val = ((rank0+1)/len(keys_pos_base)) * vc.bhFDR  # only significant positions are used to adjust bh_val
With the modifications above, I got all the SNPs observed in IGV with bh_fdr set to 0.01.
I don't know if the modifications are statistically reasonable; what do you think of them? Looking forward to your opinion.
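The effect of the change can be sketched independently of the Cupcake internals. The numbers below are made up for illustration: a handful of genuinely significant positions plus many sequencing-error positions whose p-values are large but which still enter the test count:

```python
def bh_calls(pvals, fdr):
    """Benjamini-Hochberg: return how many sorted p-values are called significant."""
    m = len(pvals)
    k_max = 0
    for rank, p in enumerate(sorted(pvals), start=1):
        if p <= rank / m * fdr:
            k_max = rank
    return k_max

# 8 true allelic positions plus 30 sequencing-error positions that
# positions_to_call also emitted, with large (non-significant) p-values.
true_snps = [0.0005, 0.001, 0.002, 0.003, 0.004, 0.005, 0.006, 0.008]
errors = [0.3 + 0.01 * i for i in range(30)]

# BH over everything: the error tests inflate m and shrink every threshold,
# so nothing is called.
naive = bh_calls(true_snps + errors, fdr=0.01)

# The modification: filter at the raw p-value cutoff first, then run BH
# only on the survivors, so the error positions never dilute the ranking.
survivors = [p for p in true_snps + errors if p <= 0.01]
filtered = bh_calls(survivors, fdr=0.01)
```

In this toy setting the unfiltered BH calls zero positions while the pre-filtered BH recovers all eight, mirroring the false-negative behavior described above.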
Best, Feng Xu
Hi @CrazyHsu , I like this change a lot! I'm testing it out now and it should be in the next version of Cupcake in a few days! -Liz
Thanks for adopting it; looking forward to your next version!
Feng Xu