Different Results Using EPACTS version 3.4
Dear All,
I am curious what the difference is between EPACTS versions 3.4 and 3.2.6 or 3.3.2, as I was unable to find proper documentation on the versions. I compared the results of all these versions: I gave the same input files and called the same method. The results from 3.2.6 and 3.3.2 were exactly the same (and make more sense given my data), whereas the result from 3.4 was totally different and does not make sense for my data. So I am curious: what is the difference between these versions?
P.S. The command I used for all these versions is as follows:
epacts-single --vcf filtered.vcf.gz --ped phenotype.ped --pheno PHENO --test b.glrt --anno --out Single_SNP_Analysis_COmplete --run 2 | tee log.txt
I look forward to hearing from you.
Kind Regards, Haider
The only difference that is known to affect results is the handling of multiallelic sites. Can you elaborate on "doesn't make sense according to my data"?
Thanks for your quick response. By "doesn't make sense" I mean that the VCF file I provided contained about 20k variants (none multiallelic; I had already filtered those out), but after the analysis I only got about 1,040 variants, whereas versions 3.2.6 and 3.3.2 returned the same number of variants as the input.
Secondly, the allele frequencies in cases and controls, as well as the overall allele frequencies, were wrong, whereas 3.2.6 and 3.3.2 gave the correct frequencies; I confirmed this by manual calculation as well.
So that was a bit strange to me. I thought that maybe 3.4 uses a different build, but I am still not sure what went wrong.
Ok. I'll test that exact command and get back to you. It might take a couple days to get to this though.
OK, that sounds perfect, thanks again. Looking forward to hearing from you.
Have a nice weekend :)
Can you pull the latest from the develop branch and give it a go? You will need to reinstall the savvy dependency as well after pulling the latest (cget remove statgen/savvy; cget install -f requirements.txt).
There were two issues that I fixed. One was the handling of PL and GL data, which produced incorrect genotype data. The other was the handling of variants that do not have PL data. I suspect the latter is why the number of variants differs between your analyses.
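For readers unfamiliar with PL fields, here is a rough sketch (my own illustration, not EPACTS source code) of how phred-scaled genotype likelihoods are conventionally turned into an expected alt-allele dosage; a bug in this kind of decoding would skew every downstream frequency and association statistic:

```python
# Illustration only (not EPACTS code) of the conventional decoding of a
# biallelic VCF PL field: phred-scaled genotype likelihoods for
# [hom-ref, het, hom-alt].

def pl_to_dosage(pl):
    """Convert [PL_ref, PL_het, PL_alt] into an expected alt-allele dosage."""
    likelihoods = [10 ** (-x / 10.0) for x in pl]   # un-phred each value
    total = sum(likelihoods)
    probs = [l / total for l in likelihoods]        # normalize to genotype probabilities
    return probs[1] + 2 * probs[2]                  # E[# alt alleles]

# A confident homozygous-reference call gives a dosage near zero:
print(pl_to_dosage([0, 30, 300]))    # ~0.000999
# A confident homozygous-alternate call gives a dosage near two:
print(pl_to_dosage([300, 30, 0]))    # ~1.999
```

Summing such dosages over samples is one way allele frequencies get computed from likelihood data, which is why a PL-handling bug can corrupt the reported frequencies.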
Hey, good morning,
Thank you so much for the quick look, and sorry for my late response; I was travelling. OK, I will let our cluster team know about the changes. I am curious whether these changes will fix the incorrect frequencies as well?
Thanks again.
Regards, Haider
@jonathonl
Thanks again for your quick response. I just tried the analysis again, but unfortunately the results were the same.
Just to confirm that we are using the right version: I am using this tool on our server, and our team could not find a git tag named 3.4 in the repo. Please see the team's response below.
I installed version 3.3.0 and the latest commit from the development branch of the git repo in a container; I think that the latter is 3.4.0. Unfortunately, there is no git tag named 3.4.0 in the repo, so I can only refer to a specific git commit ID to specify the version. Our version 3.4.0 is commit ID 98d7ff7c0f0831d521324f16671d8fa989800f63 ( https://github.com/statgen/EPACTS/commit/98d7ff7c0f0831d521324f16671d8fa989800f63 )
If we are doing something wrong, can you please point us in the right direction?
I look forward to hearing from you.
Kind Regards, Haider
That is the correct commit id. A tarball can be found at https://github.com/statgen/EPACTS/archive/98d7ff7c0f0831d521324f16671d8fa989800f63.tar.gz. Are you saying that you are getting the same exact results when running b.glrt with this commit as you do when running with v3.4-rc? That would be very unlikely. I would double check that you are using the correct docker image. Maybe run docker images to see the details of your local images and then run docker ps -a to confirm that the container you ran matches the correct image ID.
Also, am I correct in assuming you have PL data in your VCF file?
@jonathonl
Thanks again for your response. I double-checked the versions and re-ran the analysis, but unfortunately got the same results. For your reference and better understanding, I am attaching the log files from runs of the different versions using the same command, along with the recipes by which the different versions of EPACTS were installed in the container. If you want, I can also share the output files from the different versions.
log_3.4.0.txt log_3.3.0.txt log_3.2.6.txt log_3.4rc.txt
recipe_3.3.0.txt recipe-v3.2.6.txt recipe_3.4rc.txt recipe_3.4.0.txt
Lastly, yes, I have PL data in my VCF file. I have attached a chunk of my VCF file for your reference as well.
I look forward to hearing from you.
Kind Regards, Haider
Filtered.txt
I did some tests with the Filtered.txt VCF file you supplied. One problem is that this file has carriage returns, which break parsing in the master branch version (v3.3.0). This may just be an artifact of subsetting, and the CRs may not exist in your original VCF. Another issue is that this VCF does contain multiallelic variants. On the master branch the extra alleles are ignored. On the develop branch, file parsing stops for that chunk, and apparently the error isn't propagated up the call stack, resulting in truncated regions. This is why the number of variants doesn't match.
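For anyone else who hits the carriage-return problem, here is a rough sketch of how to detect and strip CRs from a gzipped VCF (file names are placeholders; note that Python's gzip module reads BGZF-compressed VCFs fine but writes plain gzip, so for an indexable result you would instead recompress with bgzip, e.g. zcat in.vcf.gz | tr -d '\r' | bgzip > out.vcf.gz):

```python
import gzip

def strip_carriage_returns(src, dst):
    """Copy a gzipped text file, dropping CR characters.
    Returns the number of lines that contained a CR.
    Caveat: the output is plain gzip, not BGZF, so tabix cannot
    index it -- recompress with bgzip for real pipeline use."""
    dirty = 0
    with gzip.open(src, "rt", newline="") as fin, \
         gzip.open(dst, "wt", newline="") as fout:
        for line in fin:
            if "\r" in line:
                dirty += 1
                line = line.replace("\r", "")
            fout.write(line)
    return dirty
```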
Can you run the following command on your end? The input files are in the tarball below. The expected results are also in the tarball.
epacts-single --vcf smhaider-nocr-noma.vcf.gz --ped smhaider_dummy_pheno.ped --pheno DISEASE --test b.glrt --anno --out test-out --run 2 --chr 1
Oh, I am so sorry, I forgot that I filtered for multiallelics after producing this VCF; I mixed it up with another VCF file, my bad. I will test it and get back to you as soon as possible. Our cluster is currently down for maintenance; as soon as it is up again, I will test it and get back to you.
Thanks again for all your assistance and help; I really appreciate it.
Kind Regards, Haider
@jonathonl
The link you provided is broken; can you please share it again?
https://github.com/statgen/EPACTS/archive/98d7ff7c0f0831d521324f16671d8fa989800f63
I am unable to access this link.
Kind Regards, Haider
It just needs a .tar.gz at the end: https://github.com/statgen/EPACTS/archive/98d7ff7c0f0831d521324f16671d8fa989800f63.tar.gz
Thanks again, will get back to you soon :)
@jonathonl
I hope you are well. I have just tested with the material you provided, and the results match your output. Does that mean I just need to remove multiallelics from my VCF and I will be good to go?
Secondly, I tested your files with two versions, 3.3.0 and 3.4.0. With 3.4.0 I got the same results as yours, but with 3.3.0 the results were different; please see both results in the attachment. I am wondering why they differ.
I also tested my real VCF with versions 3.3.2 and 3.2.6; the results from both versions match and seem to be correct as well. So I just need your suggestion: should I modify my VCF to remove the multiallelics, or should I simply keep using 3.3.2? The commit ID used for 3.3.2 was "a5209db1b3929c4dd2f15f27ea085edf3a634ee7".
Lastly, I am mainly interested in q.emmax testing, for which I calculated the kinship matrix. While running it through 3.2.6 and 3.3.2 I got some warnings during the run; there wasn't any error, but I am not sure whether these warnings could have affected the results, so I was wondering if you could have a look at the run. I would really appreciate that. I have attached the log file as well.
Thank you again for all your help and assistance.
I look forward to hearing from you.
Kind Regards, Haider
The versions are admittedly a little wonky, and I will eventually clean that up. In the meantime, let me make some clarifications. According to recipe_3.3.0.txt, what you are calling v3.3.0 is the same as the git tag v1.0, which is the same as commit a5209db1b3929c4dd2f15f27ea085edf3a634ee7, which you also said was v3.3.2. So v3.3.0, v3.3.2, v1.0 and the current HEAD of master are all the same since they all represent the commit a5209db1b3929c4dd2f15f27ea085edf3a634ee7. For the sake of simplicity, I'm going to call this commit master.
When I run the test files in the tarball I provided above, I get the same exact results between what is on master (a5209db1b3929c4dd2f15f27ea085edf3a634ee7) and what is on develop (98d7ff7c0f0831d521324f16671d8fa989800f63). These results also match what you provided in test-out-v3.4.0.epacts.gz.
I would recommend running bcftools norm -m- <input.vcf.gz> to split the multiallelic records into biallelics. This will give you more complete results for all versions.
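To illustrate what that normalization does, here is a toy sketch of the splitting step only; the real bcftools norm -m- additionally recodes per-sample fields such as GT and PL and can left-align indels, all of which this simplification omits:

```python
# Simplified sketch of what `bcftools norm -m-` does to a multiallelic
# VCF record: emit one biallelic line per ALT allele. The real tool
# also rewrites the per-sample GT/PL/AD fields for each split record.
record = "1\t12345\t.\tTC\tCC,GC,T\t50\tPASS\t."   # hypothetical record

fields = record.split("\t")
split_records = []
for alt in fields[4].split(","):      # one output record per ALT allele
    out = fields[:]                   # copy the shared columns
    out[4] = alt                      # keep only this ALT
    split_records.append("\t".join(out))

for rec in split_records:
    print(rec)
```

After this kind of normalization, every record carries a single ALT allele, so versions that mishandle extra alleles no longer have anything to trip over.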
I would stick with the version that is on master (a5209db1b3929c4dd2f15f27ea085edf3a634ee7). There are some issues still being worked out on the develop branch, and it is not entirely stable.
As for your kinship error log, I suspect that your dataset is sparse (chip data, maybe) and that some of the regions being processed do not include any variants above the minMAF threshold. You can confirm this by looking at those regions with bcftools.
@jonathonl
Thank you so much again. Everything is clear, and I really appreciate your help.
Kind Regards, Haider
@jonathonl
I hope you are doing well. Sorry for bothering you again; I have a quick question. I ran the q.emmax test on my LOF variants using commit master. It seems to run fine so far without any errors, and the Makefile is updated from time to time, but the analysis has now been running on the server for about 5 days, so I was wondering whether that is normal or something is fishy. The documentation states that it is a slow analysis, but I am curious whether taking this much time is normal.
P.S. I am running the analysis with 50 cores and 200 GB of memory.
I look forward to hearing from you.
Kind Regards, Haider
I would expect that to be normal if you have tens of thousands of samples. You can get an idea of progress by looking at the output directory. There will be temporary files for each region that has been run.
OK, thank you again for your response. Just to confirm that I am understanding you correctly: I have 18,125 samples in my VCF file. I made the kinship matrix from it following your standard guidance, and then ran the q.emmax analysis 5 days ago. When I started the analysis, four files were generated instantly in the output folder: 1. annotated.ind, 2. annotated.cov, 3. annotated.Makefile, 4. annotated.phe. But over the last 5 days the total number of files in the output directory hasn't changed, and neither have their sizes, apart from the Makefile, which did grow. No other file has been generated since then.
Secondly, I have been checking the log file of the run constantly; it hasn't been updated in 5 days. Please have a look at the attachment.
Although the analysis is running (no errors so far) and it is consuming memory and CPUs, I am a bit worried about whether this is normal or something is wrong.
I am sorry for bothering you again and again; I really appreciate your help.
Kind Regards, Haider
According to the log, it's stuck on matrix inversion, which I believe should be single threaded. What percentage of CPU and memory is the pEmmax process using?
@jonathonl
Thanks again for your response. I just checked the run a while ago, and this process seems to be working:
/usr/local/bin/pEmmax reml --phenof /data/icg_syed/MIP_CAD/Burden_Test_Analysis/LOF_SVT/Results/SVT.LOF.annotated.phe --covf /data/icg_syed/MIP_CAD/Burden_Test_Analysis/LOF_SVT/Results/SVT.LOF.annotated.cov --kinf /data/icg_syed/MIP_CAD/Burden_Test_Analysis/LOF_SVT/Kinship_Matrix/kinship.kinf --indf /data/icg_syed/MIP_CAD/Burden_Test_Analysis/LOF_SVT/Results/SVT.LOF.annotated.ind --out-eigf /data/icg_syed/MIP_CAD/Burden_Test_Analysis/LOF_SVT/Results/SVT.LOF.annotated.eigR --out-remlf /data/icg_syed/MIP_CAD/Burden_Test_Analysis/LOF_SVT/Results/SVT.LOF.annotated.reml
It constantly uses one core at 100% and 16.9 GB of RAM.
There are two other processes running:
make -f /data/icg_syed/MIP_CAD/Burden_Test_Analysis/LOF_SVT/Results/SVT.LOF.annotated.Makefile -j 4
/usr/bin/perl -w /usr/local/bin/epacts-single --vcf /data/icg_syed/MIP_CAD/Burden_Test_Analysis/LOF_SVT/LOF_annotated.vcf.gz --ped /data/icg_syed/MIP_CAD/Burden_Test_Analysis/LOF_SVT/phenotype.ped -min-maf
They ran for a few seconds at most, and they seem to have started the first, working process.
There is almost no I/O against /data, so it is probably not writing anything to my output folder.
I don't really know whether this is normal or not, so I am looking forward to your advice. It has already been 6 days, and I only scheduled the job for 10 days, so I am a bit worried: the server will cancel my job in 4 days.
Looking forward to hearing from you.
Kind Regards, Haider
6 days does seem too long for the reml step with 18,000 samples. I would still let the job run to see if it finishes before the time limit. In the meantime, I'm curious whether modifying the SVD problem slightly would have an effect. Maybe try adding a few more covariates. To be honest, I'm kind of grasping at straws here.
@jonathonl
Thanks again. The script just stopped, but somehow it didn't calculate anything: the p-value, beta, sbeta columns, etc. are all NAs.
I have also attached the log file and the traceback file for your reference. Could you please have a look at it again and point out the problem?
Below you can also find a few lines of the generated output.
Looking forward to hearing from you.
Kind Regards, Haider
All of the variants in the results file have a MAC below the threshold. I suspect you will have valid p-values for variants with MAC > 3 and MAF > 0.001. Have you looked at the Manhattan plot?
Hi @jonathonl ,
Thanks for your help so far... however, I still do not understand your answer. Firstly, I only set a MAF threshold of minimum MAF > 0.001 but no MAC threshold, as the tool's manual only requires a minMAF cutoff. Secondly, I have 74 variants with an allele count > 3, but there are no p-values for these 74 either (all p-values are shown as NA), and because of this the Manhattan plot is empty. Please have a look at the complete EPACTS result and the Manhattan plot below.
There are default min MAC (3) and min MAF (0.001) thresholds for the epacts single command. You can see them being passed down to pEmmax in the logs. Is this the same set of variants you used for epacts make-kin? That subcommand has a default MAF threshold of 0.01. Common variants are necessary for generating a useful kinship matrix.
@jonathonl
First of all, thank you so much for your response and guidance so far. I am sorry, but I didn't understand why it is better to include common variants in the kinship matrix. I am only interested in rare variants; the VCF file I am using is already filtered for rare variants, and I used the same file to generate the kinship matrix.
Secondly, I want to explain one thing in detail, as maybe I am doing things totally wrong.
The VCF file I am using for the SVT is filtered for rare variants, and it is also annotated. One of my seniors suggested extracting only the LOF variants (frameshift, stop-gain, start-loss, etc.) and running the single-variant analysis (q.emmax) on the LOF variants alone as well as on the complete VCF file, just to see if there is a difference. So that is what I did: I extracted only the LOF variants from my annotated file, zipped it according to the instructions, generated the kinship matrix using the LOF.annotated.vcf.gz file, and ran the single-variant analysis. The files I have shared with you so far are the results of the single-variant test on the LOF variants only. That run took more than 7 days.
After that, just as a test, I ran the same analysis on the complete VCF file containing all the variants. To my surprise, the analysis finished in just 5 hours, and it did generate some results (although I am not sure whether they are alright or not). Unlike with the LOF-only VCF file, I have p-values for a lot of variants when running the analysis on my complete VCF file.
But even in the output for my complete VCF file there are many variants with a MAC greater than 3 that still have no p-values. Strangely, there are also many variants with the same MAC, the same MAF, and the same call rate, yet some have p-values and some only NA. So I am not sure whether the results are reliable. I have attached the output file for the complete VCF for your reference.
Lastly, just to mention: I used the same PED file for both the LOF and complete VCFs.
I am sorry for bothering you again and again, but I really need your assistance, as I am quite confused by my results.
I hope I have managed to explain the scenario; if you need more information, please let me know.
I look forward to hearing from you.
Kind Regards, Haider
Single_Variant_Complete_VCF.xlsx
You should use the kinship matrix generated with the full VCF in both the full and LOF association tests. The kinship matrix is an input to the mixed-model for each variant and is independent of the set of variants being analyzed. With small datasets, one could generate a kinship matrix from a pedigree without even looking at the individuals' genotypes. But for large datasets, we need to infer relatedness, and including common variants allows those inferences to be more accurate. Imagine trying to infer relatedness from a dataset of cousins using only singletons. The result would be a matrix of zeros even though these individuals are highly related.
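The cousins-and-singletons point can be made concrete with a toy GRM calculation (my own sketch, not the epacts make-kin implementation): with shared common variants, the related pair shows an elevated off-diagonal kinship entry, while with singletons alone every pair of individuals looks exactly the same:

```python
import numpy as np

def grm(G):
    """Toy genetic relationship matrix: standardized genotypes crossed.
    Rows are individuals; columns are variants coded 0/1/2 alt copies."""
    p = G.mean(axis=0) / 2.0                      # per-variant alt-allele frequencies
    Z = (G - 2 * p) / np.sqrt(2 * p * (1 - p))    # standardize each variant
    return Z @ Z.T / G.shape[1]

# Common variants: individuals 0 and 1 (the "cousins") share alt alleles.
common = np.array([[1, 2, 1],
                   [1, 2, 0],
                   [0, 0, 1]])
# Singletons: each alt allele is carried by exactly one individual,
# so no allele sharing between individuals is possible.
rare = np.array([[1, 0, 0],
                 [0, 1, 0],
                 [0, 0, 1]])

K_common, K_rare = grm(common), grm(rare)
# K_common[0, 1] stands out above K_common[0, 2], flagging the related
# pair; in K_rare every off-diagonal entry is identical, so the cousins
# are indistinguishable from unrelated individuals.
```

This is why common variants (the make-kin default is MAF > 0.01) are needed for a kinship matrix even when the association test itself targets rare variants.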
5 hours is more in line with the expected time for analyzing ~20,000 samples.
The MAF column is rounded. For example, the actual MAF for the marker 10:82182287_TC/CC,GC,AC,T is 36 / (2 * 17954), which is greater than 0.001. The MAF for 2:216285445_G/A is 36 / (2 * 18002), which is less than 0.001.
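That rounding behavior is easy to check by hand. A small sketch using the two allele counts quoted above (assuming, per the discussion of the defaults, a strict comparison against 0.001):

```python
# Recomputing the unrounded MAFs quoted above. Both display as 0.0010
# in the rounded output column, but only one clears the 0.001 cutoff.
def maf(alt_count, n_samples):
    freq = alt_count / (2.0 * n_samples)   # diploid: 2N alleles total
    return min(freq, 1.0 - freq)           # fold to the minor allele

kept     = maf(36, 17954)   # 10:82182287_TC/CC,GC,AC,T -> 0.0010026... > 0.001
filtered = maf(36, 18002)   # 2:216285445_G/A           -> 0.0009999... < 0.001

print(kept, filtered)
```

So two variants with identical-looking MAF columns can fall on opposite sides of the threshold, which explains rows with the same displayed MAF where one has a p-value and the other is NA.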
@jonathonl
OK, that makes sense; thank you so much for explaining it so nicely. I have just re-run the single-variant test on my raw VCF file. I will also test the LOF and filtered VCFs with the same kinship matrix used for the raw VCF, and will get back to you.
Thanks again for all your help I really appreciate that.
Kind Regards, Haider