varianttools
SKAT errors from the new version
In November 2018 I ran some tests and everything was OK (Python 2.7). However, when running the same commands these days (Python 3.5), we have run into some problems.
The variant tools build in the mdabioinfo/varianttools image is based on Python 3.5.
Running SKAT association in this way:
vtools associate skat_videre_rare_ExACRare astma_onset -m "SKAT --name skat quantitative"
--group_by refGene.name2 -v2 --to_db skat -j8 --force > skat_results_astma_onset.txt
leads to
WARNING: Field sample_size_skat has all missing values
WARNING: No valid integer values has been found for field sample_size_skat
WARNING: Field Q_stats_skat has all missing values
WARNING: No valid float values has been found for field Q_stats_skat
WARNING: Field pvalue_skat has all missing values
WARNING: No valid float values has been found for field pvalue_skat
INFO: Association tests on 632 groups have completed. 632 failed.
INFO: Using annotation DB skat as skat in project asthmaburden.
INFO: Annotation database used to record results of association tests. Created on Wed, 20 Mar 2019 11:24:24
INFO: 632 out of 25360 refGene.refGene.name2 are annotated through annotation database skat
@gaow @jma7 This basically means all tests have failed. Have you seen this before? How can the user know why the tests failed?
Are other association test commands OK? Let me look into this (using some tests from my own sources) to see if I can reproduce it.
The example command from vt_ExomeAssociation seems to be OK. Let me ask for more details.
@BoPeng I can reproduce it with other, non-SKAT tests. Let me wrap up my results and share them with you -- presumably a small yet non-trivial dataset will make testing and fixing easier.
So all tests now fail?
Before I wrap up the test to share with you, here is a preview of the log file:
2019-03-22 11:40:43,192: DEBUG: An ERROR has occurred in process 0 while processing 'ZNF571': group ``/chr19/ZNF571`` does not have a child named ``rownames``
tables.exceptions.NoSuchNodeError: group ``/chr19/ZNF571`` does not have a child named ``rownames``
2019-03-22 11:40:43,194: DEBUG: An ERROR has occurred in process 0 while processing 'ZNF616': group ``/chr19/ZNF616`` does not have a child named ``rownames``
tables.exceptions.NoSuchNodeError: group ``/chr19/ZNF616`` does not have a child named ``rownames``
2019-03-22 11:40:43,196: DEBUG: An ERROR has occurred in process 0 while processing 'ZNF642': group ``/chr1/ZNF642`` does not have a child named ``rownames``
tables.exceptions.NoSuchNodeError: group ``/chr1/ZNF642`` does not have a child named ``rownames``
2019-03-22 11:40:43,198: DEBUG: An ERROR has occurred in process 0 while processing 'ZPBP2': group ``/chr17/ZPBP2`` does not have a child named ``rownames``
tables.exceptions.NoSuchNodeError: group ``/chr17/ZPBP2`` does not have a child named ``rownames``
2019-03-22 11:40:43,199: DEBUG: An ERROR has occurred in process 0 while processing 'ZSWIM1': group ``/chr20/ZSWIM1`` does not have a child named ``rownames``
tables.exceptions.NoSuchNodeError: group ``/chr20/ZSWIM1`` does not have a child named ``rownames``
So, HDF5 issues ...
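The log suggests the association worker is asking a group in the tmp HDF5 file for a `rownames` child that is not there. A minimal pure-Python mock of that lookup pattern (the `NoSuchNodeError` class, dict layout, and `get_node` helper below are stand-ins for PyTables, not the actual varianttools code):

```python
# Pure-Python mock of the PyTables lookup that fails in the log above.
# The real code goes through tables.File.get_node(); this only illustrates
# how a stale tmp*_multi_genes.h5 file produces exactly that message.

class NoSuchNodeError(Exception):
    """Stand-in for tables.exceptions.NoSuchNodeError."""

# A stale cache file: the gene's group exists, but it was built for a
# different run and is missing the expected 'rownames' node.
stale_file = {
    "/chr19/ZNF571": {"data": [0, 1, 2], "colnames": ["NA06984"]},
}

def get_node(h5, group, child):
    try:
        return h5[group][child]
    except KeyError:
        raise NoSuchNodeError(
            f"group ``{group}`` does not have a child named ``{child}``")

try:
    get_node(stale_file, "/chr19/ZNF571", "rownames")
except NoSuchNodeError as e:
    print(e)  # group ``/chr19/ZNF571`` does not have a child named ``rownames``
```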
@jma7 I thought all association tests had passed, right?
@BoPeng here is my test:
https://github.com/gaow/ismb-2018/blob/dev/VAT-ISMB-2018.ipynb
the notebook is in dev branch. All data required are in the repository.
Some external tools are needed to complete the notebook. In the notebook I set them up via export PATH=. I have now uploaded these tools here so you can completely reproduce my results.
You can find the failed association tests in the notebook. I have not completed that notebook due to some bugs in the current version; I will use another ticket to explain in detail.
You might recall I have been relying on this notebook to test out the software, but I never had luck with the 3.0 series.
Hi Gao, sorry for the delay; I have been working on other things and just got some time to switch back.
I have run through your notebook. For Ln[33], I don't know why the numbers you have are all doubled, since the numbers on my local machine are all correct.
It seems the main problem is Ln[58]-Ln[64]. I got the same error message on Ln[58] on the first try; after I added --force to the command line, that command and the rest went through. So there is an issue there which I will check.
@jma7 thanks for looking into this -- very helpful! Not sure what you mean by Cell 33; the vtools phenotype output command in this notebook? The results there look correct to me, so I am not sure what issue you are referring to. In any case, yes, I agree the association test failure is more fundamental, and it might be related to how the temp file works?
Yes, the tmp*multi_gene.h5 files are tied to the specific groups (genes) you want to run the association on, somewhat like the temp DB used by the SQLite version (if I remember correctly). Without --force, it reuses the temp files generated before, which were built for a different group of genes. They have to be regenerated whenever a different group of genes is tested, which is why --force temporarily fixes the problem.
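A proper fix along these lines would be to validate the cached file against the requested groups instead of relying on --force. A hypothetical sketch of that check, with the HDF5 group listing mocked as a set (the function name and variables are invented for illustration, not varianttools API):

```python
def cache_is_valid(cached_groups, requested_groups):
    """Return True only if every requested gene group already exists
    in the cached tmp*_multi_genes.h5 file (mocked here as a set of
    group paths). If any group is missing, the cache is stale."""
    return set(requested_groups) <= set(cached_groups)

# The cached file was built for one gene set ...
cached = {"/chr19/ZNF571", "/chr19/ZNF616"}
# ... but the new association run asks for a different one.
requested = {"/chr1/ZNF642", "/chr19/ZNF571"}

if not cache_is_valid(cached, requested):
    # equivalent to what --force does today: discard and regenerate
    print("regenerating tmp*_multi_genes.h5")
```

With a check like this, a run over a new gene set would regenerate the temp file automatically instead of failing with NoSuchNodeError.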
For cell 33,
sample_name CEU_totalGD10 CEU_numGD10 YRI_totalGD10 YRI_numGD10
NA06984 5548 1698 NA NA
NA06985 3888 1140 NA NA
NA06986 6772 2058 NA NA
But on https://gaow.github.io/ismb-2018/
sample_name CEU_totalGD10 CEU_numGD10 YRI_totalGD10 YRI_numGD10
NA06984 2774 849 NA NA
NA06985 1944 570 NA NA
NA06986 3386 1029 NA NA
I was mainly using the latter for comparison with my local results. Which one is correct?
@jma7 thanks for the clarification on the temp DB. Glad we found the culprit.
For Cell 33 -- the halved result was produced with vtools version 2.7. So I'd take a bold guess that version 3.0 is wrong and is off by a factor of 2? Possibly we do not have a unit test to catch this quantitatively? But good eye to catch this problem! No, I do not know the "truth"; the only tool I have used to look at this data is vtools (versions 2.7 and 3.0). Maybe it is worth looking into?
I tried SKAT using the vt_ExomeAssociation dataset in the documentation; the error message I see with -v2 is
DEBUG: Association test skat failed while processing 'RIF1': is.atomic(x) is not TRUE
Testing for association: 0.1% [> ] 3/3 1.5/s in 00:28:34
DEBUG: Association test skat failed while processing 'NDUFA10': is.atomic(x) is not TRUE
DEBUG: In 'ELOVL1', 2 out of 4 loci will be removed due to having no minor allele or having more than 100.0% missing genotypes
DEBUG: In 'PRKAG3', 4 out of 11 loci will be removed due to having no minor allele or having more than 100.0% missing genotypes
DEBUG: Association test skat failed while processing 'PKLR': is.atomic(x) is not TRUE
DEBUG: Association test skat failed while processing 'CYBRD1': is.atomic(x) is not TRUE
Testing for association: 0.2% [> ] 6/6 1.9/s in 00:21:46
DEBUG: In 'VAMP3', 1 out of 2 loci will be removed due to having no minor allele or having more than 100.0% missing genotypes
DEBUG: Association test skat failed while processing 'NECAP2': is.atomic(x) is not TRUE
Notice the is.atomic(x) is not TRUE message; I am not sure where this is coming from.
Other single-gene, collapsing, aggregation, and weighted tests work fine. However, running the variable thresholds test shows the error below:
DEBUG: In 'RGS13', 1 out of 4 loci will be removed due to having no minor allele or having more than 100.0% missing genotypes
DEBUG: In 'CLSPN', 7 out of 14 loci will be removed due to having no minor allele or having more than 100.0% missing genotypes
DEBUG: Association test VariableThresholdsBt failed while processing 'SPRED2': Operator VariablePermutator raises an exception (Error in iteration 1 inverting Hessian matrix: matrix is singular)
DEBUG: In 'C2orf48', 1 out of 2 loci will be removed due to having no minor allele or having more than 100.0% missing genotypes
DEBUG: Association test VariableThresholdsBt failed while processing 'C2orf48': Operator VariablePermutator raises an exception (Error in iteration 1 inverting Hessian matrix: matrix is singular)
All the association tests use the same genotype info from the *multi_genes.h5 file, so it seems to me the problem is not in the temp h5 file.
These are different from the reported problem, right? It is normal for some genes to fail but not normal for all of them to fail.
This is the same error as you saw, but different from what Gao saw.
@jma7 I think the error @BoPeng initially reported with SKAT and I saw (not using SKAT) are the same (starting from the WARNING message) but are different from what you reported here for SKAT. The message:
is.atomic(x) is not TRUE message
is from R. Atomic data in R are numbers, vectors, and matrices. It may be a type conversion issue, but with HDF5 we can certainly improve the R interface to load directly from the HDF5 file now. We can worry about it later, though, after the other problem (that @BoPeng initially reported) is resolved -- at this point I suggest we try to get the non-R-based methods working.
At least we can check the difference between vtools 2 and 3.
So the problem is in converting the Python object to an R object. Since the type of NaN is float in Python, the genotypes are currently saved as numpy.float64 in the HDF5 file. When genotypes are sent for calculation, the numpy.float64 values are converted to a list, and is.atomic(list()) evaluates to FALSE in R. It might be too much work to change the format of the HDF5 file; it is more practical to convert numpy.float64 to integers before sending genotypes to R for the association calculation. I think this is the cause of the error Bo and I observed.
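If this diagnosis is right, the fix is a cast on the Python side before the genotypes cross into R. A minimal sketch of such a conversion (the function name `prepare_for_r` and the `missing_code` recoding are my own illustration, not the varianttools code; integers need a sentinel since int has no NaN):

```python
import math

# A genotype row as it comes out of the HDF5 file: the values behave like
# numpy.float64, with NaN marking a missing call.
raw_row = [0.0, 1.0, float("nan"), 2.0]

def prepare_for_r(row, missing_code=-9):
    """Cast float-typed genotype values to plain Python ints so the
    R bridge can build an atomic integer vector rather than an R list
    (is.atomic(list()) is FALSE, which triggers the error above).
    Missing genotypes are recoded because int cannot represent NaN."""
    return [missing_code if math.isnan(x) else int(x) for x in row]

print(prepare_for_r(raw_row))  # [0, 1, -9, 2]
```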
It seems to me Gao's error is caused by something wrong in the tmp*_multi_genes.h5 files, which are generated before the association calculation. Could you let me know how to reproduce the error?
@jma7 Thanks for tracking down the problem! Unfortunately, however, we cannot use an integer type for genotype matrices, because there may be imputed genotypes. So np.float16 is probably the best data type to use. I cannot recall how we currently handle imputed genotypes (as a field in the genotype database?), but I am just pointing out that temporary genotypes are best stored as np.float16 (as suggested in #99, which also saves disk space).
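As a quick sanity check of both points: np.float16 can still represent NaN (so missingness survives), common imputed dosages such as 0.5 round-trip exactly, and the storage cost is a quarter of float64. A small NumPy demonstration (the matrix below is made-up example data):

```python
import numpy as np

# float64 genotype matrix with one missing call (NaN) and one imputed dosage
g64 = np.array([[0.0, 1.0, np.nan],
                [2.0, 0.5, 1.0]], dtype=np.float64)

g16 = g64.astype(np.float16)

print(g16.nbytes, g64.nbytes)  # 12 48 -- a 4x saving in memory/disk
print(np.isnan(g16[0, 2]))     # True -- missingness is preserved
print(float(g16[1, 1]))        # 0.5  -- the imputed dosage round-trips
```

Note that float16 has limited precision (about 3 decimal digits), which is ample for genotype dosages but would not suit general-purpose numeric storage.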
Could you let me know how to reproduce the error?
See #97 first item in "Failures" -- is that what you are asking?