varianttools
SKAT errors from the new version
In November 2018 I ran some tests and everything was OK (Python 2.7). However, when running the same commands these days (Python 3.5), we have run into some problems.
The variant tools build in the mdabioinfo/varianttools image is based on Python 3.5.
Running SKAT association in this way:
vtools associate skat_videre_rare_ExACRare astma_onset -m "SKAT --name skat quantitative"
--group_by refGene.name2 -v2 --to_db skat -j8 --force > skat_results_astma_onset.txt
leads to
WARNING: Field sample_size_skat has all missing values
WARNING: No valid integer values has been found for field sample_size_skat
WARNING: Field Q_stats_skat has all missing values
WARNING: No valid float values has been found for field Q_stats_skat
WARNING: Field pvalue_skat has all missing values
WARNING: No valid float values has been found for field pvalue_skat
INFO: Association tests on 632 groups have completed. 632 failed.
INFO: Using annotation DB skat as skat in project asthmaburden.
INFO: Annotation database used to record results of association tests. Created on Wed, 20 Mar 2019 11:24:24
INFO: 632 out of 25360 refGene.refGene.name2 are annotated through annotation database skat
@gaow @jma7 This basically means all tests have failed. Have you seen this before? How can the user know why the tests failed?
Are other association test commands OK? Let me look into this (using some tests from my own sources) to see if I can reproduce it.
The example command from vt_ExomeAssociation seems to be OK. Let me ask for more details.
@BoPeng I can reproduce it with other, non-SKAT tests. Let me wrap up my results and share them with you -- presumably a small yet non-trivial dataset will make testing and fixing easier.
So all tests now fail?
Before I wrap up the test to share with you, here is a preview of the log file:
2019-03-22 11:40:43,192: DEBUG: An ERROR has occurred in process 0 while processing 'ZNF571': group ``/chr19/ZNF571`` does not have a child named ``rownames``
tables.exceptions.NoSuchNodeError: group ``/chr19/ZNF571`` does not have a child named ``rownames``
2019-03-22 11:40:43,194: DEBUG: An ERROR has occurred in process 0 while processing 'ZNF616': group ``/chr19/ZNF616`` does not have a child named ``rownames``
tables.exceptions.NoSuchNodeError: group ``/chr19/ZNF616`` does not have a child named ``rownames``
2019-03-22 11:40:43,196: DEBUG: An ERROR has occurred in process 0 while processing 'ZNF642': group ``/chr1/ZNF642`` does not have a child named ``rownames``
tables.exceptions.NoSuchNodeError: group ``/chr1/ZNF642`` does not have a child named ``rownames``
2019-03-22 11:40:43,198: DEBUG: An ERROR has occurred in process 0 while processing 'ZPBP2': group ``/chr17/ZPBP2`` does not have a child named ``rownames``
tables.exceptions.NoSuchNodeError: group ``/chr17/ZPBP2`` does not have a child named ``rownames``
2019-03-22 11:40:43,199: DEBUG: An ERROR has occurred in process 0 while processing 'ZSWIM1': group ``/chr20/ZSWIM1`` does not have a child named ``rownames``
tables.exceptions.NoSuchNodeError: group ``/chr20/ZSWIM1`` does not have a child named ``rownames``
So, HDF5 issues ...
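The log suggests the association worker is asking a group in the tmp HDF5 file for a `rownames` child that is not there. A minimal pure-Python mock of that lookup pattern (the `NoSuchNodeError` class, dict layout, and `get_node` helper below are stand-ins for PyTables, not the actual varianttools code):

```python
# Pure-Python mock of the PyTables lookup that fails in the log above.
# The real code goes through tables.File.get_node(); this only illustrates
# how a stale tmp*_multi_genes.h5 file produces exactly that message.

class NoSuchNodeError(Exception):
    """Stand-in for tables.exceptions.NoSuchNodeError."""

# A stale cache file: the gene's group exists, but it was built for a
# different run and is missing the expected 'rownames' node.
stale_file = {
    "/chr19/ZNF571": {"data": [0, 1, 2], "colnames": ["NA06984"]},
}

def get_node(h5, group, child):
    try:
        return h5[group][child]
    except KeyError:
        raise NoSuchNodeError(
            f"group ``{group}`` does not have a child named ``{child}``")

try:
    get_node(stale_file, "/chr19/ZNF571", "rownames")
except NoSuchNodeError as e:
    print(e)  # group ``/chr19/ZNF571`` does not have a child named ``rownames``
```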
@jma7 I thought all association tests had passed, right?
@BoPeng here is my test:
https://github.com/gaow/ismb-2018/blob/dev/VAT-ISMB-2018.ipynb
the notebook is in dev branch. All data required are in the repository.
Some external tools are needed to complete the notebook. In the notebook I set them up via export PATH=. I have now uploaded these tools here so you can completely reproduce my results.
You can find the failed association tests in the notebook. I have not completed that notebook due to some bugs in the current version; I will use another ticket to explain in detail.
You might recall I have been relying on this notebook to test out the software, but I never had luck with the 3.0 series.
Hi Gao, sorry for the delay; I have been working on other things and just got some time to switch back.
I have run through your notebook. For Ln[33], I don't know why the numbers you have are all doubled, since the numbers on my local machine are all correct.
It seems the main problem is Ln[58]-Ln[64]. I got the same error message on Ln[58] on the first try; after I added --force to the command line, that command and the rest went through. So there is an issue there which I will check.
@jma7 thanks for looking into this -- very helpful! Not sure what you mean by Cell 33; the vtools phenotype output command in this notebook? The results there look correct to me, so I am not sure what issue you are referring to. In any case, yes, I agree the association test failure is more fundamental, and it might be related to how the temp file works?
Yes, the tmp*multi_gene.h5 files are tied to the specific groups (genes) you want to run the association on, somewhat like the temp DB used by the SQLite version (if I remember correctly). Without --force, it reuses the temp files generated before, which were built for a different group of genes. They have to be regenerated whenever a different group of genes is tested, which is why --force temporarily fixes the problem.
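A proper fix along these lines would be to validate the cached file against the requested groups instead of relying on --force. A hypothetical sketch of that check, with the HDF5 group listing mocked as a set (the function name and variables are invented for illustration, not varianttools API):

```python
def cache_is_valid(cached_groups, requested_groups):
    """Return True only if every requested gene group already exists
    in the cached tmp*_multi_genes.h5 file (mocked here as a set of
    group paths). If any group is missing, the cache is stale."""
    return set(requested_groups) <= set(cached_groups)

# The cached file was built for one gene set ...
cached = {"/chr19/ZNF571", "/chr19/ZNF616"}
# ... but the new association run asks for a different one.
requested = {"/chr1/ZNF642", "/chr19/ZNF571"}

if not cache_is_valid(cached, requested):
    # equivalent to what --force does today: discard and regenerate
    print("regenerating tmp*_multi_genes.h5")
```

With a check like this, a run over a new gene set would regenerate the temp file automatically instead of failing with NoSuchNodeError.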
For cell 33,
sample_name CEU_totalGD10 CEU_numGD10 YRI_totalGD10 YRI_numGD10
NA06984 5548 1698 NA NA
NA06985 3888 1140 NA NA
NA06986 6772 2058 NA NA
But on https://gaow.github.io/ismb-2018/
sample_name CEU_totalGD10 CEU_numGD10 YRI_totalGD10 YRI_numGD10
NA06984 2774 849 NA NA
NA06985 1944 570 NA NA
NA06986 3386 1029 NA NA
I was mainly using the latter for comparison with my local results. Which one is correct?
@jma7 thanks for the clarification on the temp DB. Glad we found the culprit.
For Cell 33 -- the halved result was produced with vtools version 2.7. So I'd take a bold guess that version 3.0 is wrong and is off by a factor of 2? Possibly we do not have a unit test to catch this quantitatively? But good eye to catch this problem! No, I do not know the "truth"; the only tool I have used to look at this data is vtools (versions 2.7 and 3.0). Maybe it is worth looking into?
I tried SKAT using the vt_ExomeAssociation dataset in the documentation; the error message I see with -v2 is
DEBUG: Association test skat failed while processing 'RIF1': is.atomic(x) is not TRUE
Testing for association: 0.1% [> ] 3/3 1.5/s in 00:28:34
DEBUG: Association test skat failed while processing 'NDUFA10': is.atomic(x) is not TRUE
DEBUG: In 'ELOVL1', 2 out of 4 loci will be removed due to having no minor allele or having more than 100.0% missing genotypes
DEBUG: In 'PRKAG3', 4 out of 11 loci will be removed due to having no minor allele or having more than 100.0% missing genotypes
DEBUG: Association test skat failed while processing 'PKLR': is.atomic(x) is not TRUE
DEBUG: Association test skat failed while processing 'CYBRD1': is.atomic(x) is not TRUE
Testing for association: 0.2% [> ] 6/6 1.9/s in 00:21:46
DEBUG: In 'VAMP3', 1 out of 2 loci will be removed due to having no minor allele or having more than 100.0% missing genotypes
DEBUG: Association test skat failed while processing 'NECAP2': is.atomic(x) is not TRUE
Notice the is.atomic(x) is not TRUE message; I am not sure where this is coming from.
Other single-gene, collapsing, aggregation, and weighted tests work fine. However, running the variable thresholds test shows the error below:
DEBUG: In 'RGS13', 1 out of 4 loci will be removed due to having no minor allele or having more than 100.0% missing genotypes
DEBUG: In 'CLSPN', 7 out of 14 loci will be removed due to having no minor allele or having more than 100.0% missing genotypes
DEBUG: Association test VariableThresholdsBt failed while processing 'SPRED2': Operator VariablePermutator raises an exception (Error in iteration 1 inverting Hessian matrix: matrix is singular)
DEBUG: In 'C2orf48', 1 out of 2 loci will be removed due to having no minor allele or having more than 100.0% missing genotypes
DEBUG: Association test VariableThresholdsBt failed while processing 'C2orf48': Operator VariablePermutator raises an exception (Error in iteration 1 inverting Hessian matrix: matrix is singular)
All the association tests use the same genotype info from the *multi_genes.h5 file, so it seems to me the problem is not in the temp h5 file.
These are different from the reported problem, right? It is normal for some genes to fail but not normal for all of them to fail.
This is the same error as you saw, but different from what Gao saw.
@jma7 I think the error @BoPeng initially reported with SKAT and I saw (not using SKAT) are the same (starting from the WARNING message) but are different from what you reported here for SKAT. The message:
is.atomic(x) is not TRUE message
is from R. Atomic data in R are numbers, vectors, and matrices. It may be a type conversion issue, but with HDF5 we can certainly improve the R interface to load directly from the HDF5 file now. We can worry about it later, though, after the other problem (that @BoPeng initially reported) is resolved -- at this point I suggest we try to get the non-R-based methods working.
At least we can check the difference between vtools 2 and 3.
So the problem is in converting the Python object to an R object. Since the type of NaN is float in Python, the genotypes are currently saved as numpy.float64 in the HDF5 file. When genotypes are sent for calculation, the numpy.float64 values are converted to a list, and is.atomic(list()) evaluates to FALSE in R. It might be too much work to change the format of the HDF5 file; it is more practical to convert numpy.float64 to integers before sending genotypes to R for the association calculation. I think this is the cause of the error Bo and I observed.
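If this diagnosis is right, the fix is a cast on the Python side before the genotypes cross into R. A minimal sketch of such a conversion (the function name `prepare_for_r` and the `missing_code` recoding are my own illustration, not the varianttools code; integers need a sentinel since int has no NaN):

```python
import math

# A genotype row as it comes out of the HDF5 file: the values behave like
# numpy.float64, with NaN marking a missing call.
raw_row = [0.0, 1.0, float("nan"), 2.0]

def prepare_for_r(row, missing_code=-9):
    """Cast float-typed genotype values to plain Python ints so the
    R bridge can build an atomic integer vector rather than an R list
    (is.atomic(list()) is FALSE, which triggers the error above).
    Missing genotypes are recoded because int cannot represent NaN."""
    return [missing_code if math.isnan(x) else int(x) for x in row]

print(prepare_for_r(raw_row))  # [0, 1, -9, 2]
```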
It seems to me Gao's error is caused by something wrong in the tmp*_multi_genes.h5 files, which are generated before the association calculation. Could you let me know how to reproduce the error?
@jma7 Thanks for tracking down the problem! Unfortunately, however, we cannot use an integer type for genotype matrices, because there may be imputed genotypes. So np.float16 is probably the best data type to use. I cannot recall how we currently handle imputed genotypes (as a field in the genotype database?), but I am just pointing out that temporary genotypes are best stored as np.float16 (as suggested in #99, which also saves disk space).
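As a quick sanity check of both points: np.float16 can still represent NaN (so missingness survives), common imputed dosages such as 0.5 round-trip exactly, and the storage cost is a quarter of float64. A small NumPy demonstration (the matrix below is made-up example data):

```python
import numpy as np

# float64 genotype matrix with one missing call (NaN) and one imputed dosage
g64 = np.array([[0.0, 1.0, np.nan],
                [2.0, 0.5, 1.0]], dtype=np.float64)

g16 = g64.astype(np.float16)

print(g16.nbytes, g64.nbytes)  # 12 48 -- a 4x saving in memory/disk
print(np.isnan(g16[0, 2]))     # True -- missingness is preserved
print(float(g16[1, 1]))        # 0.5  -- the imputed dosage round-trips
```

Note that float16 has limited precision (about 3 decimal digits), which is ample for genotype dosages but would not suit general-purpose numeric storage.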
Could you let me know how to reproduce the error?
See #97 first item in "Failures" -- is that what you are asking?