rvtests icon indicating copy to clipboard operation
rvtests copied to clipboard

WARN: Failed to load spectral decomposition results of the kinship matrix (.xHemi.eigen file is incorrectly reused across independent runs of rvtests)

Open dpanyard opened this issue 4 years ago • 13 comments

Hi, I've encountered a warning and error in rvtests log when I try to conduct a GWAS. My code looks like this:

./software/rvtests/v2.1.0/executable/rvtest \
    --inVcf ./derived_data/genetic_data/30_chr12.imputed.poly.vcf.gz \
    --dosage DS \
    --kinship ./derived_data/genetic_data/32_kinship_matrix.kinship \
    --xHemiKinship ./derived_data/genetic_data/32_kinship_matrix.xHemi.kinship \
    --pheno ./derived_data/genetic_data/32_chrAll_phenotypes.txt \
    --pheno-name pheno48428 \
    --meta score \
    --out ./derived_data/x_temp_gwas_test_run

In my log, I see the following:

[INFO] Family-based model specified. Loading kinship file... [INFO] DONE: Loaded kinship file [ ./derived_data/genetic_data/32_kinship_matrix.kinship ] successfully in [ 0.5 ] seconds. [INFO] DONE: Spectral decomposition of the kinship matrix succeeded in [ 1.0 ] seconds. [INFO] Kinship eigen-decomposition file [ .xHemi.eigen ] detected and will be used for for X chromosome analysis [INFO] DONE: Loaded kinship file [ ./derived_data/genetic_data/32_kinship_matrix.xHemi.kinship ] successfully in [ 0.4 ] seconds. [ERROR] Unexpected column 'u997'! [WARN] Failed to load spectral decomposition results of the kinship matrix! [INFO] DONE: Spectral decomposition of the kinship matrix succeeded in [ 1.1 ] seconds. [INFO] Impute missing genotype to mean (by default) [INFO] Use dosage genotype from VCF flag DS. [INFO] Analysis started [INFO] Analyzed [ 830117 ] variants [INFO] Analysis ends at: Wed May 27 16:26:39 2020 [INFO] Analysis took 1005 seconds

I've looked at the .kinship and .xHemi.kinship files and do not see the string "u997" listed anywhere. What's strange is that I'm running this same code on loop for many different phenotypes, and sometimes rvtests runs (using the same kinship and xHemi.kinship files) without this warning and error appearing.

Any idea what might be going on?

dpanyard avatar May 27 '20 21:05 dpanyard

Can you check if this sample is in the vcf file?

Sent from my iPhone

On May 27, 2020, at 4:49 PM, Daniel Panyard [email protected] wrote:

 Hi, I've encountered a warning and error in rvtests log when I try to conduct a GWAS. My code looks like this:

./software/rvtests/v2.1.0/executable/rvtest
--inVcf ./derived_data/genetic_data/30_chr12.imputed.poly.vcf.gz
--dosage DS
--kinship ./derived_data/genetic_data/32_kinship_matrix.kinship
--xHemiKinship ./derived_data/genetic_data/32_kinship_matrix.xHemi.kinship
--pheno ./derived_data/genetic_data/32_chrAll_phenotypes.txt
--pheno-name pheno48428
--meta score
--out ./derived_data/x_temp_gwas_test_run In my log, I see the following:

[INFO] Family-based model specified. Loading kinship file... [INFO] DONE: Loaded kinship file [ ./derived_data/genetic_data/32_kinship_matrix.kinship ] successfully in [ 0.5 ] seconds. [INFO] DONE: Spectral decomposition of the kinship matrix succeeded in [ 1.0 ] seconds. [INFO] Kinship eigen-decomposition file [ .xHemi.eigen ] detected and will be used for for X chromosome analysis [INFO] DONE: Loaded kinship file [ ./derived_data/genetic_data/32_kinship_matrix.xHemi.kinship ] successfully in [ 0.4 ] seconds. [ERROR] Unexpected column 'u997'! [WARN] Failed to load spectral decomposition results of the kinship matrix! [INFO] DONE: Spectral decomposition of the kinship matrix succeeded in [ 1.1 ] seconds. [INFO] Impute missing genotype to mean (by default) [INFO] Use dosage genotype from VCF flag DS. [INFO] Analysis started [INFO] Analyzed [ 830117 ] variants [INFO] Analysis ends at: Wed May 27 16:26:39 2020 [INFO] Analysis took 1005 seconds

I've looked at the .kinship and .xHemi.kinship files and do not see the string "u997" listed anywhere. What's strange is that I'm running this same code on loop for many different phenotypes, and sometimes rvtests runs (using the same kinship and xHemi.kinship files) without this warning and error appearing.

Any idea what might be going on?

— You are receiving this because you are subscribed to this thread. Reply to this email directly, view it on GitHub, or unsubscribe.

zhanxw avatar May 28 '20 01:05 zhanxw

Hi, I'm not sure you're referring to by "this sample", but if you mean the string "u997", then that string is not an identifier for a sample in this data set and neither is the string "997". I used... grep u997 and confirmed that the string "u997" is not found in the vcf file, .kinship, or .xHemi.kinship files. I'm really not sure where that string could be coming from.

dpanyard avatar May 28 '20 18:05 dpanyard

I think I've noticed a pattern with these errors. The "unexpected column" will have a number of one more than the number of samples with valid phenotypes in the analysis. So, for example, if I have a data set with 1000 samples, but the given phenotype is only present for 850 samples, then the error will read "unexpected column 'u851'".

The broader pipeline I'm using is that I have a total genetic data set with about 1,000 samples, and I'm running rvtests multiple times on this same sample, each time with a different phenotype. However, some of the phenotypes I use have missing values for some of the samples. Thus, there are many cases where I want to run rvtests using a kinship matrix calculated from 1,000 samples but for a phenotype where only, say, 800 samples actually have a valid phenotype. It seems like it's in these cases that I'm getting the error and warning.

Since rvtests continues to run the analysis anyway, I guess the question is whether rvtests is able to handle the mismatch between the samples present in the kinship matrix and the samples with valid phenotypes. Do you think this might be the source of the error?

dpanyard avatar Jun 12 '20 01:06 dpanyard

I discovered today that if I run rvtests but without including the xHemi.kinship file, the error doesn't occur (but of course then it skips the X chromosome for the analysis). So, potentially there's something wrong with the xHemi.kinship file.

dpanyard avatar Jun 17 '20 21:06 dpanyard

A couple more things I've tried: I tried using a tab-delimited phenotype file instead of space-delimited, but that did not prevent the error.

I tried generating and using a kinship file that included only the samples that had nonmissing values for the outcome where the error occurs, but that also did not prevent the error.

dpanyard avatar Jun 18 '20 20:06 dpanyard

The error seems to occur during KinshipHolder.cpp, line 358

Looks like the "u997" column must be coming from the eigen file (this->eigenFileName). I just (re)discovered that there's a hidden file called .xHemi.eigen, generated on 3/19/2020 sitting in the top-level project directory from where I run my code, including rvtest, but that would mean this file is really out of date from my testing this month in June. This directory is the place from which I run all code. There's no eigen file there for the autosomes from what I see.

This xHemi eigen file looks to contain an 1,118 x 1,118 matrix of values. The rows are stored with the actual sample ID, but the columns contains sequentially numbered strings like U1, U2, U3..., with the U there for an unknown reason. This at least explains the letter "u" in the error log, I assume.

And now I'm seeing the line in the log file, before the error, that says this hidden eigen-decomposition file [.xHemi.eigen] was detected and used: [INFO] Kinship eigen-decomposition file [ .xHemi.eigen ] detected and will be used for for X chromosome analysis [INFO] DONE: Loaded kinship file [ ./derived_data/wrap_genetic_data/32_kinship_matrix.xHemi.kinship ] successfully in [ 0.4 ] seconds. [ERROR] Unexpected column 'u803'! [WARN] Failed to load spectral decomposition results of the kinship matrix! [INFO] DONE: Spectral decomposition of the kinship matrix succeeded in [ 0.7 ] seconds.

There's no corresponding .eigen file for the autosomes, apparently.

If I temporarily rename that old .xHemi.eigen file, that should make it so rvtests can't detect it....

Now the code seems to be running without an error. Here's the log: [INFO] Analysis begins with [ 996 ] samples... [INFO] Family-based model specified. Loading kinship file... [INFO] DONE: Loaded kinship file [ ./derived_data/wrap_genetic_data/archive/troubleshooting_rvtest _error/met48428_matrix.kinship ] successfully in [ 0.6 ] seconds. [INFO] DONE: Spectral decomposition of the kinship matrix succeeded in [ 1.8 ] seconds. [INFO] Kinship file [ ./derived_data/wrap_genetic_data/archive/troubleshooting_rvtest_error/met484 28_matrix.xHemi.kinship ] detected and will be used for for X chromosome analysis [INFO] DONE: Loaded kinship file [ ./derived_data/wrap_genetic_data/archive/troubleshooting_rvtest _error/met48428_matrix.xHemi.kinship ] successfully in [ 0.7 ] seconds. [INFO] DONE: Spectral decomposition of the kinship matrix succeeded in [ 1.6 ] seconds. [INFO] Impute missing genotype to mean (by default) [INFO] Use dosage genotype from VCF flag DS. [INFO] Analysis started

We can see in DataConsolidator.cpp, line 587, where this file name of .xHemi.eigen is set and loaded.

But a new .xHemi.eigen hidden file has been created in the working directory. Importantly, it appears that this new .eigen file only has 996 "U" columns, presumably one for each of the samples with non-missing phenotypes in this current analysis. That means the .xHemi.eigen file is different for each phenotype when the number of samples is different. When this file is present in the working directory, it gets loaded. This leads to the error/warning when the current .eigen file has stuck around from the last rvtests run but it doesn't correspond to the current data set being analyzed.

If this .xHemi.eigen file is somehow created at each run of rvtests, would that cause issues with running multiple iterations of rvtest at the same time in parallel in the same working directory? Or, is there the possibility that this eigen file does not get erased after the code is run and it is mistakenly used again the next time rvtests is used from that same directory? I'm testing the latter situation now.

dpanyard avatar Jun 18 '20 21:06 dpanyard

It looks like the .xHemi.eigen file is not cleaned up after rvtests runs.

dpanyard avatar Jun 18 '20 22:06 dpanyard

Now, if I'm understanding what's happening correctly, I think this issue might be benign if, when rvtests detects any sort of unexpected problem with the loading of .xHemi.eigen, it just forgets about what's currently in the file and recalculates the kinship eigen-decomposition manually. However, I could imagine a niche case where someone runs rvtests with a kinship matrix for, say, 457 samples, and then they realize they need to change how they calculate that kinship matrix in some way, so they recreate it and then run rvtests again. In this situation, the old .xHemi.eigen file would be loaded and it might have the wrong values. Unless rvtests is able to identify this situation, rvtests would incorrectly use the old eigen-decomposition values instead of recalculating them.

More generally, if there is any gap in rvtests ability to detect situations when the existing .xHemi.eigen file does not exactly correspond to the current X chromosome kinship file, then rvtests could be using incorrect values.

I could also see the potential for issues when someone runs rvtests multiple times in parallel, because all of those processes would be trying to use the .xHemi.eigen file at the same time, maybe even trying to write to the same file at once if the file didn't exist at the beginning.

dpanyard avatar Jun 18 '20 22:06 dpanyard

So, some potential changes to help mitigate this issue:

  • If reasonable, don't save and use the .xHemi.eigen file at all--just calculate the kinship eigen-decomposition for the X chromosome each time. It seems like this is already the case with the autosome kinship eigen-decomposition, so it might be reasonable to do the same for the X chromosome. This fix would prevent the possibility of any issues due to running scripts in parallel.

  • If the above isn't possible, then use a run-specific file name instead of .xHemi.eigen. I imagine you could arbitrarily pick a number or use the timestamp plus the process ID or something in the file name. That way, each run of rvtests uses a different file to prevent the issues described above.

  • Don't make .xHemi.eigen hidden. Allow users to see the file and make it something they can input directly into rvtests as an optional argument. Whatever benefit there is to having the file would be kept, but it would no longer be hidden and potentially unintentionally loaded. Users would be able to control how the reuse of this .eigen file occurs and thus ensure that it's used properly.

  • Delete the .xHemi.eigen file after rvtests finishes running. This is not an ideal solution. It loses any benefit of keeping the file there, and it could still leave the file behind if a user manually kills an rvtests process, but at least it would prevent unintentional issues.

dpanyard avatar Jun 18 '20 22:06 dpanyard

For myself (and others who might run into this), it seems like a potential workaround could be to include a line of code in a shell script that would manually remove the .xHemi.eigen file after each run of rvtests. If running in parallel, some care might be needed to ensure the file isn't removed while another process is trying to read it, but for my purposes, I think I'll be able to run all of the chromosome-specific tests for one outcome in parallel, then remove .xHemi.eigen, and then continue on with the next outcome's analyses.

dpanyard avatar Jun 18 '20 22:06 dpanyard

This is great investigative work @dpanyard !

agilly avatar Dec 09 '20 03:12 agilly

This is great investigative work @dpanyard !

Thanks! Glad someone else read my detective-novel-of-a-post here. Out of curiosity, have you encountered this warning message as well? I haven't heard of any official solution to this issue yet.

dpanyard avatar Dec 09 '20 17:12 dpanyard

Yes, one of our analysts is trying out RVtests and is having a really hard time understanding what it's doing and working her way through these warnings... For now I recommended to use a different directory for each analysis which should get rid of at least this problem :)

agilly avatar Dec 17 '20 09:12 agilly