xqtl-protocol When to intersect samples among genotype, phenotype and covariates

Open gaow opened this issue 3 years ago • 6 comments

Complication

Genotype and covariates are possibly shared across all studies, while phenotypes are unique to each one. Sometimes the overlap is large, and the few non-overlapping samples are negligible and can be removed at any point in the analysis. Sometimes a phenotype has far fewer samples than are available in the genotype data (as is the case for the data @hsun3163 is currently analyzing).

Preparation

We should create a look-up file of 2 columns:

sample_name_in_pheno(and cov), sample_name_in_geno

that takes only the OVERLAP between these datasets. This will also serve as a sample name matching file if sample names don't agree.

We ask users to provide this file, in case they want to exclude samples for other reasons. Our analysis will focus on these samples when applicable.
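
A minimal sketch of how such a look-up file could be generated when the user does not supply one. The function names, the tab-separated layout, and the optional `name_map` (a dict from phenotype/covariate sample names to genotype sample names) are assumptions for illustration, not part of the existing pipeline.

```python
import csv
import io


def build_lookup(pheno_samples, geno_samples, name_map=None):
    """Return (pheno_name, geno_name) pairs for samples found in both datasets.

    name_map is a hypothetical dict translating a phenotype/covariate sample
    name into its genotype sample name when the two naming schemes differ.
    """
    name_map = name_map or {}
    geno_set = set(geno_samples)
    lookup = []
    for pheno_name in pheno_samples:
        geno_name = name_map.get(pheno_name, pheno_name)
        if geno_name in geno_set:  # keep only the overlap
            lookup.append((pheno_name, geno_name))
    return lookup


def write_lookup(lookup, handle):
    """Write the two-column look-up table (tab-separated is an assumption)."""
    writer = csv.writer(handle, delimiter="\t", lineterminator="\n")
    writer.writerow(["sample_name_in_pheno", "sample_name_in_geno"])
    writer.writerows(lookup)


# Toy sample names, made up for the example.
pheno = ["S1", "S2", "S3_rna"]
geno = ["S1", "S3", "S4", "S5"]
lookup = build_lookup(pheno, geno, name_map={"S3_rna": "S3"})

buf = io.StringIO()
write_lookup(lookup, buf)
print(buf.getvalue())
```

Here S2 is dropped because it has no genotype, and S4 and S5 are dropped because they have no phenotype. Users who want to exclude samples for other reasons can simply delete rows from the resulting file.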

Genotype

Variant-level QC should be based on all samples. We have been doing that with the VCF pipeline, but not yet with the PLINK format input (we do that at the very end).
PCA derived from genotype data is ideally performed for each phenotype separately.

@hsun3163 :

We should take the overlapping samples right after VCF QC and before KING, to generate markers (MAF of 5% or higher, LD pruned) and compute PCA per study. We will then remove outliers based on the PCA results from multiple studies, and we remove them from the full genotype data.
The look-up file may be adjusted for outliers.
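
One way the adjustment for outliers could look, as a sketch only. It assumes that each per-study PCA yields a set of flagged genotype sample names and that a sample flagged in any study is removed everywhere (the union); whether the union is the right rule is still to be decided. The function name and the study labels are hypothetical.

```python
def drop_outliers(lookup, outliers_by_study):
    """Remove look-up rows whose genotype sample was flagged in any study.

    lookup: list of (pheno_name, geno_name) pairs.
    outliers_by_study: dict mapping a study label to an iterable of
        genotype sample names flagged as PCA outliers (assumed input).
    Returns the filtered look-up and the combined set of flagged samples,
    which can also be used to filter the full genotype data.
    """
    flagged = set().union(*outliers_by_study.values())
    kept = [(p, g) for p, g in lookup if g not in flagged]
    return kept, flagged


# Toy data, made up for the example.
lookup = [("S1", "G1"), ("S2", "G2"), ("S3", "G3")]
outliers = {"study_A": {"G2"}, "study_B": {"G2", "G9"}}

kept, flagged = drop_outliers(lookup, outliers)
print(kept)
print(sorted(flagged))
```

The returned `flagged` set includes G9 even though it is not in this study's look-up file, since the removal is meant to apply to the full genotype data.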

Phenotype

Using the look-up file, we remove samples after QC and before normalization and gene data annotation.
In the RNA-seq normalization pipeline, the required sample_lookup_file should be derived from our look-up file (if not used as is).
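
A sketch of the sample removal step on a phenotype matrix, assuming a layout with one leading ID column followed by one column per sample. The layout, the function name, and the choice to rename the kept columns to the genotype sample names are assumptions, not what the normalization pipeline currently does.

```python
def subset_phenotype(header, rows, lookup, id_cols=1):
    """Keep only sample columns listed in the look-up and rename them.

    header: list of column names; the first id_cols are feature identifiers.
    rows: list of lists with the same column order as header.
    lookup: list of (pheno_name, geno_name) pairs.
    """
    pheno_to_geno = dict(lookup)
    keep = [
        i for i, name in enumerate(header)
        if i < id_cols or name in pheno_to_geno
    ]
    new_header = [
        header[i] if i < id_cols else pheno_to_geno[header[i]] for i in keep
    ]
    new_rows = [[row[i] for i in keep] for row in rows]
    return new_header, new_rows


# Toy matrix, made up for the example.
header = ["gene_id", "S1", "S2", "S3_rna"]
rows = [["gene_A", 5, 7, 9], ["gene_B", 0, 2, 4]]
lookup = [("S1", "S1"), ("S3_rna", "S3")]

new_header, new_rows = subset_phenotype(header, rows, lookup)
print(new_header)
print(new_rows)
```

Running this after QC and before normalization means the normalization only ever sees samples that also have genotypes.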

Covariates

The covariate data filter should happen before factor analysis, using the look-up file.
For APEX, we can use this look-up file to create, on the fly, a VCF file with header only: https://github.com/hsun3163/neuro-apex/issues/1#issuecomment-876715665
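
A sketch of what creating a header-only VCF from the look-up file could look like. It writes only the `##fileformat` line and the standard `#CHROM ... FORMAT` column line followed by the sample names. Using the genotype-side names as the sample columns and VCFv4.2 as the version are assumptions, and this is not necessarily what the linked comment does.

```python
import io

# The nine fixed columns that precede sample columns in a VCF header line.
VCF_FIXED_COLUMNS = [
    "#CHROM", "POS", "ID", "REF", "ALT", "QUAL", "FILTER", "INFO", "FORMAT",
]


def write_header_only_vcf(lookup, handle):
    """Write a VCF containing a header and no variant records.

    lookup: list of (pheno_name, geno_name) pairs; the genotype names
    are used as the VCF sample columns (an assumption).
    """
    handle.write("##fileformat=VCFv4.2\n")
    samples = [geno_name for _, geno_name in lookup]
    handle.write("\t".join(VCF_FIXED_COLUMNS + samples) + "\n")


# Toy look-up, made up for the example.
lookup = [("S1", "S1"), ("S3_rna", "S3")]
vcf = io.StringIO()
write_header_only_vcf(lookup, vcf)
print(vcf.getvalue())
```

Because the file carries only sample names, it can be regenerated cheaply whenever the look-up file changes, for instance after outlier removal.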

@hsun3163 Let me know what you think I'm missing.

Feb 10 '22 17:02 gaow
