xqtl-protocol
When to intersect samples among genotype, phenotype and covariates
Complication
Genotype and covariate data may be shared across all studies, whereas phenotypes are unique to each one. Sometimes the overlap is large, and the few non-overlapping samples are negligible and can be removed at any point in the analysis. In other cases a phenotype has far fewer samples than are available in the genotype data (as is the case for the data @hsun3163 is currently analyzing).
Preparation
We should create a look-up file of 2 columns:
sample_name_in_pheno(and cov), sample_name_in_geno
that takes only the OVERLAP between these data sets. This will also serve as a sample-name matching file if sample names don't agree.
We ask users to provide this, in case they want to exclude samples for other reasons. Our analysis will focus on these samples when applicable.
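As a starting point for users who do not yet have such a file, here is a minimal sketch of how the overlap could be computed. The function names, the tab-separated layout, and the optional `name_map` (phenotype name to genotype name) are assumptions for illustration, not part of the protocol.

```python
import csv
import io

def build_lookup(pheno_samples, geno_samples, name_map=None):
    """Return (pheno_name, geno_name) pairs for samples found in both data sets.

    name_map is a hypothetical dict translating phenotype/covariate sample
    names to genotype sample names; names absent from it are assumed to
    already match.
    """
    name_map = name_map or {}
    geno_set = set(geno_samples)
    pairs = []
    for pheno_name in pheno_samples:
        geno_name = name_map.get(pheno_name, pheno_name)
        if geno_name in geno_set:  # keep only the overlap
            pairs.append((pheno_name, geno_name))
    return pairs

def write_lookup(pairs, handle):
    """Write the two-column look-up table (tab-separated, with a header row)."""
    writer = csv.writer(handle, delimiter="\t", lineterminator="\n")
    writer.writerow(["sample_name_in_pheno", "sample_name_in_geno"])
    writer.writerows(pairs)

# Toy example: S2 has no genotype, S4 has no phenotype, S3 is named differently.
pheno = ["S1", "S2", "S3_rna"]
geno = ["S1", "S3", "S4"]
pairs = build_lookup(pheno, geno, name_map={"S3_rna": "S3"})

buf = io.StringIO()
write_lookup(pairs, buf)
lookup_text = buf.getvalue()
```

A user-supplied file would simply replace the output of `build_lookup`, which keeps room for excluding samples for other reasons.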
Genotype
- Variant level QC should be based on all samples -- we have been doing that with the VCF pipeline but not yet the PLINK format input (we do that at the very end).
- PCA derived from genotype data is ideally performed on each phenotype separately
@hsun3163:
- We should take overlapping samples right after VCF QC and before KING, to generate markers (MAF 5%+, LD pruned) and compute PCA per study. We will then remove outliers based on PCA results from multiple studies. We remove them on the full genotype data.
- The look-up file may be adjusted for outliers.
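One way the look-up adjustment could work is sketched below. It assumes outliers have already been called per study and that a sample flagged in any study is dropped (a union); the thread does not say how outliers from multiple studies are combined, so treat that rule and the names `drop_outliers` and `outliers_by_study` as hypothetical.

```python
def drop_outliers(lookup_pairs, outliers_by_study):
    """Remove look-up rows whose genotype sample was flagged as a PCA outlier.

    outliers_by_study maps a study name to the genotype sample names flagged
    in that study's PCA. Assumption: flagged in ANY study means removed.
    """
    flagged = set()
    for samples in outliers_by_study.values():
        flagged.update(samples)
    return [(pheno, geno) for pheno, geno in lookup_pairs if geno not in flagged]

lookup = [("S1", "S1"), ("S3_rna", "S3"), ("S5", "S5")]
outliers = {"study_A": ["S3"], "study_B": ["S9"]}
adjusted = drop_outliers(lookup, outliers)
```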
Phenotype
- Using the look-up file, we remove samples after QC and before normalization and gene data annotation
- In the RNA-seq normalization pipeline, the required sample_lookup_file should be derived from our look-up file (if not used as is)
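The sample removal step could look roughly like the sketch below. It assumes a phenotype table with one leading identifier column followed by one column per sample, and it renames the kept columns to the genotype sample names; both the layout and the renaming are illustrative assumptions, and `subset_columns` is a hypothetical helper.

```python
def subset_columns(header, rows, lookup_pairs, id_cols=1):
    """Keep only the sample columns listed in the look-up table.

    The first id_cols columns (e.g. a gene ID) are always kept. Kept sample
    columns are renamed from the phenotype name to the genotype name.
    """
    rename = dict(lookup_pairs)  # pheno name -> geno name
    keep = [i for i, name in enumerate(header) if i < id_cols or name in rename]
    new_header = [header[i] if i < id_cols else rename[header[i]] for i in keep]
    new_rows = [[row[i] for i in keep] for row in rows]
    return new_header, new_rows

header = ["gene_id", "S1", "S2", "S3_rna"]
rows = [["ENSG1", "5", "7", "9"], ["ENSG2", "0", "1", "2"]]
lookup = [("S1", "S1"), ("S3_rna", "S3")]
new_header, new_rows = subset_columns(header, rows, lookup)
```

The same helper would apply to a covariate table with samples in columns.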
Covariates
- Covariate data filtering should happen before factor analysis, using the look-up file
- For APEX, we can use this look-up file to create a header-only VCF file on the fly; see https://github.com/hsun3163/neuro-apex/issues/1#issuecomment-876715665
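A header-only VCF could be generated from the look-up file along these lines. This sketch writes only the `##fileformat` line and the column header row with the genotype sample names; which header lines APEX actually needs is not stated in this thread, so the content and the function name `header_only_vcf` are assumptions.

```python
def header_only_vcf(lookup_pairs):
    """Return the text of a VCF that has a header but no variant records."""
    samples = [geno for _, geno in lookup_pairs]
    fixed = ["#CHROM", "POS", "ID", "REF", "ALT", "QUAL", "FILTER", "INFO", "FORMAT"]
    return "##fileformat=VCFv4.2\n" + "\t".join(fixed + samples) + "\n"

lookup = [("S1", "S1"), ("S3_rna", "S3")]
vcf_text = header_only_vcf(lookup)
```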
@hsun3163 Let me know what you think I'm missing