Atlas-CNV
Requesting information on how to create sample and panel files.
Where can I find more information on how to create the panel and sample files?
I went through the paper and it says
- (1) Genome Analysis Toolkit (GATK) DoC interval summary files
- (2) a panel design containing target exons, and
- (3) a sample file with gender and/or midpool groupings.
The first file is not mentioned in the GitHub README.
I am really confused. Please help.
(1) We use version 3 of the GATK software from the Broad Institute to compute the Depth of Coverage on a given BAM file.
@theodorc
I am looking at the usage doc. Where is the GATK v3 file used as input?
Additionally, how do I create the panel file?
Exon_Target Gene_Exon Call_CNV RefSeq
1:1220087-1220186 SNP_1 N rs2144440
1:3083663-3083762 SNP_2 N rs2651899
1:3611843-3611942 SNP_3 N rs3765731
1:6279321-6279420 RNF207-001_18 N rs846111
1:8487274-8487373 SNP_4 N rs301797
1:11850737-11850955 MTHFR-001_11 Y NM_005957_cds_0
1:11851264-11851383 MTHFR-001_10 Y NM_005957_cds_1
1:11852335-11852436 MTHFR-001_9 Y NM_005957_cds_2
1:11853964-11854146 MTHFR-001_8 Y NM_005957_cds_3
What does the Gene_Exon column contain: SNP IDs or gene/exon IDs? Also, does the "RefSeq" column contain dbSNP rs IDs? Is that correct? I also see NM IDs (transcript IDs).
And finally, the Call_CNV column contains yes/no (Y/N) values. How do I make that decision?
Sorry for the late response. I hope the comments below help.
- For GATK, see the config file. It has a variable that specifies the directory (and file name format) where your GATK Depth of Coverage files are located: GATKDIR=GATK_DoC/[SAMPLE_FCLBC].DATA.sample_interval_summary
- You create the panel file yourself in your favorite editor. It is usually based on the capture design you used for sequencing. For example, a cancer panel will contain cancer genes and their exon target coordinates.
- The Gene_Exon column is the name of the target exon. In the example, I used the gene name MTHFR, -001 for the transcript ID, and _11 for the exon. The same idea applies to the RefSeq column.
- Finally, Call_CNV designates whether you want to include the given target in the analysis. Usually you set it to N if you know the target is not reliable (i.e., the target is too small or its data is known to be noisy).
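If you would rather generate the panel file from your capture design than type it by hand, a minimal Python sketch along these lines may help. It is not part of Atlas-CNV. The function names, the tab delimiter, the inclusive-coordinate length calculation, and the `MIN_TARGET_BP` threshold are all assumptions made for illustration, so check them against the example panel file shipped with the tool before relying on the output.

```python
import csv
import io

# Column names taken from the example panel file in this thread.
HEADER = ["Exon_Target", "Gene_Exon", "Call_CNV", "RefSeq"]

# Hypothetical cutoff: targets shorter than this are flagged N.
# Atlas-CNV does not define this value; pick one that suits your data.
MIN_TARGET_BP = 50


def panel_row(chrom, start, end, gene_exon, refseq, reliable=True):
    """Build one panel row; Call_CNV is N for small or unreliable targets."""
    length = end - start + 1  # assumes inclusive coordinates
    call = "Y" if reliable and length >= MIN_TARGET_BP else "N"
    return [f"{chrom}:{start}-{end}", gene_exon, call, refseq]


def write_panel(targets, handle):
    """Write the header and one row per target (tab-delimited, an assumption)."""
    writer = csv.writer(handle, delimiter="\t", lineterminator="\n")
    writer.writerow(HEADER)
    for target in targets:
        writer.writerow(panel_row(*target))


# Two targets copied from the example above; the last field marks whether
# you consider the target reliable enough for CNV calling.
targets = [
    ("1", 1220087, 1220186, "SNP_1", "rs2144440", False),
    ("1", 11850737, 11850955, "MTHFR-001_11", "NM_005957_cds_0", True),
]

buf = io.StringIO()
write_panel(targets, buf)
panel_text = buf.getvalue()
print(panel_text)
```

To write a real file, pass an open file handle to `write_panel` instead of the `StringIO` buffer, for example `open("my.panel", "w", newline="")`, where the file name is only a placeholder.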