gemBS icon indicating copy to clipboard operation
gemBS copied to clipboard

SNP extraction

Open IsmailM opened this issue 5 years ago • 2 comments

When initally running gemBS extract, I do not see any variants in the _snps.txt.gz files.

The gemBS config refers to the following Reference and DBSNP:

...
reference = ${root}/ref/human_g1k_v37.fasta.gz
dbSNP_files = ${root}/ref/dbsnp/GCF_000001405.25.gz
dbSNP_index = ${root}/ref/dbsnp/dbsnp.index
...

These were downloaded as follows:

# hs37-1kg GRCh37
curl -LO ftp://ftp-trace.ncbi.nih.gov/1000genomes/ftp/technical/reference/human_g1k_v37.fasta.gz

# DBSNP - dbSNP 2.0 Build 153 (downloaded on 24th July 2020)
curl -LO https://ftp.ncbi.nih.gov/snp/redesign/latest_release/VCF/GCF_000001405.25.gz

All commands completed with no errors but as mentioned above, I do not see any snps in the output files.

However, I noticed that the reference I used uses chromosome numbers while dbsnp uses the chromeome accession numbers (i.e. NC_000001.10 instead of 1.

As such, I ran the following to rename the chromosome column in the dbsnp vcf:

ASSEMBLY_REPORT=GCF_000001405.25_GRCh37.p13_assembly_report.txt;
curl -L ftp://ftp.ncbi.nlm.nih.gov/genomes/all/GCF/000/001/405/GCF_000001405.25_GRCh37.p13/GCF_000001405.25_GRCh37.p13_assembly_report.txt -o ${ASSEMBLY_REPORT};

grep -e '^[^#]' ${ASSEMBLY_REPORT} | awk '{ print $7, $1 }' > ${ASSEMBLY_REPORT}.chr_names;
bcftools annotate --rename-chrs ${ASSEMBLY_REPORT}.chr_names --threads 60 -Oz -o GRCh37.dbSNP153.vcf.gz GCF_000001405.25.gz;

Now, if I update the gemBS config, and then rerun gemBS prepare.

I now see the following error when running gemBS index

$ nice -n 10 gemBS index -t 70
2020-09-04 19:12:27,824 : gemBS reference /home/ucbtmog/d/phd/gembs/ref/human_g1k_v37.gemBS.ref already exists, skipping creation
2020-09-04 19:12:27,825 : Bisulphite Index /home/ucbtmog/d/phd/gembs/ref/human_g1k_v37.BS.gem already exists, skipping indexing
/home/ucbtmog/d/phd/gembs/ref/dbsnp/GRCh37.dbSNP153.vcf.gz
*** Error in `/home/ucbtmog/.asdf/installs/python/3.8.3/lib/python3.8/site-packages/gemBS/gemBSbinaries/dbSNP_idx': free(): invalid pointer: 0x00007efbfc3bf750
***
======= Backtrace: =========
/lib64/libc.so.6(+0x81489)[0x7efe7390c489]
/home/ucbtmog/.asdf/installs/python/3.8.3/lib/python3.8/site-packages/gemBS/gemBSbinaries/dbSNP_idx[0x404d20]
/home/ucbtmog/.asdf/installs/python/3.8.3/lib/python3.8/site-packages/gemBS/gemBSbinaries/dbSNP_idx[0x405051]
/lib64/libpthread.so.0(+0x7dd5)[0x7efe73c5fdd5]
/lib64/libc.so.6(clone+0x6d)[0x7efe73988ead]
======= Memory map: ========
00400000-0040e000 r-xp 00000000 fd:02 28502397                           /home/ucbtmog/.asdf/installs/python/3.8.3/lib/python3.8/site-packages/gemBS/gemBSbinaries/dbSNP_idx
0060d000-0060e000 r--p 0000d000 fd:02 28502397                           /home/ucbtmog/.asdf/installs/python/3.8.3/lib/python3.8/site-packages/gemBS/gemBSbinaries/dbSNP_idx
0060e000-0060f000 rw-p 0000e000 fd:02 28502397                           /home/ucbtmog/.asdf/installs/python/3.8.3/lib/python3.8/site-packages/gemBS/gemBSbinaries/dbSNP_idx
019ec000-1020c000 rw-p 00000000 00:00 0                                  [heap]
7efabc000000-7efabc043000 rw-p 00000000 00:00 0

...

7efbb4043000-7efbb8000000 ---p 00000000 00:00 0 2020-09-04 19:28:14,057 ERROR: Process '/home/ucbtmog/.asdf/installs/python/3.8.3/lib/python3.8/site-packages/gemBS/gemBSbinaries/dbSNP_idx' finished with -6
2020-09-04 19:28:14,058 ERROR: Reading from /home/ucbtmog/d/phd/gembs/ref/dbsnp/GRCh37.dbSNP153.vcf.gz
2020-09-04 19:28:14,058 ERROR: Added new prefix 0, 'rs'
2020-09-04 19:28:14,058 ERROR: Switching to sorted mode (ctg = HG237_PATCH)
2020-09-04 19:28:14,058 ERROR: Writing output HG237_PATCH
2020-09-04 19:28:14,058 ERROR: Writing output HG1350_HG959_PATCH
2020-09-04 19:28:14,058 ERROR: Writing output HG1591_PATCH
2020-09-04 19:28:14,059 ERROR: Writing output HG979_PATCH
2020-09-04 19:28:14,059 ERROR: Writing output HG1699_PATCH

...

ValueError: Error while executing dbSNP-indexer

IsmailM avatar Sep 08 '20 14:09 IsmailM

Hi,

I managed to fix the above gemBS index error by extracting chromosomes 1-22 XY from the VCF (it seems that it was the PATCH chromosomes in the dbsnp vcf that were causing the error).

However, I am still unable to extract any SNPs - even after ensuring the reference and dbSNP use the same chromosome formating.

Do you have any suggestions?

IsmailM avatar Sep 09 '20 14:09 IsmailM

Hi, Read the software help document carefully! http://statgen.cnag.cat/gemBS/v3/UserGuide/_build/html/pipelineCall.html

In gemBS software working directory execute the following commands can extract wgbs samples SNPs info:

-c, --cpg Output gemBS bed with cpg sites.

-N, --non-cpg Output gemBS bed with non-cpg sites.

-B, --bed-methyl Output bedMethyl files (bed and bigBed)

-S, --snps Output SNPs

-t THREADS, --extract-threads THREADS

gemBS extract --snp-db /home/user/database/gemBS/hg38/dbsnp_141.hg38.vcf.idx -S -c -N -B -t 32 Also, you can put codes into script file, and use nohup to submit gemBS extract analysis mission, like as "nohup bash gemBS_extract.sh > run_gemBS_extract.sh.log 2>&1 &"

The following is the gemBS software command help information: gemBS extract --help usage: gemBS extract [-h] [-j JOBS] [-n SAMPLE_NAME] [-b SAMPLE_BARCODE] [-s] [-W] [-q PHRED] [-I INFORM] [-M MIN_NC] [-H] [-R REF_BIAS] [-c] [-N] [-B] [-S] [-t THREADS] [--snp-list SNP_LIST] [--snp-db SNP_DB] [--dry-run] [--json JSON FILE] [--ignore-db] [--ignore-dep]

Extracts summary files from BCF files generated for all or a subset of samples to produce a series of summary output files. The detailed formats of the output files are given in the gemBS docuemntation. The default output are CpG files. These are BED3+8 format files with information on methylation and genotypes. A list of non-CpG sites in the same basic format can also be produced. Various options on filtering these files on genotype quality and coverage are available. By default the CpG files have 1 output line per CpG (so the information from the two strands is combined). This can be changed to give strand specific information using the -s option. Standard filtering strategy is to only output sites where the sample genotype is called as being homozygous CG/CG with a phred score >= to the theshold set using the -q option (default 20). Using the --allow-het option will allow heterozygous CpG sites to be included in the output. The sitest can also be filtered on minimum informative coverage using the -I option (default = 1). For non-CpG sites the strategy is to only output sites with a minimum number of non-converted reads. This level can be set using the --min-nc option (default = 1). A second set of extracted outputs that correspond to the ENCODE WGBS pipeline are also available using the --bed-methyl and --bigwig options. The --bed-methyl option will produce three files per sample for all covered sites in CpG, CHG and CHH context in BED9+5 format. Each of the files will also be generated in bigBed format for display in genome browsers. In addition a bigWig format file will be generated giving the methylation percentage at all covered cytosine sites (informative coverage > 0). If the --strand-specific option is given then two bigWig files will be geenrated - one for each strand. For the ENCODE output files, not further filtering is performed. In addition to the methylation result, SNP genotypes can also be extracted with the --snps options. By default, this will return a file with genotypes on all SNPs covered by the experiment that were in the dbSNP_idx file used for the calling stage. This selection can be refined uwing the --snp-list option, which is a file with a list of SNP ids, one id per line. An alternate dbSNP_idx file can also be supplied using the --snp-db option, allowing SNPs that were not in the original dbSNP_idx file used for calling to be extracted. The --dry-run option will output a list of the merging operations that would be run by the merge- bcfs command without executing any of the commands. The --json <JSON OUTPUT> options is similar to --dry-run, but writes the commands to be executed in JSON format to the supplied output file, including information about the input and output files for the commands. The --ignore-db option modifies the --dry- run and --json options such that the database is not consulted (i.e., gemBS assumes that no calling has already been completed but that all dependencies (i.e., BAM files) are available. The --ignore-dep option is similar - it ignores dependencies, but does check whether a task has already been completed.

optional arguments: -h, --help show this help message and exit -j JOBS, --jobs JOBS Number of parallel jobs -n SAMPLE_NAME, --sample-name SAMPLE_NAME Name of sample to be filtered -b SAMPLE_BARCODE, --barcode SAMPLE_BARCODE Barcode of sample to be filtered -s, --strand-specific Output separate lines in CpG file for each strand. -W, --bigwig-strand-specific Output separate bigWig files for each strand. -q PHRED, --phred-threshold PHRED Min threshold for genotype phred score. -I INFORM, --min-inform INFORM Min threshold for informative reads. -M MIN_NC, --min-nc MIN_NC Min threshold for non-converted reads for non CpG sites. -H, --allow-het Allow both heterozygous and homozgyous sites. -R REF_BIAS, --reference-bias REF_BIAS Allow both heterozygous and homozgyous sites. -c, --cpg Output gemBS bed with cpg sites. -N, --non-cpg Output gemBS bed with non-cpg sites. -B, --bed-methyl Output bedMethyl files (bed and bigBed) -S, --snps Output SNPs -t THREADS, --extract-threads THREADS Number of extra threads for extract step --snp-list SNP_LIST List of SNPs to output --snp-db SNP_DB dbSNP_idx processed SNP idx --dry-run Output mapping commands without execution --json JSON FILE Output JSON file with details of pending commands --ignore-db Ignore database for --dry-run and --json commands --ignore-dep Ignore dependencies for --dry-run and --json commands

JING-XINXING avatar Oct 09 '22 05:10 JING-XINXING