gemBS Specify target regions for snp calling && Extract SNPs only within target regions

Hi Simon,

Many thanks for your help in my last issue. It has been fixed. I just have another question: does gemBS have any parameter to let you specify target regions and only extract the SNPs inside them?

For the other snp-calling tools bis-snp and bs-snper, they have the functionality to call snps within target regions only. Normally, you can input a .bed file as a parameter (normally, it contains 3 columns: chr, start_position, end_position). The snp-callers will only return called snps that match chromosome and within those regions.

I went through the documentation of gemBS again but unfortunately did not find any information related to this functionality. The closest I got is "--snp-list" in "gemBS extract". There is a way for me to extract a list of SNPs (rs numbers) from dbsnp that fall into target regions, and use that list of rs-numbers as a input of "--snp-list" to help extract what I need.

If gemBS has an option for you to input a file of target regions but I missed it, please point it out to me. Also, just want to make sure with you: what is the input format of the parameter "gemBS extract -snp-list"? Is it a .bed file or .txt file with snp rs-ids?

@heathsc Thank you very much for your attention! Your help is greatly appreciated.

Andy Wang

Research Assistant Daley Lab | UBC Center for Heart & Lung Innovation Room 166, 1081 Burrard Street | St. Paul's Hospital Vancouver, B.C. Canada, V6Z 1Y6

Mar 16 '20 23:03 AndyWangSFU

Hi Andy,

While it can be done to force gemBS to only give calls for certain regions (the underlying tools do allow this, but this functionality is not exposed by gemBS) but why do you want to do this? The output files are all indexed and so you can extract given regions from the CpG or snp output files using tabix, and the bigBed and bigWig files using the normal viewers. The reason that the extract command works over the whole genome is because of the bigBed and bigWig generation process - merging chromosome or regions specific files is an expensive and complicated process so it generally makes more sense to generate the file over the whole genome and then extract the regions from the final output files.

If you do need to restrict processing to a particular region or set of regions for reasons of space or computational time, let me know and I can give some pointers as to how to do this.

Cheers, Simon

On Tue, Mar 17, 2020 at 12:51 AM Andy Wang [email protected] wrote:

Hi Simon,

Many thanks for your help in my last issue. It has been fixed. I just have another question: does gemBS have any parameter to let you specify target regions and only extract the SNPs inside them?

For the other snp-calling tools bis-snp and bs-snper, they have the functionality to call snps within target regions only. Normally, you can input a .bed file as a parameter (normally, it contains 3 columns: chr, start_position, end_position). The snp-callers will only return called snps that match chromosome and within those regions.

I went through the documentation of gemBS again but unfortunately did not find any information related to this functionality. The closest I got is "--snp-list" in "gemBS extract". There is a way for me to extract a list of SNPs (rs numbers) from dbsnp that fall into target regions, and use that list of rs-numbers as a input of "--snp-list" to help extract what I need.

If gemBS has an option for you to input a file of target regions but I missed it, please point it out to me. Also, just want to make sure with you: what is the input format of the parameter "gemBS extract -snp-list"? Is it a .bed file or .txt file with snp rs-ids?

Thank you very much for your attention! Your help is greatly appreciated.

Andy Wang

Research Assistant Daley Lab | UBC Center for Heart & Lung Innovation Room 166, 1081 Burrard Street | St. Paul's Hospital Vancouver, B.C. Canada, V6Z 1Y6

— You are receiving this because you are subscribed to this thread. Reply to this email directly, view it on GitHub https://github.com/heathsc/gemBS/issues/72, or unsubscribe https://github.com/notifications/unsubscribe-auth/AAY465ZGGNVA2YKGACTPL2TRH23RRANCNFSM4LMZKL5Q .

Mar 17 '20 06:03 heathsc

Hi Simon,

Thank you for your quick response. The main problem is unfortunately the disk issue... I am surprised that for only 1 sample, "gemBS call" created around 100GB .bcf files (24 chromosomes + 22 pools for alt, decoy or unknown chromosomes. 2.21GB .bcf file for each). And to merge them into a single .bcf file, it requires at least additional 100GB (my pipeline stops here).

To be honest, I did not expect it to be that large and our cluster disk was not well-equipped for this. But there could be some mistake happening before. For now, I am trying to re-run the snp calling and possibly, see whether we could save the "CpG" results only, not "CHH" or the others to save space and time.

I do expect that if we could specify the regions at first, the computational time will be faster than getting all SNPs and extracting them later. If you could give me a hint on how to do it, I will be very grateful on this.

Thanks again, Andy

Mar 17 '20 18:03 AndyWangSFU

Sorry one more question:

I just run "gemBS call" and found an error in the bs_call.err report file. It looks like gemBS fails loading the dbSNP index file. Do you know the reason?

Loading dbSNP header from indexes/dbSNP_gemBS.idx Invalid format [E::idx_find_and_load] Could not retrieve index file for '/home/awang01/gemBS/test/mapping/CD294ANXX_6_GCCAAT/CD294ANXX_6_GCCAAT.bam' Additional threads: 3 2 2 Loading reference sequence index Sequence index read in successfully

As suggested, the dbSNP index files I used come from dbSNP ftp server (ftp://ftp.ncbi.nih.gov/snp/organisms/human_9606_b151_GRCh38p7/BED/). From the "gemBS index" step, there is no error returned for dbSNP indexing. Everything looks good, like:

$ gemBS index : : Command index started at 2020-03-17 12:43:42.081160 : : Creating index : Index done: <function index at 0x7f9c68e123a0> : Contig md5 file created: indexes/Homo_sapiens_assembly38.gemBS.contig_md5 dbsnp/*.bed.gz : dbSNP index done: /home/awang01/gemBS/test/indexes/dbSNP_gemBS.idx indexes/Homo_sapiens_assembly38.BS.gem indexes/Homo_sapiens_assembly38.BS : Contig sizes file done: /home/awang01/gemBS/test/indexes/Homo_sapiens_assembly38.contig.sizes

And the parameters in configuration file are:

dbSNP_files = dbsnp/*.bed.gz dbSNP_index = indexes/dbSNP_gemBS.idx

I really don't know why it fails loading the dbSNP index... Thank you very much for your attention.

Best, Andy

Mar 18 '20 01:03 AndyWangSFU

Hi Andy,

The format for the dbSNP index has changed a lot in recent releases of gemBS. Was the index created with the same version of gemBS that you are using for the calling?

Simon

On Wed, Mar 18, 2020 at 2:19 AM Andy Wang [email protected] wrote:

Sorry one more question:

I just run "gemBS call" and found an error in the bs_call.err report file. It looks like gemBS fails loading the dbSNP index file. Do you know the reason?

Loading dbSNP header from indexes/dbSNP_gemBS.idx Invalid format [E::idx_find_and_load] Could not retrieve index file for '/home/awang01/gemBS/test/mapping/CD294ANXX_6_GCCAAT/CD294ANXX_6_GCCAAT.bam' Additional threads: 3 2 2 Loading reference sequence index Sequence index read in successfully

From the "gemBS index" step, there is no error returned for dbSNP indexing, like:

$ gemBS index : : Command index started at 2020-03-17 12:43:42.081160 : : Creating index : Index done: <function index at 0x7f9c68e123a0> : Contig md5 file created: indexes/Homo_sapiens_assembly38.gemBS.contig_md5 dbsnp/*.bed.gz : dbSNP index done: /home/awang01/gemBS1/test/indexes/dbSNP_gemBS.idx indexes/Homo_sapiens_assembly38.BS.gem indexes/Homo_sapiens_assembly38.BS : Contig sizes file done: /home/awang01/gemBS1/test/indexes/Homo_sapiens_assembly38.contig.sizes

Thank you very much for your attention.

Best, Andy

— You are receiving this because you were mentioned. Reply to this email directly, view it on GitHub https://github.com/heathsc/gemBS/issues/72#issuecomment-600376327, or unsubscribe https://github.com/notifications/unsubscribe-auth/AAY4655NIESJGWCPE35GJEDRIAOTJANCNFSM4LMZKL5Q .

Mar 18 '20 07:03 heathsc

Hi Andy,

If you use the underlying tools (bs_call, mextr and snpxtr for the calling, methylation and SNP extraction) there are options for restricting operations to a list of regions. If you run gemBS with the --loglevel debug option you can see the options that are normally used for normal operation.

Simon

On Tue, Mar 17, 2020 at 7:40 PM Andy Wang [email protected] wrote:

Hi Simon,

Thank you for your quick response. The main problem is unfortunately the disk issue... I am surprised that for only 1 sample, "gemBS call" created around 100GB .bcf files (24 chromosomes + 22 pools for alt, decoy or unknown chromosomes. 2.21GB .bcf file for each). And to merge them into a single .bcf file, it requires at least additional 100GB. To be honest, I did not expect it to be that large and our cluster disk was not well-equipped for this.

I do expect that if we could specify the regions at first, the computational time will be faster than getting all SNPs and extracting them later. If you could give me a hint on how to do it, I will be very grateful on this.

Thanks again, Andy

— You are receiving this because you were mentioned. Reply to this email directly, view it on GitHub https://github.com/heathsc/gemBS/issues/72#issuecomment-600234490, or unsubscribe https://github.com/notifications/unsubscribe-auth/AAY4656KGGM7SQOHBKKM773RH67YJANCNFSM4LMZKL5Q .

Mar 18 '20 09:03 heathsc

Hi Andy, The format for the dbSNP index has changed a lot in recent releases of gemBS. Was the index created with the same version of gemBS that you are using for the calling? Simon … On Wed, Mar 18, 2020 at 2:19 AM Andy Wang @.**> wrote: Sorry one more question: I just run "gemBS call" and found an error in the bs_call.err report file. It looks like gemBS fails loading the dbSNP index file. Do you know the reason? Loading dbSNP header from indexes/dbSNP_gemBS.idx Invalid format [E::idx_find_and_load] Could not retrieve index file for '/home/awang01/gemBS/test/mapping/CD294ANXX_6_GCCAAT/CD294ANXX_6_GCCAAT.bam' Additional threads: 3 2 2 Loading reference sequence index Sequence index read in successfully From the "gemBS index" step, there is no error returned for dbSNP indexing, like: $ gemBS index : : Command index started at 2020-03-17 12:43:42.081160 : : Creating index : Index done: <function index at 0x7f9c68e123a0> : Contig md5 file created: indexes/Homo_sapiens_assembly38.gemBS.contig_md5 dbsnp/.bed.gz : dbSNP index done: /home/awang01/gemBS1/test/indexes/dbSNP_gemBS.idx indexes/Homo_sapiens_assembly38.BS.gem indexes/Homo_sapiens_assembly38.BS : Contig sizes file done: /home/awang01/gemBS1/test/indexes/Homo_sapiens_assembly38.contig.sizes Thank you very much for your attention. Best, Andy — You are receiving this because you were mentioned. Reply to this email directly, view it on GitHub <#72 (comment)>, or unsubscribe https://github.com/notifications/unsubscribe-auth/AAY4655NIESJGWCPE35GJEDRIAOTJANCNFSM4LMZKL5Q .

Hi Simon,

Yeah. I rerun the whole snp calling process and got this latest dbSNP index from gemBS 3.5.0. Unfortunately, the "gemBS call" step still shows an invalid format error on dbSNP index.

I suppose these files are qualified to what gemBS requires in its documentation of dbSNP:

the user should download the SNP files from dbSNP (they form a collection of BED3+3 files (one per contig) with the public SNP ID being in column 4).

There's no much information on error message... If you know where goes wrong, please tell me.

Thanks, Andy

Mar 19 '20 00:03 AndyWangSFU

Hi Andy, If you use the underlying tools (bs_call, mextr and snpxtr for the calling, methylation and SNP extraction) there are options for restricting operations to a list of regions. If you run gemBS with the --loglevel debug option you can see the options that are normally used for normal operation. Simon … On Tue, Mar 17, 2020 at 7:40 PM Andy Wang @.***> wrote: Hi Simon, Thank you for your quick response. The main problem is unfortunately the disk issue... I am surprised that for only 1 sample, "gemBS call" created around 100GB .bcf files (24 chromosomes + 22 pools for alt, decoy or unknown chromosomes. 2.21GB .bcf file for each). And to merge them into a single .bcf file, it requires at least additional 100GB. To be honest, I did not expect it to be that large and our cluster disk was not well-equipped for this. I do expect that if we could specify the regions at first, the computational time will be faster than getting all SNPs and extracting them later. If you could give me a hint on how to do it, I will be very grateful on this. Thanks again, Andy — You are receiving this because you were mentioned. Reply to this email directly, view it on GitHub <#72 (comment)>, or unsubscribe https://github.com/notifications/unsubscribe-auth/AAY4656KGGM7SQOHBKKM773RH67YJANCNFSM4LMZKL5Q .

Hi Simon,

Thanks for your reply. We are quite struggled with disk and memory issues. Even if we cleaned 200GB for this only one testing sample, the merge step still failed. I suppose the merged .bcf file requires at least 100GB, in our case.

Could you please address "using underlying tools" more in detail? I tried "gemBS bs_call/mextr/snpxtr" but there are no such commands. There is also no special information from "gemBS -h" help page. All commands it shows are typical ones like "prepare", "map", etc.

For instance, say I have a file named "region.bed" which contains 3 columns: "chr1 | 10000 | 20000". How do I include this file in snp-calling steps to force gemBS to call SNPs only inside those regions? I would appreciate if you could give us an example on how the command should look like.

Take care and stay safe on the covid-19 spread.

Thanks, Andy

Mar 19 '20 01:03 AndyWangSFU

Hi Simon,

I checked the underlying tools of mextr and snpxtr and found that there are options for input a region files, like:

-r, --regions restrict to comma separated list of regions -R, --regions-file restrict to regions listed in file

However, for the most important step bs_call (at present, we stuck here due to disk issues), I did not see such parameters. There is only:

/usr/local/lib/python3.8/site-packages/gemBS/gemBSbinaries/bs_call -h USE: ./bs_call [ARGS]... [Operations] --haploid|-1 Assume genome is haploid --keep-duplicates|-d Don't merge duplicate reads --ignore-duplicates|-e Ignore duplicate flag from SAM/BAM files --keep-unmatched|-k Don't discard reads that don't form proper pairs --right-trim|-R Bases to trim from right of read pair --left-trim|-L Bases to trim from left of read pair --blank-trim|-B Don't use trimmed bases for genotype estimation --mapq-threshold|-q Set MAPQ threshold for selecting reads (default 20) --bq-threshold|-Q Set base quality threshold for calling (default 20) --max-template-length|-l Set maximum template length for a pair (default 1000) [I/O] --output-type|-O <b|u|z|v|> b: compressed BCF, u: uncompressed BCF, z: compressed VCF, v: uncompressed VCF [v] --report-file <output JSON file name> --output|-o --sample|-n SAMPLE --reference|-r (MultiFASTA/FASTA) --contig-bed|-C (BED) --contig-sizes|-s --dbsnp|-D (dbSNP processed file) --all-positions|-A --benchmark-mode [Model] --conversion|-c , Set under and over conversion rates (default 0.01,0.05) --reference-bias Set bias to reference homozygote (default 2) [Misc] --verbose|-v --threads|-t | ,, Set additional threads for calculation, input and output --help|-h

Is there anything I miss? For now, I am going to find a smaller sample and see how it goes for gemBS call. (Hopefully the call result is not that large).

Thanks and take care with the covid19 mess.

Best regards, Andy

Mar 21 '20 00:03 AndyWangSFU

You need the --contig-bed option - you can supply a bed file with regions and just those regions will be called.

Simon

On Sat, Mar 21, 2020 at 1:05 AM Andy Wang [email protected] wrote:

Hi Simon,

I checked the underlying tools of mextr and snpxtr and found that there are options for input a region files, like:

-r, --regions restrict to comma separated list of regions -R, --regions-file restrict to regions listed in file

However, for the most important step bs_call (as we stuck here now), I did not see such parameters. There is only:

/usr/local/lib/python3.8/site-packages/gemBS/gemBSbinaries/bs_call -h USE: ./bs_call [ARGS]... [Operations] --haploid|-1 Assume genome is haploid --keep-duplicates|-d Don't merge duplicate reads --ignore-duplicates|-e Ignore duplicate flag from SAM/BAM files --keep-unmatched|-k Don't discard reads that don't form proper pairs --right-trim|-R Bases to trim from right of read pair --left-trim|-L Bases to trim from left of read pair --blank-trim|-B Don't use trimmed bases for genotype estimation --mapq-threshold|-q Set MAPQ threshold for selecting reads (default 20) --bq-threshold|-Q Set base quality threshold for calling (default 20) --max-template-length|-l Set maximum template length for a pair (default 1000) [I/O] --output-type|-O <b|u|z|v|> b: compressed BCF, u: uncompressed BCF, z: compressed VCF, v: uncompressed VCF [v] --report-file --output|-o --sample|-n SAMPLE --reference|-r (MultiFASTA/FASTA) --contig-bed|-C (BED) --contig-sizes|-s --dbsnp|-D (dbSNP processed file) --all-positions|-A --benchmark-mode [Model] --conversion|-c , Set under and over conversion rates (default 0.01,0.05) --reference-bias Set bias to reference homozygote (default 2) [Misc] --verbose|-v --threads|-t | ,, Set additional threads for calculation, input and output --help|-h

Unfortunately, we are facing disk issues and stuck in the "gemBS call" step. Is there anything I miss?

Best regards, Andy

— You are receiving this because you were mentioned. Reply to this email directly, view it on GitHub https://github.com/heathsc/gemBS/issues/72#issuecomment-601960003, or unsubscribe https://github.com/notifications/unsubscribe-auth/AAY4657LPX7YZWCAD4PHND3RIQAC7ANCNFSM4LMZKL5Q .

Mar 21 '20 06:03 heathsc

Hi Simon,

I haven't tried the underlying tools yet. I am using a smaller snp sample to test. The disk is not an issue. However, I get an error saying something like "Chromosome blocks not continuous":

[E::hts_idx_push] Chromosome blocks not continuous index: failed to create index for "/home/awang01/gemBS1/test1/calls/CDMP9ANXX_8_CCGTCC/CDMP9ANXX_8_CCGTCC.bcf" 2020-03-24 16:20:16,293 DEBUG: Process '/usr/local/lib/python3.8/site-packages/gemBS/bin/bcftools' finished with 255 2020-03-24 16:20:16,293 ERROR: Process '/usr/local/lib/python3.8/site-packages/gemBS/bin/bcftools' finished with 255 Exception in thread Thread-2: Traceback (most recent call last): File "/usr/lib64/python3.8/threading.py", line 932, in _bootstrap_inner self.run() File "/usr/local/lib/python3.8/site-packages/gemBS/init.py", line 1119, in run bsConcat(list_bcfs, sample, self.bsCall.merge_threads, fname, self.benchmark_mode) File "/usr/local/lib/python3.8/site-packages/gemBS/init.py", line 1314, in bsConcat raise ValueError("Error while Indexing BCF file.") ValueError: Error while Indexing BCF file. : Methylation call done, samples performed: CDMP9ANXX_8_CCGTCC

I checked the error file (I also attached it) that gemBS returns, and probably found where goes wrong:

Concatenating /home/awang01/gemBS1/test1/calls/CDMP9ANXX_8_CCGTCC/CDMP9ANXX_8_CCGTCC_chr15.bcf 58.088843 seconds Concatenating /home/awang01/gemBS1/test1/calls/CDMP9ANXX_8_CCGTCC/CDMP9ANXX_8_CCGTCC_chr15_KI270905v1_alt.bcf

It looks like gemBS mistakenly put an alternative chromosome into a regular one and created an individual "pool" for it. The alternative chromosomes are supposed to be combined in "pools".

I just substitute the .bam file with a smaller one and did not rerun the "gemBS index" step. I suppose the index files you created before could be used directly. Do you know how to fix this issue?

bcf_concat_CDMP9ANXX_8_CCGTCC.txt

Thanks, Andy

Mar 24 '20 23:03 AndyWangSFU

gemBS gemBS copied to clipboard

Specify target regions for snp calling && Extract SNPs only within target regions

Andy Wang

gemBS
gemBS copied to clipboard