MACS
MACS copied to clipboard
MACS2 stalls pre-computing pvalue-qvalue table..
Hello,
MACS2 stalls at the following line:
#3 Pre-compute pvalue-qvalue table...
I have two BAM files, one converted to BAM from Bowtie2 output and the other which has underwent some processing. Here is an example alignment from the raw BAM file:
HISEQ:127:C7HJ7ANXX:1:1312:20025:49343 163 chr1 10069 30 50M = 10075 56 ACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCAACCCTAACC CCCCCGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGG MD:Z:50 PG:Z:MarkDuplicates XG:i:0 NM:i:0 XM:i:0 XN:i:0 XO:i:0 AS:i:0 XS:i:0 YS:i:0 YT:Z:CP
and here is an example from the processed BAM file:
HISEQ:127:C7HJ7ANXX:1:1312:20025:49343/1 16 chr1 10070 255 50M * 0 0 * *
MACS2 stalls when I input the processed BAM file but not the raw file. The processed BAM file was created by converting the raw BAM to a BED file, processing the data with AWK, and then converting back to a BAM file (both conversions done using BEDtools).
I call peaks using the following command:
macs2 callpeak -t sample.bam -f BAMPE -g hs -s 50 -n sample -q 0.01 --nomodel --nolambda --keep-dup all --call-summits
Thank you
Is it related to #86?
I confirm there is still a problem in current master (9ddacb0). Using the following SAM input file:
FCC5W8BACXX:1:1101:16503:36846#ATCCAAGC/2 16 chr2L 22700622 44 100M * 0 0 CGAGATCAACGCACAACTCGAGGTCGAAATCGAATTCGGCAAAAGACGACTTCGCATAAACATCCTAATACTACCAGGGGTGGAATTTCCTCACCCCGAA ^b`_YJTG[ab^YSGS]a^Ya^^[X]_]]ca_\V^cdcdeedddef_\Zeabgggcfbgece_aaehgcaaaYdffaghgbfde_^cgeebWO`ccc__^ AS:i:200 XN:i:0 XM:i:0 XO:i:0 XG:i:0 NM:i:0 MD:Z:100 YT:Z:UU
and the following command line:
macs2 callpeak --outdir delme --name macs2 --format SAM --nomodel --treatment input.sam
macs2 get stalled at step "Pre-compute pvalue-qvalue table".
I found this thread when troubleshooting run times for jobs that I had running, where MACS2 was part of an analysis pipeline to process ChIPseq data. I looked at the error files as the jobs had been running for several days, which was odd. Tail of the error files showed the jobs were hung at the same step in MACS2 reported here - issue #101 and #86.
INFO @ Fri, 29 Jul 2022 10:03:32: #3 Call peaks... INFO @ Fri, 29 Jul 2022 10:03:32: #3 Pre-compute pvalue-qvalue table...
I've tried using macs2 --version 2.2.6 and 2.2.7.1 using a HPC grid, and tried using 1,4, and 8 slots, expanding buffer, requested nodes with > 10GB RAM
The odd part is that when I ran MACS2 with all default parameters the jobs finished in a few hours. macs2 callpeak -f BAM -t /pathtomyfile/hisat2.sorted.bam -c /pathtomyfile/control_hisat2.sorted.bam -g 1e-9 -n sample1
However, when I add the the following parameters the program is hung at the indicated steps: --nomodel --extsize 147
or
--nomodel --extsize 147 --broad --broad-cutoff 0.1
Single threads being used seem to be computing efficiently at 100%, but the jobs simply spin for up to 7 days. These run times seem way too long, as file read in for macs2 is super fast - 8.5G BAM files load and are read in (steps 1-3) ~ 10 minutes.
So, this seems like a bug rather than a compute environment issues. Do you have suggestions on potential fixes?
One other note - I recently used the same versions of MACS2 on the same compute system on ATACseq data and used similar --no-model parameters, but only had treatment files, no control input files. These jobs used even larger files and finished within 12-24 hours.
A follow up regarding this issue - I found that I get different results with different files. I tested pre-processed BED files from SRA - also ChIPseq data, and macs2 or macs3 completed peak calling in about 30 minutes, no issues. However, I downloaded the raw data from SRA and processed with my pipeline and I get the same issue - the process hangs at Pre-compute pvalue-quvalue table.. My pipeline uses trimmomatic to trim illimina adaptors, hisat2 alignment (-p 8 -k 1 --no-spliced-alignment), bamsormadup to remove duplicates and sort bam output. macs2 doesn't throw and error, but just hangs. Any ideas would be much appreciated.
@jcohn42 Is it possible that the raw data processed from SRA contains many random chromosomes? The process hangs at p-q table calculation is an old issue I dealt with years ago where when the hash table (we used khash) becomes too big. Please let me know how it goes with macs3 since we are working on it. Another fix/idea I got is that maybe we should provide an option to skip converting p-value to q-value entirely to save time because people right now have better ways to control the false positive rate such as IDR on ranked peaks based on p-value or foldchange.
@taoliu, yes that is right, the SRA data does have many chromosomes (10 maize chromosomes and a few dozed contigs). They are present in the pre-processed BED files downloaded from SRA, I used the same reference genome for aligning the raw data using hisat2, so the bam files resulting have the same chromosomes. The only major difference I noted is that the external BED files seem to have been pre-filtered for alignment quality; $cut -5
@jcohn42 Could you share me the SRA accession number so that I can give it a try?
@taoliu The files come from https://www.ncbi.nlm.nih.gov/bioproject/PRJNA492464 GSM3397995 Chip B73 leaf H3K56ac rep1 SRR7889772 GSM3398001 Chip B73 leaf input rep1 SRR7889778
Here is my macs3 call with the processed BED files that worked beautifully, completing within 30 minutes macs3 callpeak -f BED -t /scratch-large/5-biannual/COH3780_chip/TEST_ext/GSM3397995_mapped_chip_B73_leaf_H3K56ac_rep1.bed -c /scratch-large/5- biannual/COH3780_chip/TEST_ext/GSM3398001_mapped_chip_B73_leaf_input_rep1.bed -g 2100000000 -n TESText --outdir /scratch-large/5-biannual/COH3780_chip/TEST_ext --nomodel --extsize 147
FASTQ files were aligned to B73_v4 genome, an internal copy, but should be very similar to B73 RefGen_v4, AGPv4, available from here: https://www.maizegdb.org/assembly. Note, there is a newer genome, but this was the genome used to generate the processed files tested.
I used the same macs3 call, with the change "-f BAM" and -t <deduplicated, sorted bam files for test> -c <deduplicated, sorted bam file for input". No errors, but still paused on #3 Pre-compute pvalue-qvalue table..., running now ~ 20 hours.
Do you have recommendations for buffer size, # slots, node parameters, etc? I've requested > 10G RAM, which should be fine, I think?
@taoliu, can you clarify if the problem you described was 'fixed' "The process hangs at p-q table calculation is an old issue I dealt with years ago where when the hash table (we used khash) becomes too big." What was the fix, is there some work around, like adjusting input file sizes, processing one chromosome at a time, etc, that I could use to get MACS2 (or MACS3) working for my data? Are there particular filters of aligned reads that should be used?
Thanks for your recommendations and help.
@taoliu I discovered the problem - it was an issue with my genome size input - a typo! Thanks again for your replies; btw, MACS3 is also working quite well with my data and the external data. Thanks for providing a very useful tool!