ABC-Enhancer-Gene-Prediction
ABC-Enhancer-Gene-Prediction copied to clipboard
difficulty reproducing the ABC score of K562's in Nature 2021
Hi, We are trying to reproduce the ABC scores of the K562s (Fulco2019) scores in Nature 2021. We keep getting values that are somewhat correlated but not quite. the first step of candidate enhancer coordinates are also not matching up. We also tried forgoing step 1 and using Fulco2019 enhancer candidate regions instead, but we still were not able to get the same scores. Could you look at our commands and see what we are doing wrong? Thank you for you help.
#juicer_tools
wget http://hicfiles.tc4ga.com.s3.amazonaws.com/public/juicer/juicer_tools.1.7.5_linux_x64_jcuda.0.8.jar
ln -s juicer_tools.1.7.5_linux_x64_jcuda.0.8.jar juicer_tools.jar
#gather data
cd example_wg/input_data
wget http://hgdownload.cse.ucsc.edu/goldenpath/hg19/encodeDCC/wgEncodeUwDnase/wgEncodeUwDnaseK562AlnRep1.bam
wget http://hgdownload.cse.ucsc.edu/goldenpath/hg19/encodeDCC/wgEncodeUwDnase/wgEncodeUwDnaseK562AlnRep2.bam
wget https://www.encodeproject.org/files/ENCFF384ZZM/@@download/ENCFF384ZZM.bam
conda activate final-abc-env
samtools index wgEncodeUwDnaseK562AlnRep1.bam
samtools index wgEncodeUwDnaseK562AlnRep2.bam
samtools index ENCFF384ZZM.bam
cd ../..
cp example_chr22/input_data/Expression/K562.ENCFF934YBO.TPM.txt example_wg/input_data/
#Download hic matrix file from juicebox
python src/juicebox_dump.py \
--hic_file https://hicfiles.s3.amazonaws.com/hiseq/k562/in-situ/combined_30.hic \
--juicebox "java -jar juicer_tools.jar" \
--outdir example_wg/input_data/HiC/raw/
#Fit HiC data to powerlaw model and extract parameters
python src/compute_powerlaw_fit_from_hic.py \
--hicDir example_wg/input_data/HiC/raw/ \
--outDir example_wg/input_data/HiC/raw/powerlaw/ \
--maxWindow 1000000 \
--minWindow 5000 \
--resolution 5000
#Define candidate enhancer regions
conda activate macs-py2.7
macs2 callpeak -t example_wg/input_data/wgEncodeUwDnaseK562AlnRep1.bam -n wgEncodeUwDnaseK562AlnRep1.macs2 -f BAM -g hs -p .1 --call-summits --outdir example_wg/ABC_output/Peaks/
conda activate final-abc-env
bedtools sort -faidx reference/chr_sizes -i example_wg/ABC_output/Peaks/wgEncodeUwDnaseK562AlnRep1.macs2_peaks.narrowPeak > example_wg/ABC_output/Peaks/wgEncodeUwDnaseK562AlnRep1.macs2_peaks.narrowPeak.sorted
python src/makeCandidateRegions.py \
--narrowPeak example_wg/ABC_output/Peaks/wgEncodeUwDnaseK562AlnRep1.macs2_peaks.narrowPeak.sorted \
--bam example_wg/input_data/wgEncodeUwDnaseK562AlnRep1.bam \
--outDir example_wg/ABC_output/Peaks/ \
--chrom_sizes reference/chr_sizes \
--regions_blocklist reference/wgEncodeHg19ConsensusSignalArtifactRegions.bed \
--regions_includelist reference/RefSeqCurated.170308.bed.CollapsedGeneBounds.TSS500bp.bed \
--peakExtendFromSummit 250 \
--nStrongestPeaks 150000
#Quantifying Enhancer Activity
python src/run.neighborhoods.py \
--candidate_enhancer_regions example_wg/ABC_output/Peaks/wgEncodeUwDnaseK562AlnRep1.macs2_peaks.narrowPeak.sorted.candidateRegions.bed \
--genes reference/RefSeqCurated.170308.bed.CollapsedGeneBounds.bed \
--H3K27ac example_wg/input_data/ENCFF384ZZM.bam \
--DHS example_wg/input_data/wgEncodeUwDnaseK562AlnRep1.bam,example_wg/input_data/wgEncodeUwDnaseK562AlnRep2.bam \
--expression_table example_wg/input_data/K562.ENCFF934YBO.TPM.txt \
--chrom_sizes reference/chr_sizes \
--ubiquitously_expressed_genes reference/UbiquitouslyExpressedGenesHG19.txt \
--cellType K562 \
--outdir example_wg/ABC_output/Neighborhoods/
#Computing the ABC Score
python src/predict.py \
--enhancers example_wg/ABC_output/Neighborhoods/EnhancerList.txt \
--genes example_wg/ABC_output/Neighborhoods/GeneList.txt \
--HiCdir example_wg/input_data/HiC/raw/ \
--chrom_sizes reference/chr_sizes \
--hic_resolution 5000 \
--scale_hic_using_powerlaw \
--threshold .02 \
--cellType K562 \
--outdir example_wg/ABC_output/Predictions/ \
--make_all_putative \
--chromosome chr1,chr2,chr3,chr4,chr5,chr6,chr7,chr8,chr10,chr11,chr12,chr13,chr14,chr15,chr16,chr17,chr18,chr19,chr20,chr21,chr22,chrX
The neighborhoods and predictions should have several values stored in them (base activity, DHS, H3K27ac, HiC contact). Can you tell which values are diverging?
First the candidate enhancer coordinates are slightly shifted, so I can't compare the results directly if I run from step1. Below the first coordinates are from Fulco2019.STable6a, the second are the coordinates that I get from running step 1.
chrX 49969065 49969605 chrX 49968988 49969615 540
chrX 50191545 50192045 chrX 50191430 50191930 385
chrX 50191545 50192045 chrX 50191430 50191930 385
chrX 50191545 50192045 chrX 50191430 50191930 385
chrX 50191545 50192045 chrX 50191430 50191930 385
chrX 50322885 50323405 chrX 50322906 50323406 499
chrX 50322885 50323405 chrX 50322906 50323406 499
chrX 50322885 50323405 chrX 50322906 50323406 499
chrX 50322885 50323405 chrX 50322906 50323406 499
chrX 55054662 55054701 chrX 55054464 55054964 39
So, to make the results comparable, I take the Fulco2019 Stable6 coordinates and replace any of my candidates that overlap with the coordinates from Fulco2019. the comparisons are based on these candidates with overlaps replaced. It looks like the hiC scores are a bit less consistent compare to others. they are generally all well correlated but combined, the final abc only correlated by .88 Here are the general comparisons across the intermediate scores for those reported in Stable6:
> summary(compare)
chr.start.end_Gene Normalized_HiC_Contacts
chr1:29210606-29210645_EPB41 : 1 Min. : 0.1808
chr1:29211591-29211630_EPB41 : 1 1st Qu.: 1.2750
chr1:3691278-3691778_CEP104 : 1 Median : 3.0222
chr1:3691278-3691778_LRRC47 : 1 Mean : 6.9145
chr1:3691278-3691778_SMIM1 : 1 3rd Qu.: 7.1418
chr10:127505227-127505304_UROS: 1 Max. :100.0000
(Other) :5079
DHS.RPM. H3K27ac.RPM. Activity ABC_Score
Min. : 0.02231 Min. : 0.0000 Min. : 0.0000 Min. :0.0000000
1st Qu.: 1.32220 1st Qu.: 0.4094 1st Qu.: 0.8527 1st Qu.:0.0001706
Median : 2.70907 Median : 1.4331 Median : 1.9162 Median :0.0007154
Mean : 5.91887 Mean : 4.2879 Mean : 4.4218 Mean :0.0043124
3rd Qu.: 6.91047 3rd Qu.: 4.7086 3rd Qu.: 5.7820 3rd Qu.:0.0029054
Max. :78.25537 Max. :125.5971 Max. :98.8760 Max. :0.3227859
hic_contact DHS.RPM H3K27ac.RPM activity_base
Min. :0.000000 Min. : 0.0000 Min. : 0.0000 Min. : 0.0000
1st Qu.:0.000418 1st Qu.: 0.8789 1st Qu.: 0.4094 1st Qu.: 0.6994
Median :0.001285 Median : 1.7777 Median : 1.4329 Median : 1.5615
Mean :0.003158 Mean : 3.7926 Mean : 4.2764 Mean : 3.5277
3rd Qu.:0.003174 3rd Qu.: 4.4390 3rd Qu.: 4.8103 3rd Qu.: 4.6668
Max. :0.068595 Max. :50.1438 Max. :125.5808 Max. :79.3542
ABC.Score
Min. :0.000000
1st Qu.:0.000312
Median :0.000993
Mean :0.003518
3rd Qu.:0.003118
Max. :0.229734
> cor(compare$Normalized_HiC_Contacts, compare$hic_contact)
[1] 0.9589006
> cor(compare$DHS.RPM., compare$DHS.RPM)
[1] 0.983397
> cor(compare$H3K27ac.RPM., compare$H3K27ac.RPM)
[1] 0.9625272
> cor(compare$Activity, compare$activity_base)
[1] 0.9648836
> cor(compare$ABC_Score, compare$ABC.Score)
[1] 0.8818009
here are the pairwise plots if it helps. https://drive.google.com/file/d/1uzIv9cN6GxZKQVOaaZvwOqsRpniaO9iw/view?usp=sharing
I believe that the results from Fulco 2019 cannot be directly reproduced from the main branch of the code. I believe that Joe Nasser explained why in previous help thread in the github.
If you want to reproduce the Fulco 2019 results, there is a separate branch: https://github.com/broadinstitute/ABC-Enhancer-Gene-Prediction/tree/NG2019
Otherwise, we recommend using the main branch.
On Sat, Sep 25, 2021 at 11:11 AM mirahan @.***> wrote:
here are the pairwise plots if it helps.
https://drive.google.com/file/d/1uzIv9cN6GxZKQVOaaZvwOqsRpniaO9iw/view?usp=sharing http://url
— You are receiving this because you are subscribed to this thread. Reply to this email directly, view it on GitHub https://github.com/broadinstitute/ABC-Enhancer-Gene-Prediction/issues/60#issuecomment-927161493, or unsubscribe https://github.com/notifications/unsubscribe-auth/AA3LGEDW426L2E55UU5YYVTUDYGFDANCNFSM5EW4E5FQ . Triage notifications on the go with GitHub Mobile for iOS https://apps.apple.com/app/apple-store/id1477376905?ct=notification-email&mt=8&pt=524675 or Android https://play.google.com/store/apps/details?id=com.github.android&referrer=utm_campaign%3Dnotification-email%26utm_medium%3Demail%26utm_source%3Dgithub.
-- Jesse Engreitz, PhD Assistant Professor, Department of Genetics BASE Research Initiative, Betty Irene Moore Children’s Heart Center Stanford University School of Medicine www.engreitzlab.org +1 (206) 310-3935