ABC-Enhancer-Gene-Prediction
ABC-Enhancer-Gene-Prediction copied to clipboard
KeyError(key) in run.neighborhoods.py
So, all of the data I'm using is public. I'm trying to create customized ABC regions for lung data. Here is the error I encounter, along with the script I used to generate it:
cd ~/ABC-Enhancer-Gene-Prediction/
cd lung
mkdir DNase
cd DNase
wget https://www.encodeproject.org/files/ENCFF023JFG/@@download/ENCFF023JFG.bam
#sort and index
samtools sort -l 1 -m 7G -o IMR90_DNASE_ENCFF023JFG.sorted.bam -O bam -@ 32 ENCFF023JFG.bam
samtools index -@ 32 IMR90_DNASE_ENCFF023JFG.sorted.bam
cd ..
mkdir acetyl
cd acetyl
wget https://www.encodeproject.org/files/ENCFF146UYU/@@download/ENCFF146UYU.bam
wget https://www.encodeproject.org/files/ENCFF071VOI/@@download/ENCFF071VOI.bam
samtools merge IMR90_H3K27ME3_reps_1_2.bam *.bam
cd ..
cd ..
mkdir -p ./lung/ABC_output/Peaks/
cd reference
wget http://hgdownload.soe.ucsc.edu/goldenPath/hg38/bigZips/hg38.chrom.sizes
wget https://www.encodeproject.org/files/ENCFF356LFX/@@download/ENCFF356LFX.bed.gz
wget http://hgdownload.soe.ucsc.edu/admin/exe/linux.x86_64/liftOver
wget http://hgdownload.cse.ucsc.edu/goldenPath/hg19/liftOver/hg19ToHg38.over.chain.gz
gunzip *.*
mv ENCFF356LFX.bed hg38ENCODEblacklist.bed
./liftOver RefSeqCurated.170308.bed.CollapsedGeneBounds.TSS500bp.bed hg19ToHg38.over.chain RefSeqCurated.170308.bed.CollapsedGeneBounds.TSS500bp.hg38.bed RefSeqCurated.170308.bed.CollapsedGeneBounds.TSS500bp.hg19tohg38unmapped.bed
macs2 callpeak \
-t lung/DNase/IMR90_DNASE_ENCFF023JFG.sorted.bam \
-n IMR90_DNASE_ENCFF023JFG.macs2 \
-f BAM \
-g hs \
-p .1 \
--call-summits \
--outdir lung/ABC_output/Peaks/
python3 src/makeCandidateRegions.py \
--narrowPeak lung/ABC_output/Peaks/IMR90_DNASE_ENCFF023JFG.macs2_peaks.narrowPeak \
--bam lung/DNase/IMR90_DNASE_ENCFF023JFG.sorted.bam \
--outDir lung/ABC_output/Peaks/ \
--chrom_sizes reference/hg38.chrom.sizes \
--regions_blacklist reference/hg38ENCODEblacklist.bed \
--regions_whitelist reference/RefSeqCurated.170308.bed.CollapsedGeneBounds.TSS500bp.hg38.bed \
--peakExtendFromSummit 250 \
--nStrongestPeaks 150000
./reference/liftOver ./reference/RefSeqCurated.170308.bed.CollapsedGeneBounds.bed ./reference/hg19ToHg38.over.chain ./reference/RefSeqCurated.170308.bed.CollapsedGeneBounds.hg38.bed ./reference/RefSeqCurated.170308.bed.CollapsedGeneBounds.hg19tohg38unmapped.bed
mkdir -p ./lung/ABC_output/Neighborhoods/
mkdir -p ./lung/rnaseq
cd ./lung/rnaseq
wget https://www.encodeproject.org/files/ENCFF833OTW/@@download/ENCFF833OTW.tsv
wget https://www.encodeproject.org/documents/a4a6caad-61ab-4d35-820a-409cadce1121/@@download/attachment/RSEM_quantifications_specifications.txt
awk '$1 = $1 FS "0"' ./reference/hg38.chrom.sizes > ./reference/hg38.chrom.sizes.bed
cd ../..
python3 src/run.neighborhoods.py \
--candidate_enhancer_regions lung/ABC_output/Peaks/IMR90_DNASE_ENCFF023JFG.macs2_peaks.narrowPeak.candidateRegions.bed \
--genes reference/RefSeqCurated.170308.bed.CollapsedGeneBounds.hg38.bed \
--H3K27ac lung/acetyl/IMR90_H3K27ME3_reps_1_2.bam \
--DHS lung/DNase/IMR90_DNASE_ENCFF023JFG.sorted.bam \
--chrom_sizes reference/hg38.chrom.sizes \
--ubiquitously_expressed_genes reference/UbiquitouslyExpressedGenesHG19.txt \
--cellType IMR90 \
--outdir lung/ABC_output/Neighborhoods/
Error:
~/ABC-Enhancer-Gene-Prediction$ python3 src/run.neighborhoods.py \
> --candidate_enhancer_regions lung/ABC_output/Peaks/IMR90_DNASE_ENCFF023JFG.macs2_peaks.narrowPeak.candidateRegions.bed \
> --genes reference/RefSeqCurated.170308.bed.CollapsedGeneBounds.hg38.bed \
> --H3K27ac lung/acetyl/IMR90_H3K27ME3_reps_1_2.bam \
> --DHS lung/DNase/IMR90_DNASE_ENCFF023JFG.sorted.bam \
> --chrom_sizes reference/hg38.chrom.sizes \
> --ubiquitously_expressed_genes reference/UbiquitouslyExpressedGenesHG19.txt \
> --cellType IMR90 \
> --outdir lung/ABC_output/Neighborhoods/
Namespace(ATAC='', DHS='lung/DNase/IMR90_DNASE_ENCFF023JFG.sorted.bam', H3K27ac='lung/acetyl/IMR90_H3K27ME3_reps_1_2.bam', candidate_enhancer_regions='lung/ABC_output/Peaks/IMR90_DNASE_ENCFF023JFG.macs2_peaks.narrowPeak.candidateRegions.bed', cellType='IMR90', chrom_sizes='reference/hg38.chrom.sizes', default_accessibility_feature=None, enhancer_class_override=None, expression_table='', gene_name_annotations='symbol', genes='reference/RefSeqCurated.170308.bed.CollapsedGeneBounds.hg38.bed', genes_for_class_assignment=None, outdir='lung/ABC_output/Neighborhoods/', primary_gene_identifier='symbol', qnorm=None, skip_gene_counts=False, skip_rpkm_quantile=False, supplementary_features=None, tss_slop_for_class_assignment=500, ubiquitously_expressed_genes='reference/UbiquitouslyExpressedGenesHG19.txt', use_secondary_counting_method=False)
Traceback (most recent call last):
File "/home/ubuntu/.local/lib/python3.6/site-packages/pandas/core/indexes/base.py", line 2891, in get_loc
return self._engine.get_loc(casted_key)
File "pandas/_libs/index.pyx", line 70, in pandas._libs.index.IndexEngine.get_loc
File "pandas/_libs/index.pyx", line 101, in pandas._libs.index.IndexEngine.get_loc
File "pandas/_libs/hashtable_class_helper.pxi", line 1675, in pandas._libs.hashtable.PyObjectHashTable.get_item
File "pandas/_libs/hashtable_class_helper.pxi", line 1683, in pandas._libs.hashtable.PyObjectHashTable.get_item
KeyError: 'start'
The above exception was the direct cause of the following exception:
Traceback (most recent call last):
File "src/run.neighborhoods.py", line 97, in <module>
main(args)
File "src/run.neighborhoods.py", line 93, in main
processCellType(args)
File "src/run.neighborhoods.py", line 74, in processCellType
outdir = args.outdir)
File "/home/ubuntu/ABC-Enhancer-Gene-Prediction/src/neighborhoods.py", line 85, in annotate_genes_with_features
tss1kb = make_tss_region_file(genes, outdir, genome_sizes)
File "/home/ubuntu/ABC-Enhancer-Gene-Prediction/src/neighborhoods.py", line 111, in make_tss_region_file
sizes_pr = df_to_pyranges(read_bed(sizes + '.bed'))
File "/home/ubuntu/ABC-Enhancer-Gene-Prediction/src/tools.py", line 53, in df_to_pyranges
df['Start'] = df[start_col] - start_slop
File "/home/ubuntu/.local/lib/python3.6/site-packages/pandas/core/frame.py", line 2902, in __getitem__
indexer = self.columns.get_loc(key)
File "/home/ubuntu/.local/lib/python3.6/site-packages/pandas/core/indexes/base.py", line 2893, in get_loc
raise KeyError(key) from err
KeyError: 'start'
Looks like the chromosome sizes have mixed space-and-tabs. I think changing your awk command to this will fix it:
awk -v OFS='\t' '$1 = $1 OFS "0"' ./reference/hg38.chrom.sizes > ./reference/hg38.chrom.sizes.bed
So, I've updated the script: ``
cd ~/ABC-Enhancer-Gene-Prediction/ cd lung mkdir DNase cd DNase wget https://www.encodeproject.org/files/ENCFF023JFG/@@download/ENCFF023JFG.bam #sort and index samtools sort -l 1 -m 7G -o IMR90_DNASE_ENCFF023JFG.sorted.bam -O bam -@ 32 ENCFF023JFG.bam samtools index -@ 32 IMR90_DNASE_ENCFF023JFG.sorted.bam cd .. mkdir acetyl cd acetyl wget https://www.encodeproject.org/files/ENCFF146UYU/@@download/ENCFF146UYU.bam wget https://www.encodeproject.org/files/ENCFF071VOI/@@download/ENCFF071VOI.bam samtools merge IMR90_H3K27ME3_reps_1_2.bam *.bam -@ 6 samtools index IMR90_H3K27ME3_reps_1_2.bam -@ 6 cd .. cd ..
mkdir -p ./lung/ABC_output/Peaks/
cd reference
wget http://hgdownload.soe.ucsc.edu/goldenPath/hg38/bigZips/hg38.chrom.sizes
wget https://www.encodeproject.org/files/ENCFF356LFX/@@download/ENCFF356LFX.bed.gz
wget http://hgdownload.soe.ucsc.edu/admin/exe/linux.x86_64/liftOver
wget http://hgdownload.cse.ucsc.edu/goldenPath/hg19/liftOver/hg19ToHg38.over.chain.gz
gunzip .
mv ENCFF356LFX.bed hg38ENCODEblacklist.bed
./liftOver RefSeqCurated.170308.bed.CollapsedGeneBounds.TSS500bp.bed hg19ToHg38.over.chain RefSeqCurated.170308.bed.CollapsedGeneBounds.TSS500bp.hg38.bed RefSeqCurated.170308.bed.CollapsedGeneBounds.TSS500bp.hg19tohg38unmapped.bed
macs2 callpeak
-t lung/DNase/IMR90_DNASE_ENCFF023JFG.sorted.bam
-n IMR90_DNASE_ENCFF023JFG.macs2
-f BAM
-g hs
-p .1
--call-summits
--outdir lung/ABC_output/Peaks/
python3 src/makeCandidateRegions.py
--narrowPeak lung/ABC_output/Peaks/IMR90_DNASE_ENCFF023JFG.macs2_peaks.narrowPeak
--bam lung/DNase/IMR90_DNASE_ENCFF023JFG.sorted.bam
--outDir lung/ABC_output/Peaks/
--chrom_sizes reference/hg38.chrom.sizes
--regions_blacklist reference/hg38ENCODEblacklist.bed
--regions_whitelist reference/RefSeqCurated.170308.bed.CollapsedGeneBounds.TSS500bp.hg38.bed
--peakExtendFromSummit 250
--nStrongestPeaks 150000
./reference/liftOver ./reference/RefSeqCurated.170308.bed.CollapsedGeneBounds.bed ./reference/hg19ToHg38.over.chain ./reference/RefSeqCurated.170308.bed.CollapsedGeneBounds.hg38.bed ./reference/RefSeqCurated.170308.bed.CollapsedGeneBounds.hg19tohg38unmapped.bed
mkdir -p ./lung/ABC_output/Neighborhoods/
mkdir -p ./lung/rnaseq
cd ./lung/rnaseq
wget https://www.encodeproject.org/files/ENCFF833OTW/@@download/ENCFF833OTW.tsv
wget https://www.encodeproject.org/documents/a4a6caad-61ab-4d35-820a-409cadce1121/@@download/attachment/RSEM_quantifications_specifications.txt
awk -v OFS='\t' '$1 = $1 OFS "0"' ./reference/hg38.chrom.sizes > ./reference/hg38.chrom.sizes.bed
cd ../..
python3 src/run.neighborhoods.py
--candidate_enhancer_regions lung/ABC_output/Peaks/IMR90_DNASE_ENCFF023JFG.macs2_peaks.narrowPeak.candidateRegions.bed
--genes reference/RefSeqCurated.170308.bed.CollapsedGeneBounds.hg38.bed
--H3K27ac lung/acetyl/IMR90_H3K27ME3_reps_1_2.bam
--DHS lung/DNase/IMR90_DNASE_ENCFF023JFG.sorted.bam
--chrom_sizes reference/hg38.chrom.sizes
--ubiquitously_expressed_genes reference/UbiquitouslyExpressedGenesHG19.txt
--cellType IMR90
--outdir lung/ABC_output/Neighborhoods/
``
This yields the error: ~/ABC-Enhancer-Gene-Prediction$ python3 src/run.neighborhoods.py \
--candidate_enhancer_regions lung/ABC_output/Peaks/IMR90_DNASE_ENCFF023JFG.macs2_peaks.narrowPeak.candidateRegions.bed
--genes reference/RefSeqCurated.170308.bed.CollapsedGeneBounds.hg38.bed
--H3K27ac lung/acetyl/IMR90_H3K27ME3_reps_1_2.bam
--DHS lung/DNase/IMR90_DNASE_ENCFF023JFG.sorted.bam
--chrom_sizes reference/hg38.chrom.sizes
--ubiquitously_expressed_genes reference/UbiquitouslyExpressedGenesHG19.txt
--cellType IMR90
--outdir lung/ABC_output/Neighborhoods/ Namespace(ATAC='', DHS='lung/DNase/IMR90_DNASE_ENCFF023JFG.sorted.bam', H3K27ac='lung/acetyl/IMR90_H3K27ME3_reps_1_2.bam', candidate_enhancer_regions='lung/ABC_output/Peaks/IMR90_DNASE_ENCFF023JFG.macs2_peaks.narrowPeak.candidateRegions.bed', cellType='IMR90', chrom_sizes='reference/hg38.chrom.sizes', default_accessibility_feature=None, enhancer_class_override=None, expression_table='', gene_name_annotations='symbol', genes='reference/RefSeqCurated.170308.bed.CollapsedGeneBounds.hg38.bed', genes_for_class_assignment=None, outdir='lung/ABC_output/Neighborhoods/', primary_gene_identifier='symbol', qnorm=None, skip_gene_counts=False, skip_rpkm_quantile=False, supplementary_features=None, tss_slop_for_class_assignment=500, ubiquitously_expressed_genes='reference/UbiquitouslyExpressedGenesHG19.txt', use_secondary_counting_method=False) Running command: bedtools sort -faidx reference/hg38.chrom.sizes -i lung/ABC_output/Neighborhoods/GeneList.TSS1kb.bed > lung/ABC_output/Neighborhoods/GeneList.TSS1kb.bed.sorted; mv lung/ABC_output/Neighborhoods/GeneList.TSS1kb.bed.sorted lung/ABC_output/Neighborhoods/GeneList.TSS1kb.bed Loading coverage from pre-calculated file for Genes.H3K27ac.IMR90_H3K27ME3_reps_1_2.bam Traceback (most recent call last): File "src/run.neighborhoods.py", line 97, inmain(args) File "src/run.neighborhoods.py", line 93, in main processCellType(args) File "src/run.neighborhoods.py", line 69, in processCellType annotate_genes_with_features(genes = genes, File "/home/tardigrade/ABC-Enhancer-Gene-Prediction/src/neighborhoods.py", line 89, in annotate_genes_with_features genes = count_features_for_bed(genes, bounds_bed, genome_sizes, features, outdir, "Genes", force=force, use_fast_count=use_fast_count) File "/home/tardigrade/ABC-Enhancer-Gene-Prediction/src/neighborhoods.py", line 415, in count_features_for_bed df = count_single_feature_for_bed(df, bed_file, genome_sizes, feature_bam, feature, directory, filebase, skip_rpkm_quantile, force, use_fast_count) File "/home/tardigrade/ABC-Enhancer-Gene-Prediction/src/neighborhoods.py", line 448, in count_single_feature_for_bed assert df.shape[0] == orig_shape, "Dimension mismatch" AssertionError: Dimension mismatch
Looks like the chromosome sizes have mixed space-and-tabs. I think changing your awk command to this will fix it:
awk -v OFS='\t' '$1 = $1 OFS "0"' ./reference/hg38.chrom.sizes > ./reference/hg38.chrom.sizes.bed
Thanks for catching that. Seems like it helped. Now at least I get a new error! :)
It seems my error now resolves to be perhaps identical to issue #31
Could you try checking out the use_pysam
branch and see if that fixes this problem? It did in my copy of your script and data.
You'll need to pip install pysam, if you haven't already.
We've revamped the codebase. Please check out https://github.com/broadinstitute/ABC-Enhancer-Gene-Prediction/tree/main and reopen your issue if it still exists