ABC-Enhancer-Gene-Prediction KeyError(key) in run.neighborhoods.py

So, all of the data I'm using is public. I'm trying to create customized ABC regions for lung data. Here is the error I encounter, along with the script I used to generate it:


cd ~/ABC-Enhancer-Gene-Prediction/
cd lung
mkdir DNase
cd DNase
wget https://www.encodeproject.org/files/ENCFF023JFG/@@download/ENCFF023JFG.bam
#sort and index
samtools sort -l 1 -m 7G -o IMR90_DNASE_ENCFF023JFG.sorted.bam -O bam -@ 32 ENCFF023JFG.bam
samtools index -@ 32 IMR90_DNASE_ENCFF023JFG.sorted.bam
cd ..
mkdir acetyl
cd acetyl
wget https://www.encodeproject.org/files/ENCFF146UYU/@@download/ENCFF146UYU.bam
wget https://www.encodeproject.org/files/ENCFF071VOI/@@download/ENCFF071VOI.bam
samtools merge IMR90_H3K27ME3_reps_1_2.bam *.bam
cd ..
cd ..

mkdir -p ./lung/ABC_output/Peaks/
cd reference
wget http://hgdownload.soe.ucsc.edu/goldenPath/hg38/bigZips/hg38.chrom.sizes
wget https://www.encodeproject.org/files/ENCFF356LFX/@@download/ENCFF356LFX.bed.gz
wget http://hgdownload.soe.ucsc.edu/admin/exe/linux.x86_64/liftOver
wget http://hgdownload.cse.ucsc.edu/goldenPath/hg19/liftOver/hg19ToHg38.over.chain.gz
gunzip *.*
mv ENCFF356LFX.bed hg38ENCODEblacklist.bed
./liftOver RefSeqCurated.170308.bed.CollapsedGeneBounds.TSS500bp.bed hg19ToHg38.over.chain RefSeqCurated.170308.bed.CollapsedGeneBounds.TSS500bp.hg38.bed RefSeqCurated.170308.bed.CollapsedGeneBounds.TSS500bp.hg19tohg38unmapped.bed
macs2 callpeak \
-t lung/DNase/IMR90_DNASE_ENCFF023JFG.sorted.bam \
-n IMR90_DNASE_ENCFF023JFG.macs2 \
-f BAM \
-g hs \
-p .1 \
--call-summits \
--outdir lung/ABC_output/Peaks/
python3 src/makeCandidateRegions.py \
--narrowPeak lung/ABC_output/Peaks/IMR90_DNASE_ENCFF023JFG.macs2_peaks.narrowPeak \
--bam lung/DNase/IMR90_DNASE_ENCFF023JFG.sorted.bam \
--outDir lung/ABC_output/Peaks/ \
--chrom_sizes reference/hg38.chrom.sizes \
--regions_blacklist reference/hg38ENCODEblacklist.bed \
--regions_whitelist reference/RefSeqCurated.170308.bed.CollapsedGeneBounds.TSS500bp.hg38.bed \
--peakExtendFromSummit 250 \
--nStrongestPeaks 150000 

./reference/liftOver  ./reference/RefSeqCurated.170308.bed.CollapsedGeneBounds.bed ./reference/hg19ToHg38.over.chain ./reference/RefSeqCurated.170308.bed.CollapsedGeneBounds.hg38.bed ./reference/RefSeqCurated.170308.bed.CollapsedGeneBounds.hg19tohg38unmapped.bed
mkdir -p ./lung/ABC_output/Neighborhoods/
mkdir -p ./lung/rnaseq
cd ./lung/rnaseq
wget https://www.encodeproject.org/files/ENCFF833OTW/@@download/ENCFF833OTW.tsv
wget https://www.encodeproject.org/documents/a4a6caad-61ab-4d35-820a-409cadce1121/@@download/attachment/RSEM_quantifications_specifications.txt
awk '$1 = $1 FS "0"' ./reference/hg38.chrom.sizes > ./reference/hg38.chrom.sizes.bed
cd ../..
python3 src/run.neighborhoods.py \
--candidate_enhancer_regions lung/ABC_output/Peaks/IMR90_DNASE_ENCFF023JFG.macs2_peaks.narrowPeak.candidateRegions.bed \
--genes reference/RefSeqCurated.170308.bed.CollapsedGeneBounds.hg38.bed \
--H3K27ac lung/acetyl/IMR90_H3K27ME3_reps_1_2.bam \
--DHS lung/DNase/IMR90_DNASE_ENCFF023JFG.sorted.bam \
--chrom_sizes reference/hg38.chrom.sizes \
--ubiquitously_expressed_genes reference/UbiquitouslyExpressedGenesHG19.txt \
--cellType IMR90 \
--outdir lung/ABC_output/Neighborhoods/

Error:

~/ABC-Enhancer-Gene-Prediction$ python3 src/run.neighborhoods.py \
> --candidate_enhancer_regions lung/ABC_output/Peaks/IMR90_DNASE_ENCFF023JFG.macs2_peaks.narrowPeak.candidateRegions.bed \
> --genes reference/RefSeqCurated.170308.bed.CollapsedGeneBounds.hg38.bed \
> --H3K27ac lung/acetyl/IMR90_H3K27ME3_reps_1_2.bam \
> --DHS lung/DNase/IMR90_DNASE_ENCFF023JFG.sorted.bam \
> --chrom_sizes reference/hg38.chrom.sizes \
> --ubiquitously_expressed_genes reference/UbiquitouslyExpressedGenesHG19.txt \
> --cellType IMR90 \
> --outdir lung/ABC_output/Neighborhoods/
Namespace(ATAC='', DHS='lung/DNase/IMR90_DNASE_ENCFF023JFG.sorted.bam', H3K27ac='lung/acetyl/IMR90_H3K27ME3_reps_1_2.bam', candidate_enhancer_regions='lung/ABC_output/Peaks/IMR90_DNASE_ENCFF023JFG.macs2_peaks.narrowPeak.candidateRegions.bed', cellType='IMR90', chrom_sizes='reference/hg38.chrom.sizes', default_accessibility_feature=None, enhancer_class_override=None, expression_table='', gene_name_annotations='symbol', genes='reference/RefSeqCurated.170308.bed.CollapsedGeneBounds.hg38.bed', genes_for_class_assignment=None, outdir='lung/ABC_output/Neighborhoods/', primary_gene_identifier='symbol', qnorm=None, skip_gene_counts=False, skip_rpkm_quantile=False, supplementary_features=None, tss_slop_for_class_assignment=500, ubiquitously_expressed_genes='reference/UbiquitouslyExpressedGenesHG19.txt', use_secondary_counting_method=False)
Traceback (most recent call last):
  File "/home/ubuntu/.local/lib/python3.6/site-packages/pandas/core/indexes/base.py", line 2891, in get_loc
    return self._engine.get_loc(casted_key)
  File "pandas/_libs/index.pyx", line 70, in pandas._libs.index.IndexEngine.get_loc
  File "pandas/_libs/index.pyx", line 101, in pandas._libs.index.IndexEngine.get_loc
  File "pandas/_libs/hashtable_class_helper.pxi", line 1675, in pandas._libs.hashtable.PyObjectHashTable.get_item
  File "pandas/_libs/hashtable_class_helper.pxi", line 1683, in pandas._libs.hashtable.PyObjectHashTable.get_item
KeyError: 'start'

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "src/run.neighborhoods.py", line 97, in <module>
    main(args)
  File "src/run.neighborhoods.py", line 93, in main
    processCellType(args)
  File "src/run.neighborhoods.py", line 74, in processCellType
    outdir = args.outdir)
  File "/home/ubuntu/ABC-Enhancer-Gene-Prediction/src/neighborhoods.py", line 85, in annotate_genes_with_features
    tss1kb = make_tss_region_file(genes, outdir, genome_sizes)
  File "/home/ubuntu/ABC-Enhancer-Gene-Prediction/src/neighborhoods.py", line 111, in make_tss_region_file
    sizes_pr = df_to_pyranges(read_bed(sizes + '.bed'))
  File "/home/ubuntu/ABC-Enhancer-Gene-Prediction/src/tools.py", line 53, in df_to_pyranges
    df['Start'] = df[start_col] - start_slop
  File "/home/ubuntu/.local/lib/python3.6/site-packages/pandas/core/frame.py", line 2902, in __getitem__
    indexer = self.columns.get_loc(key)
  File "/home/ubuntu/.local/lib/python3.6/site-packages/pandas/core/indexes/base.py", line 2893, in get_loc
    raise KeyError(key) from err
KeyError: 'start'

Sep 10 '20 11:09 jamesdalg

Looks like the chromosome sizes have mixed space-and-tabs. I think changing your awk command to this will fix it: awk -v OFS='\t' '$1 = $1 OFS "0"' ./reference/hg38.chrom.sizes > ./reference/hg38.chrom.sizes.bed

Sep 15 '20 14:09 thouis

So, I've updated the script: ``

cd ~/ABC-Enhancer-Gene-Prediction/ cd lung mkdir DNase cd DNase wget https://www.encodeproject.org/files/ENCFF023JFG/@@download/ENCFF023JFG.bam #sort and index samtools sort -l 1 -m 7G -o IMR90_DNASE_ENCFF023JFG.sorted.bam -O bam -@ 32 ENCFF023JFG.bam samtools index -@ 32 IMR90_DNASE_ENCFF023JFG.sorted.bam cd .. mkdir acetyl cd acetyl wget https://www.encodeproject.org/files/ENCFF146UYU/@@download/ENCFF146UYU.bam wget https://www.encodeproject.org/files/ENCFF071VOI/@@download/ENCFF071VOI.bam samtools merge IMR90_H3K27ME3_reps_1_2.bam *.bam -@ 6 samtools index IMR90_H3K27ME3_reps_1_2.bam -@ 6 cd .. cd ..

mkdir -p ./lung/ABC_output/Peaks/ cd reference wget http://hgdownload.soe.ucsc.edu/goldenPath/hg38/bigZips/hg38.chrom.sizes wget https://www.encodeproject.org/files/ENCFF356LFX/@@download/ENCFF356LFX.bed.gz wget http://hgdownload.soe.ucsc.edu/admin/exe/linux.x86_64/liftOver wget http://hgdownload.cse.ucsc.edu/goldenPath/hg19/liftOver/hg19ToHg38.over.chain.gz gunzip . mv ENCFF356LFX.bed hg38ENCODEblacklist.bed ./liftOver RefSeqCurated.170308.bed.CollapsedGeneBounds.TSS500bp.bed hg19ToHg38.over.chain RefSeqCurated.170308.bed.CollapsedGeneBounds.TSS500bp.hg38.bed RefSeqCurated.170308.bed.CollapsedGeneBounds.TSS500bp.hg19tohg38unmapped.bed macs2 callpeak
-t lung/DNase/IMR90_DNASE_ENCFF023JFG.sorted.bam
-n IMR90_DNASE_ENCFF023JFG.macs2
-f BAM
-g hs
-p .1
--call-summits
--outdir lung/ABC_output/Peaks/ python3 src/makeCandidateRegions.py
--narrowPeak lung/ABC_output/Peaks/IMR90_DNASE_ENCFF023JFG.macs2_peaks.narrowPeak
--bam lung/DNase/IMR90_DNASE_ENCFF023JFG.sorted.bam
--outDir lung/ABC_output/Peaks/
--chrom_sizes reference/hg38.chrom.sizes
--regions_blacklist reference/hg38ENCODEblacklist.bed
--regions_whitelist reference/RefSeqCurated.170308.bed.CollapsedGeneBounds.TSS500bp.hg38.bed
--peakExtendFromSummit 250
--nStrongestPeaks 150000

./reference/liftOver ./reference/RefSeqCurated.170308.bed.CollapsedGeneBounds.bed ./reference/hg19ToHg38.over.chain ./reference/RefSeqCurated.170308.bed.CollapsedGeneBounds.hg38.bed ./reference/RefSeqCurated.170308.bed.CollapsedGeneBounds.hg19tohg38unmapped.bed mkdir -p ./lung/ABC_output/Neighborhoods/ mkdir -p ./lung/rnaseq cd ./lung/rnaseq wget https://www.encodeproject.org/files/ENCFF833OTW/@@download/ENCFF833OTW.tsv wget https://www.encodeproject.org/documents/a4a6caad-61ab-4d35-820a-409cadce1121/@@download/attachment/RSEM_quantifications_specifications.txt awk -v OFS='\t' '$1 = $1 OFS "0"' ./reference/hg38.chrom.sizes > ./reference/hg38.chrom.sizes.bed cd ../.. python3 src/run.neighborhoods.py
--candidate_enhancer_regions lung/ABC_output/Peaks/IMR90_DNASE_ENCFF023JFG.macs2_peaks.narrowPeak.candidateRegions.bed
--genes reference/RefSeqCurated.170308.bed.CollapsedGeneBounds.hg38.bed
--H3K27ac lung/acetyl/IMR90_H3K27ME3_reps_1_2.bam
--DHS lung/DNase/IMR90_DNASE_ENCFF023JFG.sorted.bam
--chrom_sizes reference/hg38.chrom.sizes
--ubiquitously_expressed_genes reference/UbiquitouslyExpressedGenesHG19.txt
--cellType IMR90
--outdir lung/ABC_output/Neighborhoods/ ``

This yields the error: ~/ABC-Enhancer-Gene-Prediction$ python3 src/run.neighborhoods.py \

--candidate_enhancer_regions lung/ABC_output/Peaks/IMR90_DNASE_ENCFF023JFG.macs2_peaks.narrowPeak.candidateRegions.bed
--genes reference/RefSeqCurated.170308.bed.CollapsedGeneBounds.hg38.bed
--H3K27ac lung/acetyl/IMR90_H3K27ME3_reps_1_2.bam
--DHS lung/DNase/IMR90_DNASE_ENCFF023JFG.sorted.bam
--chrom_sizes reference/hg38.chrom.sizes
--ubiquitously_expressed_genes reference/UbiquitouslyExpressedGenesHG19.txt
--cellType IMR90
--outdir lung/ABC_output/Neighborhoods/ Namespace(ATAC='', DHS='lung/DNase/IMR90_DNASE_ENCFF023JFG.sorted.bam', H3K27ac='lung/acetyl/IMR90_H3K27ME3_reps_1_2.bam', candidate_enhancer_regions='lung/ABC_output/Peaks/IMR90_DNASE_ENCFF023JFG.macs2_peaks.narrowPeak.candidateRegions.bed', cellType='IMR90', chrom_sizes='reference/hg38.chrom.sizes', default_accessibility_feature=None, enhancer_class_override=None, expression_table='', gene_name_annotations='symbol', genes='reference/RefSeqCurated.170308.bed.CollapsedGeneBounds.hg38.bed', genes_for_class_assignment=None, outdir='lung/ABC_output/Neighborhoods/', primary_gene_identifier='symbol', qnorm=None, skip_gene_counts=False, skip_rpkm_quantile=False, supplementary_features=None, tss_slop_for_class_assignment=500, ubiquitously_expressed_genes='reference/UbiquitouslyExpressedGenesHG19.txt', use_secondary_counting_method=False) Running command: bedtools sort -faidx reference/hg38.chrom.sizes -i lung/ABC_output/Neighborhoods/GeneList.TSS1kb.bed > lung/ABC_output/Neighborhoods/GeneList.TSS1kb.bed.sorted; mv lung/ABC_output/Neighborhoods/GeneList.TSS1kb.bed.sorted lung/ABC_output/Neighborhoods/GeneList.TSS1kb.bed Loading coverage from pre-calculated file for Genes.H3K27ac.IMR90_H3K27ME3_reps_1_2.bam Traceback (most recent call last): File "src/run.neighborhoods.py", line 97, in main(args) File "src/run.neighborhoods.py", line 93, in main processCellType(args) File "src/run.neighborhoods.py", line 69, in processCellType annotate_genes_with_features(genes = genes, File "/home/tardigrade/ABC-Enhancer-Gene-Prediction/src/neighborhoods.py", line 89, in annotate_genes_with_features genes = count_features_for_bed(genes, bounds_bed, genome_sizes, features, outdir, "Genes", force=force, use_fast_count=use_fast_count) File "/home/tardigrade/ABC-Enhancer-Gene-Prediction/src/neighborhoods.py", line 415, in count_features_for_bed df = count_single_feature_for_bed(df, bed_file, genome_sizes, feature_bam, feature, directory, filebase, skip_rpkm_quantile, force, use_fast_count) File "/home/tardigrade/ABC-Enhancer-Gene-Prediction/src/neighborhoods.py", line 448, in count_single_feature_for_bed assert df.shape[0] == orig_shape, "Dimension mismatch" AssertionError: Dimension mismatch

Sep 17 '20 02:09 jamesdalg

Looks like the chromosome sizes have mixed space-and-tabs. I think changing your awk command to this will fix it: awk -v OFS='\t' '$1 = $1 OFS "0"' ./reference/hg38.chrom.sizes > ./reference/hg38.chrom.sizes.bed

Thanks for catching that. Seems like it helped. Now at least I get a new error! :)

It seems my error now resolves to be perhaps identical to issue #31

Sep 17 '20 02:09 jamesdalg

Could you try checking out the use_pysam branch and see if that fixes this problem? It did in my copy of your script and data.

You'll need to pip install pysam, if you haven't already.

Sep 17 '20 13:09 thouis

We've revamped the codebase. Please check out https://github.com/broadinstitute/ABC-Enhancer-Gene-Prediction/tree/main and reopen your issue if it still exists

Dec 07 '23 23:12 atancoder

ABC-Enhancer-Gene-Prediction ABC-Enhancer-Gene-Prediction copied to clipboard

KeyError(key) in run.neighborhoods.py

ABC-Enhancer-Gene-Prediction
ABC-Enhancer-Gene-Prediction copied to clipboard