cooler
BadInputError: Found a bin ID that exceeds the declared number of bins
Hi, I'm having trouble loading a COO file from here: GSE116694.
I downloaded GSM3258549_SW480_40Kb.fullMatrix.txt.gz and tried with this command:
cooler load -f coo --field count:dtype=float GSE116694_hg19_40kb.txt GSM3258549_SW480_40Kb.fullMatrix.txt GSM3258549_SW480_40Kb.cool
But I got the error shown in the title.
I also tried to build my own chromsizes.tsv (attached) that matches the data in the bed file GSE116694_hg19_40kb.txt:
cooler load -f coo --field count:dtype=float chromsizes.tsv:40000 GSM3258549_SW480_40Kb.fullMatrix.txt GSM3258549_SW480_40Kb.cool
And I got the same error until I discovered that if I add a new chromosome to the chromsizes.tsv file like this:
...
chrX 155270560
chrY 59373566
chrM 16571
fake 1
It works. What's happening here? Thank you.
Versions: Python 3.7.0 [GCC 6.4.0] on linux; cooler, version 0.8.11
Attachment: chromsizes.txt
Perhaps the bin IDs they used are 1-based instead of 0-based. Does this work using the original chromsizes?
cooler load -f coo --one-based --field count:dtype=float chromsizes.tsv:40000 GSM3258549_SW480_40Kb.fullMatrix.txt GSM3258549_SW480_40Kb.cool
No, same result, but thanks for the help. I continued to work with the added fake chromosome, as the results seem fine, so it's not an urgent issue anymore, even though I would still like to know the cause.
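A caveat on the fake-chromosome workaround, as a minimal sketch rather than a description of cooler's internals: if the bin IDs in the file turn out to be 1-based, reading them as 0-based would silently assign each pixel to the next bin over. The toy example below uses a hypothetical two-chromosome genome and hypothetical helpers (`make_bins`, `locate`) that are not part of cooler.

```python
def make_bins(chromsizes, binsize):
    """Build a list of (chrom, start, end) bins, binning each chromosome separately."""
    bins = []
    for chrom, length in chromsizes.items():
        for start in range(0, length, binsize):
            bins.append((chrom, start, min(start + binsize, length)))
    return bins


def locate(bins, bin_id, one_based=False):
    """Map a bin ID from the file to a genomic interval under the chosen convention."""
    index = bin_id - 1 if one_based else bin_id
    if not 0 <= index < len(bins):
        # Analogous to a bin ID exceeding the declared number of bins.
        raise IndexError(f"bin ID {bin_id} is outside the {len(bins)} declared bins")
    return bins[index]


# Hypothetical toy genome: 5 bins in total at a bin size of 40.
toy_sizes = {"chrA": 100, "chrB": 45}
toy_bins = make_bins(toy_sizes, 40)

# The same ID lands on different intervals depending on the convention.
print(locate(toy_bins, 3, one_based=True))
print(locate(toy_bins, 3, one_based=False))

# Appending a 1 bp "fake" chromosome adds one bin, so the largest 1-based ID (5)
# becomes a valid 0-based index, but it then points at the fake bin.
padded_bins = make_bins({**toy_sizes, "fake": 1}, 40)
print(locate(padded_bins, 5, one_based=False))
```

In this toy case the padded table accepts every ID, yet each pixel is assigned one bin to the right of where a 1-based reading would put it, so a result that "seems fine" may still be offset.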
As a follow-up to close this: I was able to load the COO matrix from the exact input file you used by passing --one-based. No need for the fake chromosome.
# generate chromsizes
python -c 'import bioframe as bf; bf.assembly_info("hg19").chromsizes.to_csv("hg19.chrom.sizes", sep="\t", header=False)'
# load the matrix
cooler load -f coo --one-based --field count:dtype=float hg19.chrom.sizes:40000 GSM3258549_SW480_40Kb.fullMatrix.txt.gz test.40000.cool
Binning hg19 at 40kb produces 77404 bins. With 0-based IDs, the largest bin ID would be 77403, but a simple search shows that the largest bin ID in GSM3258549_SW480_40Kb.fullMatrix.txt.gz is 77404 (corresponding to all of chrM), which triggered the original error, so the bin IDs must be encoded as 1-based.
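The check described above can be sketched in Python. This is only an illustration under assumptions: the COO file is whitespace-delimited with integer bin IDs in its first two columns and no header line, and `count_bins`, `id_range`, and `guess_base` are hypothetical helper names, not cooler functions.

```python
import gzip


def count_bins(chromsizes, binsize):
    """Total number of fixed-width bins when each chromosome is binned separately."""
    # -(-a // b) is integer ceiling division: the last bin of a chromosome may be short.
    return sum(-(-length // binsize) for length in chromsizes.values())


def id_range(path):
    """Return (smallest, largest) bin ID found in the first two columns of a COO file."""
    opener = gzip.open if str(path).endswith(".gz") else open
    lo = hi = None
    with opener(path, "rt") as handle:
        for line in handle:
            fields = line.split()
            if len(fields) < 2:
                continue  # skip blank or malformed lines
            for value in (int(fields[0]), int(fields[1])):
                lo = value if lo is None else min(lo, value)
                hi = value if hi is None else max(hi, value)
    return lo, hi


def guess_base(n_bins, lo, hi):
    """Guess the indexing convention from the observed ID range."""
    if hi > n_bins:
        return "mismatch"    # IDs do not fit this bin table under either convention
    if hi == n_bins and lo >= 1:
        return "one-based"   # the largest ID equals the bin count
    if lo == 0:
        return "zero-based"  # an ID of 0 can only be 0-based
    return "ambiguous"       # e.g. a sparse file that touches neither end
```

With the hg19 chromsizes and a bin size of 40000, `count_bins` should agree with the 77404 bins mentioned above, and an `id_range` maximum equal to that count points to 1-based IDs.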