BadInputError: Found a bin ID that exceeds the declared number of bins

Open iagooteroc opened this issue 4 years ago • 2 comments

Hi, I'm having trouble loading a COO file from here: GSE116694. I downloaded GSM3258549_SW480_40Kb.fullMatrix.txt.gz and tried with this command:

cooler load -f coo --field count:dtype=float GSE116694_hg19_40kb.txt GSM3258549_SW480_40Kb.fullMatrix.txt GSM3258549_SW480_40Kb.cool

But got what the title says.

I also tried to build my own chromsizes.tsv (attached) that matches the data in the bed file GSE116694_hg19_40kb.txt:

cooler load -f coo --field count:dtype=float chromsizes.tsv:40000 GSM3258549_SW480_40Kb.fullMatrix.txt GSM3258549_SW480_40Kb.cool

And I got the same error until I discovered that if I add a new chromosome to the chromsizes.tsv file like this:

...
chrX    155270560
chrY    59373566
chrM    16571
fake    1

It works. What's happening here? Thank you.

Versions:
Python 3.7.0 [GCC 6.4.0] on linux
cooler, version 0.8.11

Attachment: chromsizes.txt

iagooteroc avatar Apr 16 '21 11:04 iagooteroc

Perhaps the bin IDs they used are 1-based instead of 0-based. Does this work using the original chromsizes?

cooler load -f coo --one-based --field count:dtype=float chromsizes.tsv:40000 GSM3258549_SW480_40Kb.fullMatrix.txt GSM3258549_SW480_40Kb.cool

nvictus avatar Apr 19 '21 16:04 nvictus

No, same result, but thanks for the help. I continued to work with the added fake chromosome since the results seem fine, so it's no longer an urgent issue, although I would still like to know the cause.

iagooteroc avatar Apr 21 '21 08:04 iagooteroc

As a follow-up to close this: I was able to load the COO matrix successfully from the exact input file you used by passing --one-based. No need for the fake chromosome.

# generate chromsizes
python -c 'import bioframe as bf; bf.assembly_info("hg19").chromsizes.to_csv("hg19.chrom.sizes", sep="\t", header=False)'
# load the matrix
cooler load -f coo --one-based --field count:dtype=float hg19.chrom.sizes:40000 GSM3258549_SW480_40Kb.fullMatrix.txt.gz test.40000.cool

Binning hg19 at 40kb produces 77404 bins. With 0-based IDs, the largest bin ID would be 77403, but a simple search shows that the largest bin ID in GSM3258549_SW480_40Kb.fullMatrix.txt.gz is 77404 (corresponding to all of chrM), which triggered the original error, so the bin IDs must be encoded as 1-based.
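
For anyone who wants to repeat that check on another dataset, here is a minimal Python sketch of the same reasoning. It is not part of cooler: the helper names (count_bins, scan_coo_ids, guess_indexing) are made up for illustration, and it assumes the COO text file is whitespace-delimited with the two bin IDs in the first two columns.

```python
import math


def count_bins(chromsizes, binsize):
    """Total number of fixed-width bins; each chromosome contributes
    ceil(length / binsize) bins, so a short final bin still counts."""
    return sum(math.ceil(length / binsize) for length in chromsizes.values())


def scan_coo_ids(lines):
    """Return (min_id, max_id) over the first two columns of COO text lines.

    Assumption: whitespace-delimited rows of the form 'bin1 bin2 count'.
    """
    lo = hi = None
    for line in lines:
        fields = line.split()
        if len(fields) < 2:
            continue  # skip blank or malformed lines
        for bin_id in (int(fields[0]), int(fields[1])):
            lo = bin_id if lo is None else min(lo, bin_id)
            hi = bin_id if hi is None else max(hi, bin_id)
    return lo, hi


def guess_indexing(min_id, max_id, n_bins):
    """Classify the observed ID range against the declared number of bins."""
    if min_id >= 1 and max_id == n_bins:
        # The largest ID equals the bin count, which cannot be 0-based.
        return "one-based"
    if min_id >= 0 and max_id < n_bins:
        # Fits 0-based (could still be 1-based if the last bin is empty).
        return "zero-based"
    return "out of range"


# Toy example: two chromosomes of 100 bp and 45 bp at a 40 bp bin size.
toy_sizes = {"chrA": 100, "chrB": 45}
n_bins = count_bins(toy_sizes, 40)
min_id, max_id = scan_coo_ids(["1 2 3.5", "2 5 1.0"])
print(n_bins, min_id, max_id, guess_indexing(min_id, max_id, n_bins))
```

On the real data you would pass the lines of gzip.open("GSM3258549_SW480_40Kb.fullMatrix.txt.gz", "rt") to scan_coo_ids and the hg19 chromsizes with a bin size of 40000 to count_bins. The same arithmetic also explains why the fake 1 bp chromosome made the error go away: it adds one extra bin, so the largest ID fits under 0-based indexing. If the IDs really are 1-based, that workaround would leave every pixel shifted by one bin.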

manzt avatar Feb 06 '24 17:02 manzt