cooler Improve chromsizes File Validation to Catch Formatting Errors Early

Fixes: #209

Original Issues: #142 & #124

Related Issues:

#142: Cryptic error message when chromsizes is not formatted properly
#124: Spaces instead of tabs in chromsizes file

Overview

This pull request improves the read_chromsizes function so that formatting errors in chromsizes files are caught early and reported with clear, actionable messages. Previously, problems such as spaces instead of tabs, hidden characters, or malformed rows could slip through and cause confusing downstream errors (e.g., ValueError: cannot convert float NaN to integer). The function now validates the file up front: it must be tab-delimited, have exactly two columns, and contain valid integer lengths.

What Was Happening Before?

The Problem: Chromsizes files with formatting issues (e.g., spaces instead of tabs, extra columns, or non-numeric lengths) were parsed silently by pandas.read_csv. This led to NaN values in the length column, which crashed later steps like binning with vague errors.

Example Error:

cooler cload pairix --nproc 9 --assembly gal5 gal5Allele.chrom.sizes:1000 MNP-DT40-1-3-3-R1-T1__gal5.nodups.pairs.gz MNP-DT40-1-3-3-R1-T1__gal5.1000.cool
Traceback (most recent call last):
  ...
  ValueError: cannot convert float NaN to integer

Why It Happened:
- The original code didn’t check the file’s format before processing. For instance, a file like this:
```
chr1 1000000   extra_column
chr2\t2000000
```
  would be misparsed. With a tab separator, a line that contains no tab (such as chr1 1000000   extra_column, or simply chr1 1000000) is read as a single field, so the whole line becomes the sequence name and the length is left as NaN.
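
The failure mode can be reproduced in a few lines. This is a minimal sketch, assuming the original parser called pandas.read_csv with sep="\t" and two column names; the exact arguments used inside cooler may differ, and the in-memory file is illustrative.

```python
import io

import pandas as pd

# A chromsizes file that uses spaces instead of tabs (illustrative content).
text = "chr1 1000000\nchr2 2000000\n"

# Assumed arguments: tab separator and two named columns.
df = pd.read_csv(io.StringIO(text), sep="\t", names=["name", "length"])

# No tab is found, so each whole line lands in "name" and "length" is NaN.
first_name = df["name"].iloc[0]
first_length_missing = bool(pd.isna(df["length"].iloc[0]))

# Converting a NaN length to an integer later on produces the cryptic error.
try:
    int(df["length"].iloc[0])
    message = None
except ValueError as exc:
    message = str(exc)
```

Nothing fails at parse time here; the error only surfaces when a later step needs integer lengths.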

What’s Changed?

This update adds proactive checks to read_chromsizes to catch these issues right away. Here’s what’s new:

Strict Tab Enforcement:
- Before reading the file, we peek at the first line. If it contains spaces, we raise an error like:
```
ValueError: Chromsizes file 'gal5Allele.chrom.sizes' uses spaces instead of tabs as delimiters. Please use tabs.
```
- This fixes #124 by ensuring only tab-separated files are accepted.
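
The first-line check described above can be sketched as follows. The helper name check_tab_delimited is hypothetical, and the code mirrors the description rather than the PR's actual implementation; only the error message is taken from the text.

```python
import os
import tempfile


def check_tab_delimited(path):
    """Raise ValueError if the first line of the file contains a space.

    Hypothetical helper that mirrors the check described in the PR text.
    """
    with open(path, "r") as f:
        first_line = f.readline()
    if " " in first_line:
        raise ValueError(
            f"Chromsizes file '{path}' uses spaces instead of tabs as "
            "delimiters. Please use tabs."
        )


def is_rejected(content):
    """Write content to a temporary file and report whether the check raises."""
    with tempfile.TemporaryDirectory() as tmp:
        path = os.path.join(tmp, "example.chrom.sizes")
        with open(path, "w") as f:
            f.write(content)
        try:
            check_tab_delimited(path)
        except ValueError:
            return True
        return False


space_file_rejected = is_rejected("chr1 1000000\nchr2 2000000\n")
tab_file_rejected = is_rejected("chr1\t1000000\nchr2\t2000000\n")
```

Because only the first line is inspected, a file whose first line is valid but whose later lines use spaces would pass this check and has to be caught by the later validation steps.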

Exact Two-Column Validation:
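- We use pandas.read_csv with on_bad_lines="error", which rejects rows with too many columns (e.g., chr1\t1000000\textra) instead of misparsing them silently. pandas only treats a line with too many fields as bad, so a row with too few columns (e.g., chr1) is not rejected at this step; it loads with a missing length and is caught by the numeric length check.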
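
As a rough illustration of what on_bad_lines="error" does, the sketch below feeds pandas a file whose second row has one field too many. The arguments are assumptions for the example, not the PR's exact call, and the sketch only covers an extra field on a row after the first.

```python
import io

import pandas as pd

# Row 2 carries one more tab-separated field than the two declared columns.
text = "chr1\t1000000\nchr2\t2000000\textra\n"

try:
    pd.read_csv(
        io.StringIO(text),
        sep="\t",
        names=["name", "length"],
        on_bad_lines="error",  # parameter available in pandas >= 1.3
    )
    raised = False
except pd.errors.ParserError:
    # pandas reports the mismatch between expected and observed field counts.
    raised = True
```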

Numeric Length Check:

After loading the file, we convert the length column to numbers with pd.to_numeric(errors="coerce"). If any values turn into NaN (e.g., due to text like allele1 or hidden characters), we raise a detailed error:

ValueError: Chromsizes file 'gal5Allele.chrom.sizes' contains missing or invalid length values. Please ensure that the file is properly formatted as tab-delimited with two columns: sequence name and integer length. Check for extraneous spaces or hidden characters. Invalid rows:
  name    length
  chrX    NaN

This fixes #142 by replacing cryptic errors with something clear and helpful.
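
A minimal sketch of the numeric check, assuming the lengths were read as strings. The helper name validate_lengths and the shortened message are hypothetical. Note that pd.to_numeric also accepts non-integer values such as 1.5, which this sketch does not reject.

```python
import pandas as pd


def validate_lengths(df, source):
    """Return a copy of df with integer lengths, or raise listing bad rows.

    Hypothetical helper; `source` is only used in the error message.
    """
    out = df.copy()
    # Anything that cannot be parsed as a number becomes NaN.
    out["length"] = pd.to_numeric(out["length"], errors="coerce")
    invalid = out[out["length"].isna()]
    if not invalid.empty:
        raise ValueError(
            f"Chromsizes file '{source}' contains missing or invalid length "
            f"values. Invalid rows:\n{invalid}"
        )
    out["length"] = out["length"].astype("int64")
    return out


good = pd.DataFrame({"name": ["chr1", "chr2"], "length": ["1000000", "2000000"]})
bad = pd.DataFrame({"name": ["chr1", "chrX"], "length": ["1000000", "allele1"]})

checked = validate_lengths(good, "good.chrom.sizes")
try:
    validate_lengths(bad, "bad.chrom.sizes")
    error_text = None
except ValueError as exc:
    error_text = str(exc)
```

Including the offending rows in the message lets the user jump straight to the broken line instead of scanning the whole file.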

How It Works Now

Good File:
```
chr1\t1000000
chr2\t2000000
```
→ Works: returns a pd.Series of lengths indexed by chromosome name.

Bad File with Spaces:
```
chr1 1000000
chr2 2000000
```
→ Fails early: ValueError: Chromsizes file uses spaces instead of tabs...

Bad File with Invalid Lengths:
```
chr1\t1000000
chr2\tnot_a_number
```
→ Fails with: ValueError: Chromsizes file contains invalid length values... Invalid rows: chr2 NaN

Bad File with Extra Columns:
```
chr1\t1000000\textra
```
→ Fails with a pandas parsing error about mismatched columns.

Benefits

Early Detection: Catches errors before they cause downstream crashes.
Clear Feedback: Tells users exactly what’s wrong and how to fix it (e.g., “use tabs”, “check for invalid lengths”).
Robustness: Handles a wider range of formatting mistakes, like spaces, hidden characters, or extra columns.

Notes

Per maintainer feedback, this PR doesn’t add a verbose option; it isn’t needed here.
Possible follow-ups (e.g., checking that lengths are positive, or sampling more than the first line for spaces) are deferred to a later change.

Testing

Tested with:
- Valid tab-delimited files.
- Files with spaces instead of tabs.
- Files with non-numeric lengths or hidden characters.
- Files with extra or missing columns.
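
The cases above could be organized as a table-driven check along these lines. This is a sketch only: is_valid_chromsizes_line is a hypothetical pure-Python stand-in for the validation rules, not cooler's read_chromsizes, and the sample lines are illustrative.

```python
def is_valid_chromsizes_line(line):
    """Hypothetical stand-in: two tab-separated fields, ASCII integer length."""
    fields = line.rstrip("\n").split("\t")
    if len(fields) != 2:
        return False
    name, length = fields
    return bool(name) and length.isascii() and length.isdigit()


# Each entry maps a label to (input line, whether it should be accepted).
cases = {
    "valid": ("chr1\t1000000\n", True),
    "spaces": ("chr1 1000000\n", False),
    "non_numeric": ("chr2\tnot_a_number\n", False),
    "hidden_char": ("chr3\t3000000\u200b\n", False),  # trailing zero-width space
    "extra_column": ("chr1\t1000000\textra\n", False),
    "missing_column": ("chr1\n", False),
}

results = {label: is_valid_chromsizes_line(line) for label, (line, _) in cases.items()}
expected = {label: want for label, (_, want) in cases.items()}
```

In a real test suite, each entry would instead be written to a temporary file and passed to read_chromsizes, with the rejected cases wrapped in an assertion that ValueError (or the pandas parsing error) is raised.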

This update makes cooler more reliable and easier to use by catching chromsizes issues upfront with clear guidance for users.

Feb 26 '25 10:02 ShigrafS

Thank you for the contribution @ShigrafS! Would you mind adding a simple unit test that confirms the exception gets raised with bad input? You can use a broken version of toy.chrom.sizes.

Feb 26 '25 19:02 nvictus

@nvictus Sure, I'll do that and let you know.

Feb 27 '25 09:02 ShigrafS

@nvictus I have added the unit test and made some minor tweaks as well. Kindly look into it.

Mar 01 '25 07:03 ShigrafS

@nvictus I've made all the required changes. Kindly look into it.

Mar 05 '25 07:03 ShigrafS

@nvictus The PR is ready to be merged.

Mar 10 '25 17:03 ShigrafS

@nvictus just flagging.

Mar 17 '25 18:03 vedatonuryilmaz

@nvictus This PR is ready to be merged. Kindly review it.

Apr 12 '25 13:04 ShigrafS
