Improve chromsizes File Validation to Catch Formatting Errors Early
Fixes: #209
Original Issues: #142 & #124
Related Issues:
- #142: Cryptic error message when chromsizes is not formatted properly
- #124: Spaces instead of tabs in chromsizes file
Overview
This pull request improves the read_chromsizes function to catch formatting errors in chromsizes files early and provide clear, actionable error messages. Previously, issues like spaces instead of tabs, hidden characters, or malformed rows could slip through, causing confusing downstream errors (e.g., ValueError: cannot convert float NaN to integer). Now, the function validates the file format upfront, ensuring it’s tab-delimited, has exactly two columns, and contains valid integer lengths—making it more robust and user-friendly.
What Was Happening Before?
- The Problem: Chromsizes files with formatting issues (e.g., spaces instead of tabs, extra columns, or non-numeric lengths) were parsed silently by pandas.read_csv. This led to NaN values in the length column, which crashed later steps like binning with vague errors.
- Example Error:
      cooler cload pairix --nproc 9 --assembly gal5 gal5Allele.chrom.sizes:1000 MNP-DT40-1-3-3-R1-T1__gal5.nodups.pairs.gz MNP-DT40-1-3-3-R1-T1__gal5.1000.cool
      Traceback (most recent call last):
      ...
      ValueError: cannot convert float NaN to integer
- Why It Happened:
  - The original code didn’t check the file’s format before processing. For instance, a file like this:
        chr1 1000000 extra_column
        chr2\t2000000
    would parse 1000000 extra_column as a single value, resulting in NaN for length. Similarly, spaces instead of tabs (e.g., chr1 1000000) caused misparsing.
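The failure mode described above can be reproduced without pandas. The sketch below assumes a parser that splits strictly on tabs and fills absent fields with NaN; `split_on_tabs` is a hypothetical stand-in written for illustration, not cooler's actual code.

```python
import math


def split_on_tabs(line, n_fields=2):
    """Split one chromsizes line on tabs, padding missing fields with NaN.

    Hypothetical stand-in for a tab-only parser; not cooler's actual code.
    """
    fields = line.rstrip("\n").split("\t")
    # Pad short rows the way a tabular parser fills absent columns.
    fields += [math.nan] * (n_fields - len(fields))
    return fields[:n_fields]


# A space-delimited row contains no tab, so it is read as a single field.
name, length = split_on_tabs("chr1 1000000\n")
print(repr(name))  # 'chr1 1000000'
print(length)      # nan

try:
    int(length)  # a later step that needs an integer length
except ValueError as exc:
    message = str(exc)
    print(message)
```

The exception only surfaces when something finally asks for an integer, which is why the original traceback pointed far away from the malformed file.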
What’s Changed?
This update adds proactive checks to read_chromsizes to catch these issues right away. Here’s what’s new:
- Strict Tab Enforcement:
  - Before reading the file, we peek at the first line. If it contains spaces, we raise an error like:
        ValueError: Chromsizes file 'gal5Allele.chrom.sizes' uses spaces instead of tabs as delimiters. Please use tabs.
  - This fixes #124 by ensuring only tab-separated files are accepted.
- Exact Two-Column Validation:
  - We use pandas.read_csv with on_bad_lines="error", which rejects files with too few or too many columns (e.g., chr1\t1000000\textra or chr1). This prevents silent misparsing.
- Numeric Length Check:
  - After loading the file, we convert the length column to numbers with pd.to_numeric(errors="coerce"). If any values turn into NaN (e.g., due to text like allele1 or hidden characters), we raise a detailed error:
        ValueError: Chromsizes file 'gal5Allele.chrom.sizes' contains missing or invalid length values. Please ensure that the file is properly formatted as tab-delimited with two columns: sequence name and integer length. Check for extraneous spaces or hidden characters. Invalid rows:
        name length
        chrX NaN
  - This fixes #142 by replacing cryptic errors with something clear and helpful.
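The three checks can be sketched in plain Python. This is a minimal illustration under stated assumptions, not the PR's implementation: it works on a list of lines instead of calling pandas, `validate_chromsizes_lines` is a hypothetical name, and its messages are modelled on the ones quoted above (the column-count message is invented for the sketch).

```python
def validate_chromsizes_lines(lines, filename="<chromsizes>"):
    """Validate chromsizes rows and return a {name: length} dict.

    Hypothetical helper illustrating the three checks; not cooler's code.
    """
    # Drop blank lines and trailing newline characters.
    rows = [ln.rstrip("\r\n") for ln in lines if ln.strip()]

    # 1. Peek at the first line: a space suggests a space-delimited file.
    if rows and " " in rows[0]:
        raise ValueError(
            f"Chromsizes file '{filename}' uses spaces instead of tabs "
            "as delimiters. Please use tabs."
        )

    sizes = {}
    bad_rows = []
    for lineno, row in enumerate(rows, start=1):
        fields = row.split("\t")
        # 2. Exactly two columns: sequence name and length.
        if len(fields) != 2:
            raise ValueError(
                f"Chromsizes file '{filename}', line {lineno}: expected 2 "
                f"tab-separated columns, found {len(fields)}."
            )
        name, length = fields
        # 3. The length must parse as an integer.
        try:
            sizes[name] = int(length)
        except ValueError:
            bad_rows.append((name, length))

    if bad_rows:
        raise ValueError(
            f"Chromsizes file '{filename}' contains missing or invalid "
            f"length values. Invalid rows: {bad_rows}"
        )
    return sizes


print(validate_chromsizes_lines(["chr1\t1000000\n", "chr2\t2000000\n"]))
```

Collecting all invalid rows before raising, instead of stopping at the first one, lets a user fix the whole file in one pass.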
How It Works Now
- Good File:
      chr1\t1000000
      chr2\t2000000
  → Works perfectly, returns a pd.Series with lengths indexed by chromosome names.
- Bad File with Spaces:
      chr1 1000000
      chr2 2000000
  → Fails early:
      ValueError: Chromsizes file uses spaces instead of tabs...
- Bad File with Invalid Lengths:
      chr1\t1000000
      chr2\tnot_a_number
  → Fails with:
      ValueError: Chromsizes file contains invalid length values... Invalid rows: chr2 NaN
- Bad File with Extra Columns:
      chr1\t1000000\textra
  → Fails with a pandas parsing error about mismatched columns.
Benefits
- Early Detection: Catches errors before they cause downstream crashes.
- Clear Feedback: Tells users exactly what’s wrong and how to fix it (e.g., “use tabs”, “check for invalid lengths”).
- Robustness: Handles a wider range of formatting mistakes, like spaces, hidden characters, or extra columns.
Notes
- This PR doesn’t add a verbose option, as per maintainer feedback; it’s not needed here.
- Future tweaks (e.g., checking lengths are positive or sampling more lines for spaces) are noted but deferred.
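For reference, the two deferred tweaks could look roughly like the sketch below. Both helpers are hypothetical and are not part of this PR; the default sample size of 10 lines is an arbitrary assumption.

```python
from itertools import islice


def find_untabbed_lines(lines, sample_size=10):
    """Return 1-based numbers of sampled non-blank lines that contain no tab.

    Hypothetical helper; checks the first `sample_size` lines only.
    """
    suspicious = []
    for lineno, line in enumerate(islice(lines, sample_size), start=1):
        if line.strip() and "\t" not in line:
            suspicious.append(lineno)
    return suspicious


def find_nonpositive_lengths(sizes):
    """Return the names whose length is zero or negative (hypothetical helper)."""
    return [name for name, length in sizes.items() if length <= 0]


print(find_untabbed_lines(["chr1\t1000\n", "chr2 2000\n", "chr3\t3000\n"]))  # [2]
print(find_nonpositive_lengths({"chr1": 1000, "chr2": 0, "chr3": -5}))  # ['chr2', 'chr3']
```

Sampling several lines would catch files where only some rows use spaces, which a first-line peek alone can miss.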
Testing
- Tested with:
- Valid tab-delimited files.
- Files with spaces instead of tabs.
- Files with non-numeric lengths or hidden characters.
- Files with extra or missing columns.
This update makes cooler more reliable and easier to use by catching chromsizes issues upfront with clear guidance for users.
Thank you for the contribution @ShigrafS! Would you mind adding a simple unit test that confirms the exception gets raised with bad input? You can use a broken version of toy.chrom.sizes.
@nvictus Sure, I'll do that and let you know.
@nvictus I have added the unit test and made some minor tweaks as well. Kindly look into it.
@nvictus I've made all the required changes. Kindly look into it.
@nvictus The PR is ready to be merged.
@nvictus just flagging.
@nvictus This PR is ready to be merged. Kindly review it.