make_lastz_chains icon indicating copy to clipboard operation
make_lastz_chains copied to clipboard

chainCleaner failure with chromosomes >1Gb (2^30)

Open ohdongha opened this issue 6 months ago • 7 comments

Hello again,

I was trying to align the corroboree frog and X. tropicalis genomes. The corroboree frog genome (GCF_028390025.1) is not prohibitingly large, but it includes a chromosome larger than 2^30 (1G) bases.

With a lot of masking, the make_lastz_chains pipeline (v1.0.0) ran fine, except for the last "cleanChain" step.

I re-run (locally in a server with a large RAM) the cleanChains.csh script generated by the pipeline as follows (after copying all relevant files, including the chain file used as the input, to ./):

#!/bin/bash
set -e
set -o pipefail
chainCleaner ./GCF_028390025.1.GCF_000004195.4.beforeCleaning.chain.gz ./GCF_028390025.1.2bit ./GCF_000004195.4.2bit ./GCF_028390025.1.GCF_000004195.4.allfilled.chain removedSuspects.bed -linearGap=loose  -tSizes=./GCF_028390025.1.chrom.sizes -qSizes=./GCF_000004195.4.chrom.sizes -LRfoldThreshold=2.5 -doPairs -LRfoldThresholdPairs=10 -maxPairDistance=10000 -maxSuspectScore=100000 -minBrokenChainScore=75000 >& ./log.chainCleaner
gzip ./GCF_028390025.1.GCF_000004195.4.allfilled.chain

This generated a temporary net file and halted with a core dump.

Below is the log file.

Verbosity level: 1
foldThreshold: 0.000000    LRfoldThreshold: 2.500000   maxSuspectBases: 2147483647  maxSuspectScore: 100000  minBrokenChainScore: 75000  minLRGapSize: 0 doPairs with LRfoldThreshold: 10.000000   maxPairDistance 10000
0. need to net the input chains ./GCF_028390025.1.GCF_000004195.4.beforeCleaning.chain.gz (no net file given) ...
                tempfile for netting: tmp.chainCleaner.Xu94iuE.net
Got 3127 chroms in ./GCF_028390025.1.chrom.sizes, 167 in ./GCF_000004195.4.chrom.sizes
Finishing nets
writing tmp.chainCleaner.Xu94iuE.net.raw
writing /dev/null
DONE (nets in tmp.chainCleaner.Xu94iuE.net)
1. parsing fills/gaps from tmp.chainCleaner.Xu94iuE.net and getting valid breaks ...
1.1 read net file tmp.chainCleaner.Xu94iuE.net into memory ...
DONE

1.2 get fills/gaps from tmp.chainCleaner.Xu94iuE.net ...
DONE

1.3 get aligning regions from tmp.chainCleaner.Xu94iuE.net ...
DONE

1.4 get valid breaks ...
chainCleaner: chainCleaner.c:938: newBreak: Assertion `breakP->RfillStart >= breakP->suspectEnd' failed.

The make_lastz_chains pipeline run produced the same error.

...

I have the same error whenever I try to align genomes with a chromosome >1Gb (e.g., Schistocerca locust genomes), so I suspect that might be the cause.

Increasing CHAIN_CLEAN_MEMORY did not help.

We heard from Hiram Clawson of UCSC that some of the UCSC tools (e.g., netChainSubset) have codes limiting the maximum sequence length to be 2^30. I wonder if chainCleaner or NetFilterNonNested.perl include a code with a similar limitation.

...

Another question is, if this issue is not fixable easily, whether it will be acceptable to use the *.beforeCleaning.chain.gz, basically skipping the last "cleanChain" step. How bad will the result be?

@MichaelHiller @kirilenkobm Please let me know if you need the chain and other files to reproduce this issue. Thanks!

Cheers, Dong-Ha

ohdongha avatar Aug 05 '24 00:08 ohdongha