make_lastz_chains
make_lastz_chains copied to clipboard
chainCleaner failure with chromosomes >1Gb (2^30)
Hello again,
I was trying to align the corroboree frog and X. tropicalis genomes. The corroboree frog genome (GCF_028390025.1) is not prohibitingly large, but it includes a chromosome larger than 2^30 (1G) bases.
With a lot of masking, the make_lastz_chains
pipeline (v1.0.0) ran fine, except for the last "cleanChain" step.
I re-run (locally in a server with a large RAM) the cleanChains.csh
script generated by the pipeline as follows (after copying all relevant files, including the chain file used as the input, to ./
):
#!/bin/bash
set -e
set -o pipefail
chainCleaner ./GCF_028390025.1.GCF_000004195.4.beforeCleaning.chain.gz ./GCF_028390025.1.2bit ./GCF_000004195.4.2bit ./GCF_028390025.1.GCF_000004195.4.allfilled.chain removedSuspects.bed -linearGap=loose -tSizes=./GCF_028390025.1.chrom.sizes -qSizes=./GCF_000004195.4.chrom.sizes -LRfoldThreshold=2.5 -doPairs -LRfoldThresholdPairs=10 -maxPairDistance=10000 -maxSuspectScore=100000 -minBrokenChainScore=75000 >& ./log.chainCleaner
gzip ./GCF_028390025.1.GCF_000004195.4.allfilled.chain
This generated a temporary net
file and halted with a core dump.
Below is the log file.
Verbosity level: 1
foldThreshold: 0.000000 LRfoldThreshold: 2.500000 maxSuspectBases: 2147483647 maxSuspectScore: 100000 minBrokenChainScore: 75000 minLRGapSize: 0 doPairs with LRfoldThreshold: 10.000000 maxPairDistance 10000
0. need to net the input chains ./GCF_028390025.1.GCF_000004195.4.beforeCleaning.chain.gz (no net file given) ...
tempfile for netting: tmp.chainCleaner.Xu94iuE.net
Got 3127 chroms in ./GCF_028390025.1.chrom.sizes, 167 in ./GCF_000004195.4.chrom.sizes
Finishing nets
writing tmp.chainCleaner.Xu94iuE.net.raw
writing /dev/null
DONE (nets in tmp.chainCleaner.Xu94iuE.net)
1. parsing fills/gaps from tmp.chainCleaner.Xu94iuE.net and getting valid breaks ...
1.1 read net file tmp.chainCleaner.Xu94iuE.net into memory ...
DONE
1.2 get fills/gaps from tmp.chainCleaner.Xu94iuE.net ...
DONE
1.3 get aligning regions from tmp.chainCleaner.Xu94iuE.net ...
DONE
1.4 get valid breaks ...
chainCleaner: chainCleaner.c:938: newBreak: Assertion `breakP->RfillStart >= breakP->suspectEnd' failed.
The make_lastz_chains
pipeline run produced the same error.
...
I have the same error whenever I try to align genomes with a chromosome >1Gb (e.g., Schistocerca locust genomes), so I suspect that might be the cause.
Increasing CHAIN_CLEAN_MEMORY
did not help.
We heard from Hiram Clawson of UCSC that some of the UCSC tools (e.g., netChainSubset
) have codes limiting the maximum sequence length to be 2^30. I wonder if chainCleaner
or NetFilterNonNested.perl
include a code with a similar limitation.
...
Another question is, if this issue is not fixable easily, whether it will be acceptable to use the *.beforeCleaning.chain.gz
, basically skipping the last "cleanChain" step. How bad will the result be?
@MichaelHiller @kirilenkobm Please let me know if you need the chain and other files to reproduce this issue. Thanks!
Cheers, Dong-Ha