make_lastz_chains
make_lastz_chains copied to clipboard
A minor discrepancy between v1.0.0 and v2.0.8 at 'fill_chain' step
Hi again,
Recently, we have been testing make_lastz_chains
v2.0.8 (MLC v2) to run on our SGE grid. After adding some clusterOptions
directives to the Nextflow
template (execute_joblist.nf
), the MLC v2 pipeline runs well.
However, the alignment results have always been slightly different from those with make_lastz_chains
v1.0.0 (MLC v1), so we compared the temporary job scripts from each step.
One difference is that MLC v1 uses --chainMinScore 25000
when running chain_gap_filler.py
during the "fill_chain" step, while MLC v2 uses --chainMinScore 1000
or any other value given with --min_chain_score
when setting up the pipeline.
MLC v2 accepts --fill_chain_min_score
separately, with a default value 25000.
But in the fill_chain_step.py
code, it uses param.chain_min_score
instead of param.fill_chain_min_score
(#24) when building job scripts that run chain_gap_fillter.py
:
$ grep -n -B8 -A6 chainMinScore ./make_lastz_chains-2.0.8/steps_implementations/fill_chain_step.py
16-def create_repeat_filler_joblist(params: PipelineParameters,
17- project_paths: ProjectPaths,
18- executables: StepExecutables):
19- to_log("Creating repeat filler jobs list")
20- infill_chain_filenames = os.listdir(project_paths.fill_chain_jobs_dir)
21- to_log(f"fGot {len(infill_chain_filenames)} chain files to fill")
22- lastz_parameters = f"\"K={params.fill_lastz_k} L={params.fill_lastz_l}\""
23- repeat_filler_params = [
24: f"--chainMinScore {params.chain_min_score}",
25- f"--gapMaxSizeT {params.fill_gap_max_size_t}",
26- f"--gapMaxSizeQ {params.fill_gap_max_size_q}",
27- f"--scoreThreshold {params.fill_insert_chain_min_score}",
28- f"--gapMinSizeT {params.fill_gap_min_size_t}",
29- f"--gapMinSizeQ {params.fill_gap_min_size_q}",
30- ]
I wonder if this is intended.
Replacing param.chain_min_score
with param.fill_chain_min_score
appears to reduce the number of final alignments slightly (after post-processing) without affecting the alignment coverage of CDS.
...
There are also differences in how the target and query sequences were chunked and how sequences smaller than the chunk size were treated during the lastz
step, but for this, I think what v2 does makes more sense than v1. :)
Another difference is handling the lastz_q
(or BLASTZ_Q
) parameter during the "chain_run" step. I will write about this in another issue.