A minor discrepancy between v1.0.0 and v2.0.8 at 'fill_chain' step

Open ohdongha opened this issue 4 months ago • 4 comments

Hi again,

Recently, we have been testing make_lastz_chains v2.0.8 (MLC v2) to run on our SGE grid. After adding some clusterOptions directives to the Nextflow template (execute_joblist.nf), the MLC v2 pipeline runs well.

However, the alignment results have always been slightly different from those with make_lastz_chains v1.0.0 (MLC v1), so we compared the temporary job scripts from each step.

One difference is that MLC v1 uses --chainMinScore 25000 when running chain_gap_filler.py during the "fill_chain" step, while MLC v2 uses --chainMinScore 1000 or any other value given with --min_chain_score when setting up the pipeline.

MLC v2 accepts --fill_chain_min_score separately, with a default value of 25000.

But the fill_chain_step.py code uses params.chain_min_score instead of params.fill_chain_min_score (#24) when building the job scripts that run chain_gap_filler.py:

$ grep -n -B8 -A6 chainMinScore ./make_lastz_chains-2.0.8/steps_implementations/fill_chain_step.py
16-def create_repeat_filler_joblist(params: PipelineParameters,
17-                                 project_paths: ProjectPaths,
18-                                 executables: StepExecutables):
19-    to_log("Creating repeat filler jobs list")
20-    infill_chain_filenames = os.listdir(project_paths.fill_chain_jobs_dir)
21-    to_log(f"fGot {len(infill_chain_filenames)} chain files to fill")
22-    lastz_parameters = f"\"K={params.fill_lastz_k} L={params.fill_lastz_l}\""
23-    repeat_filler_params = [
24:        f"--chainMinScore {params.chain_min_score}",
25-        f"--gapMaxSizeT {params.fill_gap_max_size_t}",
26-        f"--gapMaxSizeQ {params.fill_gap_max_size_q}",
27-        f"--scoreThreshold {params.fill_insert_chain_min_score}",
28-        f"--gapMinSizeT {params.fill_gap_min_size_t}",
29-        f"--gapMinSizeQ {params.fill_gap_min_size_q}",
30-    ]

I wonder if this is intended.

Replacing params.chain_min_score with params.fill_chain_min_score appears to reduce the number of final alignments slightly (after post-processing) without affecting the alignment coverage of CDS.
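For illustration, here is a minimal, self-contained sketch of the change I mean. `FillParams` and `build_repeat_filler_params` are hypothetical stand-ins, not the real `PipelineParameters` class or pipeline function. Only the two score values (1000 and 25000) come from the observations above; every other default is a placeholder and should not be read as the actual MLC v2 default.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class FillParams:
    """Hypothetical stand-in for the fields this step reads from the pipeline parameters."""
    chain_min_score: int = 1000          # value seen in our v2 job scripts
    fill_chain_min_score: int = 25000    # v2 default for --fill_chain_min_score
    # Placeholder values below, NOT the real pipeline defaults.
    fill_gap_max_size_t: int = 1
    fill_gap_max_size_q: int = 1
    fill_insert_chain_min_score: int = 1
    fill_gap_min_size_t: int = 1
    fill_gap_min_size_q: int = 1


def build_repeat_filler_params(params: FillParams) -> List[str]:
    """Build the chain_gap_filler.py options, taking --chainMinScore
    from fill_chain_min_score rather than chain_min_score."""
    return [
        f"--chainMinScore {params.fill_chain_min_score}",
        f"--gapMaxSizeT {params.fill_gap_max_size_t}",
        f"--gapMaxSizeQ {params.fill_gap_max_size_q}",
        f"--scoreThreshold {params.fill_insert_chain_min_score}",
        f"--gapMinSizeT {params.fill_gap_min_size_t}",
        f"--gapMinSizeQ {params.fill_gap_min_size_q}",
    ]


if __name__ == "__main__":
    print(" ".join(build_repeat_filler_params(FillParams())))
```

With this change, the "fill_chain" job scripts would get --chainMinScore 25000 regardless of what is passed as --min_chain_score, which matches what we see in the MLC v1 job scripts.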

...

There are also differences in how the target and query sequences were chunked and how sequences smaller than the chunk size were treated during the lastz step, but for this, I think what v2 does makes more sense than v1. :)

Another difference is handling the lastz_q (or BLASTZ_Q) parameter during the "chain_run" step. I will write about this in another issue.

ohdongha, Oct 08 '24 16:10