make_lastz_chains
make_lastz_chains copied to clipboard
Using an alternative scoring matrix in make_lastz_chains v.2.0.8
We have been adding BLASTZ_Q
parameter when running make_lastz_chains
v1.0.0 (MLC v1) to use an alternative scoring matrix (HoxD55 instead of the default HoxD70) for increased sensitivity when aligning more distant species (see issue #12).
Especially outside the mammalian or vertebrate clades, people quite often want to align genome pairs as far as human-chicken, human-frog, or human-zebrafish. We cannot say that HoxD55 is optimal for these pairs, but it has been better than HoxD70 and often retrieves substantially more alignment of CDS and synteny. I think the BLAST_Q
or lastz_q
in make_lastz_chains
v2.0.8 (MLC v2) that allows an alternative scoring matrix is a needed feature.
MLC v2 uses the same run_lastz.py
script as MLC v1 to set up lastz
job scripts. Hence, we could deliver the lastz_q
command similarly to issue #12. We created a one-line JSON file as follows and added it to the parameter set using --params_from_file ${one_line_JSON}
:
{ "lastz_q": "${path_to_HoxD55.q_file}" }
With this, lastz
was run with the HoxD55 matrix as expected. However, the MLC v2 "chain_run" step was not working the same way as MLC v1. This was because the chain_run_step.py
script does not add lastz_q
parameter to the axtChain
job scripts as -scoreScheme=${matrix}
parameter.
We tried to modify the chain_run_step.py
as below (lines 53, 79, and 80 with "# add lastz_q if it exists" are newly added):
$ grep -n -A36 "def make_chains_joblist" make_lastz_chains-2.0.8/steps_implementations/chain_run_step.py
47:def make_chains_joblist(project_paths: ProjectPaths,
48- params: PipelineParameters,
49- executables: StepExecutables):
50- # Prepare parameters
51- seq1_dir = params.seq_1_dir
52- seq2_dir = params.seq_2_dir
53- matrix = params.lastz_q if hasattr(params, 'lastz_q') else "" # add lastz_q if it exists
54- # matrix = ""
55- min_score = params.chain_min_score
56- linear_gap = params.chain_linear_gap
57- bundle_filenames = os.listdir(project_paths.split_psl_dir)
58- to_log(f"Building axtChain joblist for {len(bundle_filenames)} bundled psl files")
59-
60- cluster_jobs = []
61- for bundle_filename in bundle_filenames:
62- in_path = os.path.join(project_paths.split_psl_dir, bundle_filename)
63- out_path = os.path.join(project_paths.chain_output_dir, f"{bundle_filename}.chain")
64- cmd = [executables.axt_chain,
65- "-psl",
66- "-verbose=0",
67- f"-minScore={min_score}",
68- f"-linearGap={linear_gap}",
69- in_path,
70- seq1_dir,
71- seq2_dir,
72- "stdout",
73- "|",
74- executables.chain_anti_repeat,
75- seq1_dir,
76- seq2_dir,
77- "stdin",
78- out_path]
79- if matrix != "" : # add lastz_q if it exists
80- cmd.insert(3, f"-scoreScheme={matrix}") # add lastz_q if it exists
81- cluster_jobs.append(" ".join(cmd))
82- return cluster_jobs
83-
This modification will add the matrix parameter ONLY if the lastz_q
parameter was ever declared. With this modification, MLC v2 alignments for distant pairs with HoxD55 matrix scored as high alignment CDS coverage as those we did with MLC v1. Without it, the alignment CDS coverages were 4~10% lower than what we had with MLC v1 in multiple tests.
...
This was a bit of a "band-aid" fix, and we had to deliver the lastz_q
matrix as a JSON parameter file. This seemed to work for the lastz
and "chain_run" steps, printing the job scripts equivalent to those we saw in MLC v1. I am not sure, though, if these two steps are the only place where the matrix parameter needs to be added. I wonder if the "fill_chain" or "clean_chain" step also needs the matrix parameter if we are not using the default. Anyways, delivering the HoxD55 matrix to lastz
and "chain_run" steps recovered the alignment stats (CDS coverage) very close to what we had with MLC v1.
You could more formally add the --lastz_q
parameter to make_chains.py
. One tricky thing is that, unlike other alphabet parameters (lastz_h
, lastz_y
, etc.), lastz_q
cannot be an empty string if it is declared at all. You may have to include the default HoxD70 matrix as a file and use its location as the default value of lastz_q
.