nextclade
ENH(nextalign cli): show default values in --help usage statement
Hi! I'm trying out nextalign on norovirus genomes (small ssRNA, ~7.5kb, but highly diverged), and most sequences are unalignable with nextalign's default settings (Unable to align: low seed matching rate. Details: number of seeds: 73, number of seed matches: 2, matching rate: 0.027, required matching rate: 0.300. Note that this sequence will not be included in the results.).
I'd like to try playing with the seed parameters. nextalign's --help statement describes the params but not their default values:
--seed-length <SEED_LENGTH>
    k-mer length to determine approximate alignments between query and reference and
    determine the bandwidth of the banded alignment
--mismatches-allowed <MISMATCHES_ALLOWED>
    Maximum number of mismatching nucleotides allowed for a seed to be considered a match
--min-seeds <MIN_SEEDS>
    Minimum number of seeds to search for during nucleotide alignment. Relevant for short
    sequences. In long sequences, the number of seeds is determined by `--seed-spacing`
--min-match-rate <MIN_MATCH_RATE>
    Minimum seed mathing rate (a ratio of seed matches to total number of attempted seeds)
--seed-spacing <SEED_SPACING>
    Spacing between seeds during nucleotide alignment
It would be nice to know the default values as a starting point for exploring the parameter space. I guess I could figure out the seed length from the 'Unable to align' messages 🙂 but it would be very nice if the --help told them all. Thanks!
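For anyone triaging the same failure, here is a minimal Python sketch that pulls the numbers out of the 'Unable to align' message quoted above and recomputes the matching rate as seed matches divided by attempted seeds, which is how the --min-match-rate help text describes it. The function name `parse_seed_stats` is made up for illustration, and the regular expression assumes the message keeps exactly the wording shown in this post; it is not part of nextalign.

```python
import re

# The message text is copied from the post above.
MESSAGE = (
    "Unable to align: low seed matching rate. Details: number of seeds: 73, "
    "number of seed matches: 2, matching rate: 0.027, required matching rate: 0.300."
)


def parse_seed_stats(message):
    """Extract seed statistics from an 'Unable to align' message.

    Hypothetical helper: the pattern assumes the exact wording quoted above.
    """
    pattern = (
        r"number of seeds: (\d+), number of seed matches: (\d+), "
        r"matching rate: (\d+\.\d+), required matching rate: (\d+\.\d+)"
    )
    match = re.search(pattern, message)
    if match is None:
        raise ValueError("message does not match the expected format")
    return {
        "seeds": int(match.group(1)),
        "matches": int(match.group(2)),
        "rate": float(match.group(3)),
        "required": float(match.group(4)),
    }


stats = parse_seed_stats(MESSAGE)

# Matching rate = seed matches / attempted seeds (2 / 73 here).
recomputed = stats["matches"] / stats["seeds"]

# The sequence is rejected because the rate is below the required minimum.
passes = recomputed >= stats["required"]

print(stats, round(recomputed, 3), passes)
```

With only 2 of 73 seeds matching, the sequence is far below the required rate of 0.300, so lowering --min-match-rate alone would need a very permissive value; the other seed parameters are probably worth exploring too.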
Thanks for the feedback, much appreciated 😊
We're very soon going to release version 3 which uses a much more robust seeding algorithm that should have no issues with Norovirus.
We're using clap for the CLI, and I'm not sure whether it's easy to display defaults. I agree that it would be good to show them in the help.
Meanwhile you can find the defaults in parameters.rs, let me look it up.
On Mon, Sep 11, 2023, 23:10 Angie Hinrichs wrote the post quoted above (https://github.com/nextstrain/nextclade/issues/1253).
The hardcoded defaults for v2 are here (branch v2): https://github.com/nextstrain/nextclade/blob/242d56fbd8d8af67df3157bd047252f5580e3df8/packages_rs/nextclade/src/align/params.rs#L106-L131
For v3 (not stable, branch master) the hardcoded defaults are here: https://github.com/nextstrain/nextclade/blob/119cd4a33d7d08a57021a3ecd438f8f401be53b2/packages_rs/nextclade/src/align/params.rs#L141-L174
There are 2 important changes to consider in the upcoming Nextclade v3:
- alignment algo is changed quite a bit, so the params will change
- Nextalign executable is removed. Instead, Nextclade will take over the same job. In the new dataset format most files will be optional (and the dataset is also optional, so individual input args can be used) - all this to emulate the interface of Nextalign and to facilitate incremental development of datasets.
Because we are removing Nextalign, it does not make sense to add params into its help text anymore, as we are not planning any more releases.
Regarding Nextclade: the datasets can (and do) override parameters (using the virus_properties.json file in v2 and pathogen.json in v3), because different viruses sometimes need different tuning. So I think that the displayed hardcoded numbers might be inaccurate and misleading, depending on which dataset you are planning to run. But let me know if you think it makes sense to add hardcoded defaults to the Nextclade v3 help text anyway.
In the meantime, one thing you can try is to add the -v (--verbose) flag to the run command, and then the program should print the final values for this particular run, already taking into account values (in this order) in:
- dataset (if using Nextclade and if they are defined)
- CLI args (if an arg is provided)
- hardcoded defaults
UPD: This statement is incorrect for v2: "already taking into account values (in this order) in".
Nextclade/Nextalign v2 only print the CLI args, before merging in the defaults, which is probably not very useful. This will change in v3.
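To illustrate what "merging" the three sources means, here is a rough Python sketch, not the actual Nextclade code. It assumes the usual layering where hardcoded defaults are the base, dataset values override them, and explicitly passed CLI args override both; the thread does not spell out which source wins, so treat that precedence as an assumption. The function `resolve_params` and all values except min_match_rate 0.3 (taken from the error message in the original post) are placeholders, not real Nextclade defaults.

```python
def resolve_params(defaults, dataset=None, cli=None):
    """Layer parameter sources: defaults, then dataset, then CLI args.

    Assumed precedence (not confirmed in this thread): later layers win.
    A value of None means "not provided" and does not override anything.
    """
    resolved = dict(defaults)  # copy so the defaults stay untouched
    for layer in (dataset, cli):
        if layer:
            resolved.update({k: v for k, v in layer.items() if v is not None})
    return resolved


# PLACEHOLDER values, except min_match_rate, which matches the
# "required matching rate: 0.300" reported in the original post.
HARDCODED = {"min_match_rate": 0.3, "seed_spacing": 100, "seed_length": 21}

dataset_overrides = {"seed_spacing": 50}  # e.g. from a dataset file
cli_args = {"min_match_rate": 0.1, "seed_length": None}  # seed_length not passed

final = resolve_params(HARDCODED, dataset_overrides, cli_args)
print(final)
```

In this picture, v2's verbose output corresponds to printing `cli_args` on its own, while the v3 behaviour described above would correspond to printing `final`.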
If you want to try Nextclade v3:
You can download prebuilt binaries on GitHub Actions:
- Filter runs by branch "master": https://github.com/nextstrain/nextclade/actions?query=branch%3Amaster
- Open a run and go to its "Artifacts" section
- Click on artifact named "out". It will be downloaded as "out.zip", which contains binaries for all platforms
Or you can build it from source, from master branch, using our dev guide: https://github.com/nextstrain/nextclade/blob/master/docs/dev/developer-guide.md
But v3 is not released and not stable yet. It's still a bit of a crazy land, and things might break. If they do, you can try a slightly earlier build from the list of GitHub Actions runs. When things calm down a bit, we'll probably release an alpha version, or a few.
We appreciate early testing and feedback!
Thanks @ivan-aksamentov! I will give both a try. I see v3 can be run without a dataset if --input-ref is provided, great. 🚀