
Align model names

Open Kirk3gaard opened this issue 4 years ago • 2 comments

Hi

Keeping track of all the ONT version numbers can be a bit tricky. Maybe it would be worthwhile to align the model names used by the guppy basecaller with those used by medaka. Currently R10 data needs the dna_r10_450bps_hac model for guppy, but the corresponding model for medaka consensus polishing is named r10_min_high.

Guppy v. 3.2.2 R10 models:

FLO-MIN110 SQK-CAS109 dna_r10_450bps_hac
FLO-MIN110 SQK-DCS108 dna_r10_450bps_hac
FLO-MIN110 SQK-DCS109 dna_r10_450bps_hac
FLO-MIN110 SQK-LRK001 dna_r10_450bps_hac
FLO-MIN110 SQK-LSK108 dna_r10_450bps_hac
FLO-MIN110 SQK-LSK109 dna_r10_450bps_hac
FLO-MIN110 SQK-LSK109-XL dna_r10_450bps_hac
FLO-MIN110 SQK-LWP001 dna_r10_450bps_hac
FLO-MIN110 SQK-PCS108 dna_r10_450bps_hac
FLO-MIN110 SQK-PCS109 dna_r10_450bps_hac
FLO-MIN110 SQK-PSK004 dna_r10_450bps_hac
FLO-MIN110 SQK-RAD002 dna_r10_450bps_hac
FLO-MIN110 SQK-RAD003 dna_r10_450bps_hac
FLO-MIN110 SQK-RAD004 dna_r10_450bps_hac
FLO-MIN110 SQK-RAS201 dna_r10_450bps_hac
FLO-MIN110 SQK-RLI001 dna_r10_450bps_hac
FLO-MIN110 VSK-VBK001 dna_r10_450bps_hac
FLO-MIN110 VSK-VSK001 dna_r10_450bps_hac
FLO-MIN110 VSK-VSK002 dna_r10_450bps_hac
FLO-MIN110 SQK-16S024 included dna_r10_450bps_hac
FLO-MIN110 SQK-PCB109 included dna_r10_450bps_hac
FLO-MIN110 SQK-RBK001 included dna_r10_450bps_hac
FLO-MIN110 SQK-RBK004 included dna_r10_450bps_hac
FLO-MIN110 SQK-RLB001 included dna_r10_450bps_hac
FLO-MIN110 SQK-LWB001 included dna_r10_450bps_hac
FLO-MIN110 SQK-PBK004 included dna_r10_450bps_hac
FLO-MIN110 SQK-RAB201 included dna_r10_450bps_hac
FLO-MIN110 SQK-RAB204 included dna_r10_450bps_hac
FLO-MIN110 SQK-RPB004 included dna_r10_450bps_hac
FLO-MIN110 VSK-VMK001 included dna_r10_450bps_hac
FLO-MIN110 VSK-VMK002 included dna_r10_450bps_hac

Medaka v. 0.8.1 models:

-m  medaka model (default: r941_min_high). Available: r941_trans, r941_flip213, r941_flip235, r941_min_fast, r941_min_high, r941_prom_fast, r941_prom_high, r10_min_high

Kirk3gaard, Sep 04 '19 10:09

I agree there's some unfortunate confusion here.

There is indeed a close connection between the guppy model used and the appropriate medaka model. Guppy, however, has a variety of ways to specify its configuration, so the model eventually selected might not be known to the user (e.g. if the user specifies a kit rather than a model).

Therefore the medaka model names were chosen to be simple, yet convey the important points in the form <pore>_<platform>_<basecaller variant>, such that one can make an educated guess at the correct model to use for a dataset. _high was chosen over _hac because it was felt to relate more obviously to high accuracy basecalling for users unaware of the hac jargon.

In this simplifying spirit, speed (450bps) was left out of the medaka model names as it's not something the user can select, while rna/dna was dropped because medaka does not deal in RNA.
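As a rough illustration of the naming convention described above, the following Python sketch derives a medaka-style name from a guppy config name. The helper `guppy_to_medaka_name` is hypothetical and not part of guppy or medaka. It assumes guppy names of the form dna_<pore>_<speed>bps_<variant> with an optional _prom suffix, and it assumes that a name without a platform suffix refers to MinION (min); neither assumption is guaranteed to hold across guppy releases.

```python
import re

# Guppy variant -> medaka variant, per the discussion above (_hac became _high).
VARIANT_ALIASES = {"hac": "high", "fast": "fast"}

# Assumed guppy pattern: dna_<pore>_<speed>bps_<variant>[_prom]
GUPPY_PATTERN = re.compile(r"dna_(r[0-9.]+)_(\d+)bps_([a-z]+)(?:_(prom))?")


def guppy_to_medaka_name(guppy_config: str, platform: str = "min") -> str:
    """Hypothetical helper: guess a medaka model name from a guppy config name."""
    match = GUPPY_PATTERN.fullmatch(guppy_config)
    if match is None:
        raise ValueError(f"unrecognised guppy config name: {guppy_config!r}")
    pore, _speed, variant, suffix = match.groups()
    if variant not in VARIANT_ALIASES:
        raise ValueError(f"unknown basecaller variant: {variant!r}")
    # Speed and the dna/rna prefix are dropped; dots are removed from the pore.
    pore = pore.replace(".", "")
    # An explicit _prom suffix overrides the assumed default platform.
    platform = suffix or platform
    return f"{pore}_{platform}_{VARIANT_ALIASES[variant]}"


print(guppy_to_medaka_name("dna_r10_450bps_hac"))
```

Under these assumptions the example from the original post, dna_r10_450bps_hac, maps to r10_min_high. Any result should still be checked against the list of models that the installed medaka version reports.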

Medaka also needs to deal with the fact that specific basecaller models may need specific medaka models, so we will likely extend <pore>_<platform>_<variant> to include _<guppy model version> in the future, whilst maintaining the former form as a shortcut to the most recent version. This distinction is not something that guppy currently makes: its model names do not include a version.
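The shortcut idea could be sketched roughly as below. This is only an illustration: the registry contents, the `_g<digits>` suffix format, and the function names are assumptions made for the example, not medaka's actual implementation or model list.

```python
# Illustrative registry: base name -> guppy versions with a matching model.
# The versions listed here are placeholders chosen for the example.
REGISTRY = {
    "r941_min_high": [(3, 0, 3), (3, 3, 0), (3, 4, 4)],
    "r10_min_high": [(3, 0, 3)],
}


def versioned_name(base: str, version: tuple) -> str:
    """Build a full name such as r941_min_high_g303 (assumed suffix format)."""
    return f"{base}_g{''.join(str(part) for part in version)}"


def resolve_model(name: str, registry: dict) -> str:
    """Return a fully versioned name; a bare base name means 'most recent'."""
    if name in registry:
        # Tuples compare element-wise, so max() picks the newest version.
        return versioned_name(name, max(registry[name]))
    for base, versions in registry.items():
        if any(versioned_name(base, v) == name for v in versions):
            return name
    raise KeyError(f"unknown model: {name!r}")


print(resolve_model("r941_min_high", REGISTRY))
print(resolve_model("r941_min_high_g303", REGISTRY))
```

Storing versions as tuples keeps the ordering unambiguous even though the printed suffix concatenates the digits.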

I will see what we can do to be more co-ordinated on this front.

cjw85, Sep 27 '19 10:09

@cjw85 Would it be possible to stick to Major.Minor instead of MajorMinorPatch for _<guppy model version>?

fbemm, Apr 23 '20 11:04

Medaka v1.11.0 introduces the ability to introspect input files in order to determine the correct model to use, see this section of the README.

This is mainly of use with dorado, which versions models independently of the basecaller software version. For older models the previous heuristics must be used.

cjw85, Oct 23 '23 21:10