Better documentation about the models?

Open ymcki opened this issue 1 year ago • 17 comments

0.5 is released together with [email protected] basecalling model. So we need to think about whether to switch or not.

Based on the limited documentation, I only know it is faster than 4.2.0. But is the base quality better or worse than 4.2.0???

Since we always do 5mCG_5hmCG calling, a big problem with 4.3.0 model is that it only supports remora model [email protected]_5mCG_5hmCG@v1. On the other hand, the old 4.2.0 supports [email protected][email protected]

However, the documentation regarding the remora models is even weaker. So I don't know if v3.1 is only a speed up of v1 or there is an accuracy improvement. If v3.1 is just a speed up and no accuracy improvement, then I am ok with upgrading to v4.3.0. Otherwise, I should just stick with 4.2.0. Is it possible to have a centralized place such that I can know what's going on with these models?

Thanks a lot in advance.

ymcki avatar Dec 08 '23 01:12 ymcki

Hi @ymcki - ack on the limited documentation on our model updates and the version numbers since they can be confusing. We are working on putting something together and will publish soon.

To answer your question -

is the [v4.3.0] base quality better or worse than 4.2.0

v4.3.0 basecalling accuracy is better as well. For SUP there's almost 1 q-point improvement over v4.2.0. v4.3.0 is also better at calling low complexity regions and bacterial isolates.

The modbase model versions are conditioned on the basecaller version. So [email protected]_5mCG_5hmCG@v1 is the first release of the 5mCG_5hmCG modbase model for the v4.3.0 basecaller model. If the baseline basecaller model changes (e.g. from v4.2.0 to v4.3.0), then the modbase model versions reset.

[email protected]_5mCG_5hmCG@v1 is the latest and greatest 5mCG_5hmCG model as it benefits from both canonical base call improvements + modbase improvements. So our strong recommendation is to move over to the newest models.

tijyojwad avatar Dec 08 '23 17:12 tijyojwad
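The naming scheme described above (a basecaller model version, then the modification name, then a modbase version that restarts with each new basecaller model) can be illustrated with a small bash sketch. The name used here is a hypothetical stand-in: `example_sup` is a made-up prefix, and only the `@v4.3.0_5mCG_5hmCG@v1` shape follows the names quoted in this thread.

```shell
#!/usr/bin/env bash
# Hypothetical modbase model name. "example_sup" is a placeholder prefix,
# not a real model; only the "@<basecaller version>_<mods>@<modbase version>"
# shape is taken from the thread.
name='example_sup@v4.3.0_5mCG_5hmCG@v1'

modbase_version="${name##*@}"      # text after the last '@'
without_modbase="${name%@*}"       # strip the trailing '@<modbase version>'
rest="${without_modbase#*@}"       # text after the first '@'
basecaller_version="${rest%%_*}"   # everything before the first '_'
modification="${rest#*_}"          # everything after the first '_'

echo "basecaller version: $basecaller_version"
echo "modification:       $modification"
echo "modbase version:    $modbase_version"
```

Read this way, `@v1` at the end says nothing about the model being older than a `@v3.1` model built for v4.2.0; it only counts releases for that particular basecaller version.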

Hi. I have some questions in that regard as well. Is it possible to run the remora models tied to a basecalling model, such as [email protected]_5mCG_5hmCG, alongside custom remora models via --modified-bases-models? For example, to differentiate three unique modified bases in the same context.

jorisbalc avatar Dec 14 '23 11:12 jorisbalc

Hi @jorisbalc - those options cannot be used together. If you want to run [email protected] 5mCG_5hmCG and a custom remora model, you'll need to download the 5mCG model first and pass it through the modified-bases-models option as well

$ dorado download --model [email protected]_5mCG_5hmCG@v1
$ dorado basecaller [email protected] <data> --modified-bases-models [email protected]_5mCG_5hmCG@v1 <custom_model> > out.bam

tijyojwad avatar Dec 14 '23 20:12 tijyojwad

Thanks for the clarification. How do you avoid the [error] Maximum number of positional arguments exceeded when adding multiple models via --modified-bases-models? Also, should I be retraining my remora model with updated basecalls whenever a new version of a basecaller model is released? Or is a remora model trained on v4.2.0 compatible with the v4.3.0 basecall model?

jorisbalc avatar Dec 15 '23 13:12 jorisbalc

Thanks for your reply. It would be great if the new documentation will have numbers like running time and accuracy for each model such that we can make our decision on which one to pick for our application.

ymcki avatar Dec 19 '23 05:12 ymcki

@jorisbalc

How do you avoid the [error] Maximum number of positional arguments exceeded when adding multiple models via --modified-bases-models?

can you share what cmd you're using?

Also, should I be retraining my remora model with updated basecalls whenever a new version of a basecaller model is released? Or is a remora model trained on v4.2.0 compatible with the v4.3.0 basecall model?

Remora is usually trained with updated basecalls. This is better asked in the remora github repo though - https://github.com/nanoporetech/remora/issues

tijyojwad avatar Dec 19 '23 14:12 tijyojwad

@tijyojwad

Forgot to add it with the post but here it is:

v313@v313-GP66-Leopard-11UH:~$  dorado basecaller /home/v313/Dorado/models/[email protected] /media/v313/T7\ Shield/2023_12_06_FAT_PL/bc14/ --modified-bases-models /home/v313/Dorado/models/[email protected]_5mCG_5hmCG@v1 /media/v313/Dorado/models/remora/odyC_v1.0/ --reference /home/v313/ref-seqs/gm119_ref.fasta > calls_mC_hmC_odyC.bam

This should work according to the cmd you wrote, unless I'm missing something? The remora model was exported to dorado and works with the same cmd if it runs alone. I'm sure it's the positioning that I'm just not getting the hang of

jorisbalc avatar Dec 19 '23 15:12 jorisbalc

Hi @tijyojwad

Splitting your command by whitespace shows you have 3 positional arguments here. I believe only 2 positional arguments are expected for model and data respectively.

v313@v313-GP66-Leopard-11UH:~$  
dorado
basecaller 
/home/v313/Dorado/models/[email protected] # (model)
/media/v313/T7\ # (data)
Shield/2023_12_06_FAT_PL/bc14/ # (?)
--modified-bases-models
/home/v313/Dorado/models/[email protected]_5mCG_5hmCG@v1
/media/v313/Dorado/models/remora/odyC_v1.0/ 
--reference 
/home/v313/ref-seqs/gm119_ref.fasta 
> calls_mC_hmC_odyC.bam

Kind regards, Rich

HalfPhoton avatar Jan 02 '24 15:01 HalfPhoton

I'd assume this was addressed to me. The T7 Shield is just a storage device for some of the data. It is not the problem here, since the whitespace is escaped via \ and treated as a normal character. The additional argument after --modified-bases-models <dna_r10.4.1...> is what does not work for me, since in the cmd @tijyojwad mentioned, the custom model is the third argument.

jorisbalc avatar Jan 02 '24 16:01 jorisbalc
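The point about escaped whitespace can be checked directly in the shell. This is a minimal sketch with a hypothetical helper, `count_args`, and made-up paths; it only shows how bash splits words and does not involve dorado at all.

```shell
#!/usr/bin/env bash
# Hypothetical helper: print how many arguments the shell passed in.
count_args() { echo "$#"; }

# Made-up paths, for illustration only.
escaped=$(count_args /media/example/T7\ Shield/data/)    # backslash-escaped space
quoted=$(count_args "/media/example/T7 Shield/data/")    # quoted path
unquoted=$(count_args /media/example/T7 Shield/data/)    # unescaped space

echo "escaped: $escaped, quoted: $quoted, unquoted: $unquoted"
# prints "escaped: 1, quoted: 1, unquoted: 2"
```

An escaped or quoted space keeps the path as a single argument, so a path like the one on the T7 Shield counts as one positional argument, not two.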

Hi @jorisbalc ,

So to clarify, your command works without the <custom_model> but not with it?

v313@v313-GP66-Leopard-11UH:~$  dorado basecaller  
/home/v313/Dorado/models/[email protected] 
/media/v313/T7\ Shield/2023_12_06_FAT_PL/bc14/ 
--modified-bases-models 
/home/v313/Dorado/models/[email protected]_5mCG_5hmCG@v1 
<custom_model>
--reference 
/home/v313/ref-seqs/gm119_ref.fasta 
> calls_mC_hmC_odyC.bam

HalfPhoton avatar Jan 02 '24 16:01 HalfPhoton

Correct, adding a second model after --modified-bases-models causes it to be treated as a third positional argument.

Example:

dorado basecaller <model> --modified-bases-models <model_mC_hmC> > calls.bam works
dorado basecaller <model> --modified-bases-models <model_mC_hmC> <custom model> > calls.bam does not work

jorisbalc avatar Jan 02 '24 16:01 jorisbalc

I've tested locally and had the same issue.

Fortunately, the solution is as easy as it is easy to miss! We require a comma separated list of modbase models.

dorado basecaller --help
...
  --modified-bases-models       a comma separated list of modified base models [default: ""]
...
dorado basecaller <model> --modified-bases-models <model_mC_hmC>,<custom model> > calls.bam

Kind regards, Rich

HalfPhoton avatar Jan 02 '24 16:01 HalfPhoton
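When the model list is built in a script, joining the paths with commas avoids the space-separated form that failed above. The following is a sketch only: the paths are hypothetical, `<model>` and `<data>` are the same placeholders used in this thread, and the command is printed as a dry run instead of being executed.

```shell
#!/usr/bin/env bash
# Hypothetical model paths, for illustration only.
modbase_models=(
  "/path/to/models/example_5mCG_5hmCG_model"
  "/path/to/models/custom_model"
)

# Join the array with commas: inside the subshell IFS is set to ',', and
# "${array[*]}" expands to the elements separated by the first IFS character.
models_csv=$(IFS=,; echo "${modbase_models[*]}")

# Dry run: print the command instead of running it.
echo dorado basecaller "<model>" "<data>" --modified-bases-models "$models_csv"
```

Because the joined list contains no spaces, the shell passes it as one value to --modified-bases-models, leaving only the model and the data as positional arguments.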

Seems that I have overlooked this. Works fine now, thanks for the replies!

jorisbalc avatar Jan 02 '24 17:01 jorisbalc

my bad @jorisbalc I missed the , as well! thanks @HalfPhoton for looking into this!

tijyojwad avatar Jan 02 '24 19:01 tijyojwad

I am new... can someone help me to solve this error? Thank you

(base) [mmolla@node304 ~]$ dorado basecaller /gpfs2/scratch/mmolla/Pod*.pod5 [email protected] --modified-bases-models >/gpfs2/scratch/mmolla/HL3/calls.bam
[2024-04-15 12:56:19.354] [info] Running: "basecaller" "/gpfs2/scratch/mmolla/Pod*.pod5" "[email protected]" "--modified-bases-models"
[2024-04-15 12:56:19.400] [info] > Creating basecall pipeline
[2024-04-15 12:56:19.414] [error] toml::parse: file open error -> /gpfs2/scratch/mmolla/Pod*.pod5/config.toml
(base) [mmolla@node304 ~]$ dorado basecaller /gpfs2/scratch/mmolla/Pod*.pod5 [email protected] --modified-bases-models >/gpfs2/scratch/mmolla/HL3/calls.bam

habibsaky avatar Apr 15 '24 17:04 habibsaky

dorado basecaller /gpfs2/scratch/mmolla/Pod/*.pod5 [email protected] --modified-bases-models > /gpfs2/scratch/mmolla/HL3/calls.bam

habibsaky avatar Apr 15 '24 18:04 habibsaky

@habibsaky, I've split this into a separate issue for you - https://github.com/nanoporetech/dorado/issues/749.

malton-ont avatar Apr 17 '24 08:04 malton-ont

Closing this ticket as there have been changes to the models documentation, and the other issues have been resolved or moved to new issues.

HalfPhoton avatar Sep 17 '24 10:09 HalfPhoton