dorado
Better documentation about the models?
Dorado 0.5 was released together with the [email protected] basecalling model, so we need to decide whether to switch.
Based on the limited documentation, I only know that it is faster than 4.2.0. Is the base quality better or worse than with 4.2.0?
Since we always do 5mCG_5hmCG calling, a big problem with the 4.3.0 model is that it only supports the remora model [email protected]_5mCG_5hmCG@v1. The old 4.2.0, on the other hand, supports [email protected][email protected]
However, the documentation on the remora models is even weaker, so I don't know whether v3.1 is only a speed-up of v1 or whether there is also an accuracy improvement. If v3.1 is just a speed-up with no accuracy improvement, then I am OK with upgrading to v4.3.0. Otherwise, I should just stick with 4.2.0. Is it possible to have a centralized place where I can find out what is going on with these models?
Thanks a lot in advance.
Hi @ymcki - ack on the limited documentation on our model updates and the version numbers since they can be confusing. We are working on putting something together and will publish soon.
To answer your question -
is the [v4.3.0] base quality better or worse than 4.2.0
v4.3.0 basecalling accuracy is better as well. For SUP there's almost 1 q-point improvement over v4.2.0. v4.3.0 is also better at calling low complexity regions and bacterial isolates.
The modbase model versions are conditioned on the basecaller version. So [email protected]_5mCG_5hmCG@v1 is the first release of the 5mCG_5hmCG modbase model for the v4.3.0 basecaller model. If the baseline basecaller model changes (e.g. from v4.2.0 to v4.3.0), then the modbase model versions reset.
[email protected]_5mCG_5hmCG@v1 is the latest and greatest 5mCG_5hmCG model as it benefits from both canonical base call improvements + modbase improvements. So our strong recommendation is to move over to the newest models.
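To make the versioning scheme above concrete, here is a minimal shell sketch that splits a modbase model name into its parts. The name used is hypothetical (it is not a real model name) and only assumes the pattern visible in this thread: a basecaller name and version, followed by the modification type and the modbase model's own version.

```shell
# Hypothetical model name following the assumed pattern
#   <basecaller>@<basecaller version>_<mods>@<modbase version>
name="example_sup@v4.3.0_5mCG_5hmCG@v1"

# Everything after the last "@" is the modbase model version.
modbase_version=${name##*@}

# Drop the trailing "@<modbase version>", then drop the basecaller name.
rest=${name%@*}
version_and_mods=${rest#*@}

# The basecaller version runs up to the first "_"; the rest is the mod type.
basecaller_version=${version_and_mods%%_*}
mods=${version_and_mods#*_}

printf 'basecaller version: %s\n' "$basecaller_version"
printf 'modification type:  %s\n' "$mods"
printf 'modbase version:    %s\n' "$modbase_version"
```

Read this way, the `@v1` suffix counts releases of the modbase model for one specific basecaller version, which is why it restarts when the basecaller moves from v4.2.0 to v4.3.0.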
Hi. I have some related questions as well. Is it possible to run the remora-tied basecalling models such as [email protected]_5mCG_5hmCG alongside custom remora models via --modified-bases-models? For example, to differentiate three unique modified bases in the same context.
Hi @jorisbalc - those options cannot be used together. If you want to run [email protected] 5mCG_5hmCG and a custom remora model, you'll need to download the 5mCG model first and pass it through the modified-bases-models option as well:
$ dorado download --model [email protected]_5mCG_5hmCG@v1
$ dorado basecaller [email protected] <data> --modified-bases-models [email protected]_5mCG_5hmCG@v1 <custom_model> > out.bam
Thanks for the clarification. How do you avoid the [error] Maximum number of positional arguments exceeded when adding multiple models via --modified-bases-models? Also, should I be retraining my remora model with updated basecalls whenever a new version of a basecaller model is released? Or is a remora model trained on v4.2.0 compatible with the v4.3.0 basecall model?
Thanks for your reply. It would be great if the new documentation included numbers such as running time and accuracy for each model, so that we can decide which one to pick for our application.
@jorisbalc
How do you avoid the [error] Maximum number of positional arguments exceeded when adding multiple models via --modified-bases-models?
Can you share what cmd you're using?
Also, should I be retraining my remora model with updated basecalls whenever a new version of a basecaller model is released? Or is a remora model trained on v4.2.0 compatible with the v4.3.0 basecall model?
Remora is usually trained with updated basecalls. This is better asked in the remora GitHub repo though - https://github.com/nanoporetech/remora/issues
@tijyojwad
Forgot to add it with the post but here it is:
v313@v313-GP66-Leopard-11UH:~$ dorado basecaller /home/v313/Dorado/models/[email protected] /media/v313/T7\ Shield/2023_12_06_FAT_PL/bc14/ --modified-bases-models /home/v313/Dorado/models/[email protected]_5mCG_5hmCG@v1 /media/v313/Dorado/models/remora/odyC_v1.0/ --reference /home/v313/ref-seqs/gm119_ref.fasta > calls_mC_hmC_odyC.bam
This should work according to the cmd you wrote, unless I'm missing something? The remora model was exported to dorado and works with the same cmd if it runs alone. I'm sure it's the positioning that I'm just not getting the hang of
Hi @tijyojwad
Splitting your command by whitespace shows you have 3 positional arguments here.
I believe only 2 positional arguments are expected, for model and data respectively.
v313@v313-GP66-Leopard-11UH:~$
dorado
basecaller
/home/v313/Dorado/models/[email protected] # (model)
/media/v313/T7\ # (data)
Shield/2023_12_06_FAT_PL/bc14/ # (?)
--modified-bases-models
/home/v313/Dorado/models/[email protected]_5mCG_5hmCG@v1
/media/v313/Dorado/models/remora/odyC_v1.0/
--reference
/home/v313/ref-seqs/gm119_ref.fasta
> calls_mC_hmC_odyC.bam
Kind regards, Rich
I assume this was addressed to me. The T7 Shield is just a storage device for some of the data. It is not the problem here, since the whitespace is escaped via \ and treated as a normal character. The additional
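The point about the escaped space can be checked in the shell itself. The sketch below uses a small helper, `count_args`, which is introduced here only for illustration; it prints how many arguments the shell passes after word splitting. The path does not need to exist for the check.

```shell
# Hypothetical helper: print the number of arguments received.
count_args() {
    printf '%s\n' "$#"
}

# Backslash-escaped space: the shell passes the path as one argument.
count_args /media/v313/T7\ Shield/2023_12_06_FAT_PL/bc14/

# Unescaped space: the same text is split into two arguments.
count_args /media/v313/T7 Shield/2023_12_06_FAT_PL/bc14/
```

The first call reports one argument and the second reports two, so the data path in the command above counts as a single positional argument.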
Hi @jorisbalc ,
So to clarify, your command works without the <custom_model> but not with it?
v313@v313-GP66-Leopard-11UH:~$ dorado basecaller
/home/v313/Dorado/models/[email protected]
/media/v313/T7\ Shield/2023_12_06_FAT_PL/bc14/
--modified-bases-models
/home/v313/Dorado/models/[email protected]_5mCG_5hmCG@v1
<custom_model>
--reference
/home/v313/ref-seqs/gm119_ref.fasta
> calls_mC_hmC_odyC.bam
Correct, adding a second model after --modified-bases-models causes it to be treated as a third positional argument.
Example:
dorado basecaller <model> --modified-bases-models <model_mC_hmC> > calls.bam
works
dorado basecaller <model> --modified-bases-models <model_mC_hmC> <custom model> > calls.bam
does not work
I've tested locally and had the same issue.
Fortunately, the solution is as easy as it is easy to miss! We require a comma-separated list of modbase models.
dorado basecaller --help
...
--modified-bases-models a comma separated list of modified base models [default: ""]
...
dorado basecaller <model> --modified-bases-models <model_mC_hmC>,<custom model> > calls.bam
Kind regards, Rich
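When the model paths are held in separate variables or come from a script, the comma-separated value can be built before calling dorado. This is a minimal sketch: `join_models` is a hypothetical helper, the two paths are placeholders, and the final command is only printed, not executed.

```shell
# Hypothetical helper: join all arguments with commas.
# The function body runs in a subshell, so changing IFS does not leak out.
join_models() (
    IFS=,
    printf '%s\n' "$*"
)

# Placeholder paths; substitute the real model directories.
modbase_list=$(join_models "models/model_mC_hmC" "models/custom_model")

# Print the resulting command line instead of running it.
printf 'dorado basecaller <model> <data> --modified-bases-models %s > calls.bam\n' "$modbase_list"
```

Because the joined value contains no unquoted whitespace, the shell passes it to `--modified-bases-models` as a single argument, which avoids the extra positional argument seen earlier in the thread. Paths that contain spaces would still need quoting when the real command is run.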
Seems that I have overlooked this. Works fine now, thanks for the replies!
My bad @jorisbalc, I missed the , as well! Thanks @HalfPhoton for looking into this!
I am new... can someone help me to solve this error? Thank you.
(base) [mmolla@node304 ~]$ dorado basecaller /gpfs2/scratch/mmolla/Pod*.pod5 [email protected] --modified-bases-models >/gpfs2/scratch/mmolla/HL3/calls.bam
[2024-04-15 12:56:19.354] [info] Running: "basecaller" "/gpfs2/scratch/mmolla/Pod*.pod5" "[email protected]" "--modified-bases-models"
[2024-04-15 12:56:19.400] [info] > Creating basecall pipeline
[2024-04-15 12:56:19.414] [error] toml::parse: file open error -> /gpfs2/scratch/mmolla/Pod*.pod5/config.toml
(base) [mmolla@node304 ~]$ dorado basecaller /gpfs2/scratch/mmolla/Pod*.pod5 [email protected] --modified-bases-models >/gpfs2/scratch/mmolla/HL3/calls.bam
dorado basecaller /gpfs2/scratch/mmolla/Pod/*.pod5 [email protected] --modified-bases-models > /gpfs2/scratch/mmolla/HL3/calls.bam
@habibsaky, I've split this into a separate issue for you - https://github.com/nanoporetech/dorado/issues/749.
Closing this ticket as there have been changes to the models documentation, and the other issues have been resolved or moved to new issues.