
Model convention and availability of the aligner in buttery-eel

Open hasindu2008 opened this issue 5 months ago • 17 comments

@SBurnard Moving this https://github.com/hasindu2008/nci-scripts/issues/1 conversation to here as these questions are about buttery-eel.

It is a great suggestion about the model conversions. @Psy-Fer, we should maintain a page in https://github.com/Psy-Fer/buttery-eel/blob/main/docs/ that maps server model names to standalone Dorado model names. The tricky thing is that these models keep changing from version to version, so perhaps we can document the use of the following command instead.

cd /path/to/ont-dorado-server/data/
grep "model" *.cfg | tr ':' '\t' | tr '=' '\t' | awk '{print $1"\t"$2"\t"$3}' | sort -k1,1

@SBurnard Buttery-eel relies on ONT's dorado-server (which does the live basecalling in MinKNOW) to do the basecalling. So this model configuration naming convention comes from ONT's dorado-server, and for some reason standalone Dorado uses a different convention. This is how I find the models on Gadi:

cd /g/data/if89/apps/buttery-eel/0.5.1+dorado7.4.12/ont-dorado-server/data/
grep "model" *.cfg | tr ':' '\t' | tr '=' '\t' | awk '{print $1"\t"$2"\t"$3}' | sort -k1,1
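
If this lookup is needed often, the one-liner above could be wrapped in a small shell function. This is only a sketch: the name `list_server_models` is made up here, and the two demo `.cfg` files (with their `model_file` lines) are stand-ins created in a temporary directory, not real ONT configs.

```shell
#!/usr/bin/env bash
# Sketch: list the "model" entries of every .cfg file in a dorado-server data directory.
# list_server_models is a hypothetical helper name, not part of buttery-eel.
list_server_models() {
    local data_dir="$1"
    # Run in a subshell so the caller's working directory is left unchanged.
    (
        cd "$data_dir" || exit 1
        grep "model" *.cfg | tr ':' '\t' | tr '=' '\t' \
            | awk '{print $1"\t"$2"\t"$3}' | sort -k1,1
    )
}

# Demo on made-up config files (names and contents are illustrative only).
# Two files are used so that grep prefixes each match with its file name.
demo_dir="$(mktemp -d)"
printf 'model_file = hac_model.jsn\n' > "$demo_dir/example_hac.cfg"
printf 'model_file = fast_model.jsn\n' > "$demo_dir/example_fast.cfg"
demo_out="$(list_server_models "$demo_dir")"
printf '%s\n' "$demo_out"
rm -rf "$demo_dir"
```

On a real installation the call would be `list_server_models /path/to/ont-dorado-server/data/`, giving one tab-separated line per config: config file, key, model name.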

About the second question: slow5-dorado is a fork of standalone Dorado, so all the extra features in Dorado, such as alignment, are there. But we have not made a release recently, because:

  1. Dorado has a zillion dependencies and takes a few days to get compiled
  2. The codebase gets turned upside down between versions, making it hard to keep adding the slow5 support

The good thing with the dorado-server is we can simply get the binary from ONT and use the client-server approach (implemented in buttery-eel) to access BLOW5 files.

I am not sure if the Dorado server supports alignment. @Psy-Fer, does it? However, even if it does, I personally believe that keeping basecalling and alignment modular has greater benefits:

  1. The user can transparently know which minimap version and parameters they use, can tune parameters for their needs and even change to a different aligner if they wish
  2. Having it separate means that users will likely cite those aligners they use, which would otherwise be just buried under "Dorado"
  3. I would rather trust standalone minimap2 than a modified version coming from ONT. In fact, in f5c, several issues that arose were finally attributed to some weird thing in Dorado alignment
  4. ONT has a track record of NOT honouring backward compatibility, so there is a chance that the API for getting the alignment information would keep changing (so we would have one more thing to rewrite every time)
  5. Having separate modules improves maintainability. The "one tool does all the things" approach leads to complex systems that have their own set of problems, and would create a dependency and maintenance nightmare.
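
As a rough sketch of what the modular route looks like in practice: basecall with buttery-eel first, then align with a standalone minimap2. All paths and file names below are placeholders, the `run` dry-run wrapper is made up for this example, and the flags shown are assumptions to check against `buttery-eel --help` and the minimap2 manual for the versions you use.

```shell
#!/usr/bin/env bash
# Sketch of a two-step, modular workflow: basecall first, then align separately.
# Every path and file name here is a placeholder; the flags are assumptions
# to verify against the help output of the installed tools.

DORADO_SERVER_BIN="/path/to/ont-dorado-server/bin"
CONFIG="your_model.cfg"   # choose one from the .cfg files in ont-dorado-server/data/
READS_BLOW5="reads.blow5"
READS_FASTQ="reads.fastq"
REF="ref.fa"

# Hypothetical helper: print each command instead of executing it,
# unless DRY_RUN=0 is set in the environment.
run() {
    if [ "${DRY_RUN:-1}" = "1" ]; then
        printf '%s\n' "$*"
    else
        "$@"
    fi
}

# Step 1: basecall the BLOW5 file through the dorado-server (client-server approach).
run buttery-eel -g "$DORADO_SERVER_BIN" --config "$CONFIG" --device cuda:all \
    -i "$READS_BLOW5" -o "$READS_FASTQ" --port 5000 --use_tcp

# Step 2: align with a standalone minimap2, whose version and parameters you control.
run minimap2 -ax map-ont -o aln.sam "$REF" "$READS_FASTQ"
```

By default this only prints the two commands. Swapping step 2 for a different aligner or different minimap2 parameters does not touch the basecalling step, which is the point of keeping them separate.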

I can go on .....

Let me cite the following extract from Heng Li, Minimap and miniasm: fast mapping and de novo assembly for noisy long sequences, Bioinformatics "We hope in this process, the community could standardize the input and output formats of various tools, so that a developer could focus on a component he or she understands best. Such a modular approach has been proved to be fruitful in the development of short-read tools—in fact, the best short-read pipelines all consist of components developed by different groups—and will be equally beneficial to the future development of long-read mappers and assemblers."

I understand that having a single command that runs everything could be convenient, but I am not sure it is really worth it considering the above factors. What do you think?

hasindu2008, Oct 01 '24 04:10