base question
Hi,
My goal is to take a set of .fast5 data derived from 9.4 flow cells and base call it with Bonito. What I'm unclear about is whether I need to train my own model following your instructions, or if I can skip the training step, use an existing model you've already made available (dna_r9.4.1), and proceed directly to base calling as you describe in the opening section of this repo:
bonito basecaller dna_r9.4.1 /data/reads > basecalls.fasta
If there is significant value in training with my own reads, I was wondering if it would be advised to use all the data available, or just some fraction. You provide minimums to consider in a .ipynb notebook, but I wondered if perhaps there were other ballpark thresholds to target when many more reads are available. I'm working with about 15x coverage of a 2Gb mammalian (bat) genome, and figured all those squiggles probably weren't necessary to build a model for Bonito to use.
If there was documentation available elsewhere on this repo or the Nanopore Community page, I apologize for the redundant question.
Thank you
Hey @devonorourke
You can absolutely skip training your own model and start out with the provided dna_r9.4.1 model.
I'd start with only a fraction of reads for training, as you suggest; this is, however, an active area of research for us and something we are currently experimenting with. In the next release we will make it simpler to take the pre-trained dna_r9.4.1 model and fine-tune it with your own reads, which will probably be a good next step after trying the provided model.
All the documentation is here so you came to the right place.
HTH
Chris.
Wonderful - thank you very much!
Minor follow-up, Chris -
Do you want questions related to Bonito usage posted to this repo, or would you prefer them posted to the Nanopore Community page? I thought I'd try one question here first, but happy to repost whenever you recommend.
I've run the program on a tiny set of my own data:
bonito basecaller dna_r9.4.1 /path_to/test_files > basecalls_bonito_test.fasta
with this brief summary in the .log file:
> loading model
> completed reads: 1581
> duration: 0:15:52
> samples per second 6.0E+04
> done
Given that I have about 5 million reads to process, I think that extrapolates to something like 40 days of base calling 😩 ?
I can see from the help menu there is a --chunksize CHUNKSIZE parameter that might allow me to speed things up a bit, but perhaps the default settings are as good as I'm going to get with a single Tesla GPU device?
If any users have info about optimization strategies it would be great to hear. I believe our cluster has two Tesla GPUs available, so perhaps I can cut that in half if Bonito can access them simultaneously? Alternatively, maybe I need to tell our compute cluster's job scheduler software (Slurm) that it should allocate more of the Tesla GPU to this task.
No worries if the whole process takes 40 days - just wanted to ensure I wasn't missing something obvious. With Guppy, for example, I understand I can modify the CPU resources with the --num_callers and --cpu_threads_per_caller parameters, but it wasn't clear whether there was something equivalent for GPU usage with Bonito.
Have a great day
Here is perfect. You shouldn't have to tweak any settings to get good performance with Bonito.
That's extremely slow (~30X slower than a V100) - what GPU do you have?
Very sorry Chris - I'm brand new to GPU use. I think the answer to your question is a Tesla GPU (I specify that type in my Slurm shell script). If there is another way to interrogate the cluster for more specific GPU details, I'm happy to try. Just let me know what you'd recommend.
Appreciate your help as I stumble forward :)
Try running nvidia-smi on your GPU node. It will give you the GPU info.
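If the job goes through Slurm, it needs to run inside the allocation so it reports the GPU the job was actually given - for example (just a sketch):

```bash
# run from within the job script (or an interactive allocation) so Slurm's assigned GPU is visible
nvidia-smi

# or query only the device name and memory
nvidia-smi --query-gpu=name,memory.total --format=csv
```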
Thank you for the advice with the command, @defendant602.
Initially I did manage to get nvidia-smi to run properly, but it revealed yet another mistake - it appeared that I was running on a K80 GPU device, not the V100 that @iiSeymour mentioned earlier. Eventually, with a modified Slurm script, I was able to specify the V100 device on our cluster. This led to a clear improvement over the earlier test:
> loading model
> completed reads: 989
> duration: 0:02:25
> samples per second 2.4E+05
(Side note: I'm inputting ~1000 reads instead of the ~1500 used in the first test I presented earlier in this thread because I was tired of waiting for 15 minutes; I swear I'm generally a patient person.)
This was using the default settings, and I'm curious whether a rate of about 2.5 minutes per 1000 reads now falls into the realm of 'expected'? I'd greatly appreciate any advice on how best to tackle the 5 million sequences I have to base call; perhaps there are particular parameters within Bonito to specify? At the current rate, it would take about 9 days to process all the data, I think? Alternatively, perhaps it makes sense to use the many K80 GPU devices available on the university HPC and spread the data out across separate submission requests.
If it makes any difference, I have duplicate data sets: one with .fast5 files bundled into 10,000 sequences per .fast5, and one in a single-read format with each sequence as its own .fast5 file. Not sure how that impacts the base calling process (if at all).
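If spreading the work across separate submissions is the way to go, I imagine splitting the multi-read .fast5s into subsets and submitting one job per subset, something like the sketch below (the paths, the subset count, and the GPU request are placeholders for my cluster):

```bash
# divide the multi-read .fast5 files into 4 subsets via symlinks (placeholder paths/count)
mkdir -p subset_{1..4}
i=0
for f in /path_to/fast5_multi/*.fast5; do
    ln -s "$f" "subset_$(( i % 4 + 1 ))/"
    i=$(( i + 1 ))
done

# submit one basecalling job per subset, each requesting a single GPU
for n in 1 2 3 4; do
    sbatch --gres=gpu:1 --wrap "bonito basecaller dna_r9.4.1 subset_$n > basecalls_$n.fasta"
done
```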
Thanks for your support and advice (and of course, for the base calling software itself!)
Oh good, I suspected K80s given the previous performance, and while 2.5E+05 is better, it's still almost 10X slower than what should be possible on a V100 (with the default parameters). You will see the best performance when using multi-read files. The next thing I would check is the bandwidth between the storage and the compute node. Is the data on a slow NFS share, for example? Can you run a small test with some reads copied to local storage?
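Something like this would do for the test (the paths are placeholders; /tmp just stands in for whatever node-local disk is available):

```bash
# copy a small batch of reads to node-local disk and basecall from there
mkdir -p /tmp/bonito_local_test
cp /path_to/test_files/*.fast5 /tmp/bonito_local_test/   # source path is a placeholder
bonito basecaller dna_r9.4.1 /tmp/bonito_local_test > basecalls_local_test.fasta
# then compare the reported "samples per second" with the run that read from shared storage
```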
I'm not sure whether my files are on local storage (though I think they are; see below). What would be the best way to determine that?
This bit is from our resources page describing the cluster's hardware:
Monsoon is a capacity-type, Linux-based computer cluster with 2860 Intel Xeon cores, 24TB of memory, and 20 NVIDIA GPUs: K80, P100, and V100. It has been designed to be flexible and handle a diverse set of research requirements. 104 individual systems are interconnected via FDR InfiniBand at a rate of 56Gbps and <.07us latency. Cluster nodes have access to 1.3PB of shared storage of type scratch (lustre), and long-term project space (ZFS). Monsoon has a measured peak CPU performance of 107 teraflops.
The files used in this test were put into a directory named /scratch (technically /scratch/dro49/bonito_test...). I believe this is the local storage directory you were referring to. From another resource page describing cluster storage:
Storage:
- /scratch (450TB)
  - This is the primary shared working storage
  - Write/read your temporary files, logs, and final products here
  - 30 day retention period on files, emails sent at 28 days for warning - no quotas
- /tmp (120GB) - local node storage
I thought that perhaps /scratch was the local spot to copy files to, but perhaps it is instead /tmp?
Thank you for any recommendations as to where it makes the most sense to copy these reads. I think my next step is to compare what happens when the reads are copied to /tmp (assuming I can get that to work).
Cheers
Best to test both locations @devonorourke to compare against the baseline ~2.4E+05 you have already seen on V100s.
Hi @iiSeymour ,
I managed to test copying the files to the local /tmp directory. The small test batch was processed slightly faster, but not on the order of a 10x improvement as hoped. A brief summary of the 989 completed reads:
| | /scratch | /tmp |
|---|---|---|
| duration | 0:02:25 | 0:02:20 |
| samples per second | 2.4E+05 | 2.5E+05 |
Not sure if there are any other methods to improve performance, but happy to learn from your recommendations.
Cheers