Lower base quality and more indels in synthetic data than in actual data
Hi, James.
I generated synthetic data using squigulator and then basecalled it with buttery-eel. The synthetic data show many more indels and lower base quality scores than the actual data. Below is an IGV screenshot (upper panel: squigulator + buttery-eel data; lower panel: actual amplicon data sequenced on an R10 flow cell, library prep kit NBD114, basecalled with SUP).
Here are the commands used to generate the synthetic data: `config=dna_r10.4.1_e8.2_400bps_sup.cfg
#create artificial datasets
time $squigulator -x dna-r10-prom -f ${d} -t 8 -r ${r} -q $outdir/${i}${d}"x"${r}ideal_${n}.fasta \
  --bps 400 --ont-friendly=yes $ref/${i}.fasta -o $datadir/${i}${d}"x"${r}_${n}.blow5
#basecall
time buttery-eel -g $basecaller --config $config --device cuda:1 -i $datadir/${i}${d}"x"${r}${n}.blow5 -o $outdir/${i}${d}"x"_${r}buttery-eel${n}.fastq \
  --port auto --use_tcp --dorado_download_path $dorado_download_path
`
The mean base quality is 13.3 for the synthetic data and 34.8 for the actual data.
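For anyone comparing these numbers, a mean base quality can be computed in more than one way, and the two common definitions can differ a lot on the same reads. The sketch below is only an illustration, not how the figures above were produced. It assumes a plain 4-line-per-record FASTQ with Phred+33 encoding, and `mean_baseq` is a hypothetical helper name. It prints both the arithmetic mean of the Phred scores and the quality corresponding to the mean error probability.

```shell
# mean_baseq [FASTQ...]: hypothetical helper, reads stdin when no file is given.
# Assumes 4-line FASTQ records and Phred+33 quality encoding.
mean_baseq() {
  awk '
    # Build a character -> ASCII code lookup table (awk has no ord()).
    BEGIN { for (i = 33; i <= 126; i++) ord[sprintf("%c", i)] = i }
    # Every 4th line of a record is the quality string.
    NR % 4 == 0 {
      n = length($0)
      for (j = 1; j <= n; j++) {
        q = ord[substr($0, j, 1)] - 33
        sum_q += q                 # for the arithmetic mean of Phred scores
        sum_p += 10 ^ (-q / 10)    # for the mean error probability
        total++
      }
    }
    END {
      if (total == 0) { print "no bases found" > "/dev/stderr"; exit 1 }
      printf "mean_phred=%.2f\terror_based_q=%.2f\n", \
        sum_q / total, -10 * log(sum_p / total) / log(10)
    }
  ' "$@"
}
```

Running `mean_baseq synthetic.fastq` and `mean_baseq actual.fastq` (placeholder file names) with the same definition on both files keeps the comparison like for like.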
Thank you in advance for your help.
Hey,
If you converted the blow5 files to pod5 using the blue-crab converter, then ran with dorado-server or dorado, you would find you get a similar result.
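A minimal sketch of that cross-check, assuming `blue-crab` and `dorado` are installed and on `PATH`. The function name `crosscheck_with_dorado`, the output naming, and the `MODEL` argument (a dorado model name or path you have already downloaded) are placeholders for illustration, not something taken from this thread.

```shell
# crosscheck_with_dorado BLOW5 OUT_PREFIX MODEL
# Hypothetical helper: convert a BLOW5 file to POD5, then basecall it with
# dorado directly, so the result can be compared with the buttery-eel output.
crosscheck_with_dorado() {
  if [ "$#" -ne 3 ]; then
    echo "usage: crosscheck_with_dorado BLOW5 OUT_PREFIX MODEL" >&2
    return 2
  fi
  # SLOW5/BLOW5 -> POD5 using blue-crab's s2p subcommand.
  blue-crab s2p "$1" -o "$2.pod5" || return 1
  # Basecall the converted file; MODEL is whatever model you normally use.
  dorado basecaller "$3" "$2.pod5" --emit-fastq > "$2.dorado.fastq"
}
```

If the FASTQ from this route shows the same low qualities as the buttery-eel FASTQ, the basecalling wrapper is not the cause.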
This looks to be a Squigulator question rather than a basecalling one. I would ask @hasindu2008 over there about it.
https://github.com/hasindu2008/squigulator
Cheers.
Hi! I have already shown this to Hasindu and Martin, and they suggested training a pore model. I am having some issues with the HPC at the moment, but it is on my to-do list.
Thank you.
Hi @npdungca
Can you move this issue to Squigulator? As @Psy-Fer mentioned, this has nothing to do with the basecaller buttery-eel. The cause is the low quality of the pore model used for R10 data. The pore model for squigulator is derived from the ONT-provided R10 model, which does not include the standard deviations needed for generating noise. I derived these standard deviation values using a quick heuristic method, which is not very accurate.
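To illustrate why the standard deviation column matters, here is a toy sketch of level-based noise generation. It is not squigulator's actual code, and it makes several assumptions: a three-column, tab-separated model (k-mer, mean level, standard deviation), one sample per k-mer, and Gaussian noise. `simulate_levels` and the `STDV_SCALE` knob are hypothetical names introduced for the example.

```shell
# simulate_levels MODEL_TSV SEQUENCE [STDV_SCALE] [SEED]
# Toy illustration only: emits one noisy current level per k-mer of SEQUENCE.
simulate_levels() {
  awk -v seq="$2" -v scale="${3:-1}" -v seed="${4:-1}" '
    BEGIN { FS = "\t"; srand(seed); pi = atan2(0, -1) }
    # Standard normal draw via the Box-Muller transform.
    function gauss(   u1, u2) {
      u1 = 1 - rand()   # in (0, 1], so log(u1) is finite
      u2 = rand()
      return sqrt(-2 * log(u1)) * cos(2 * pi * u2)
    }
    # Load the model rows: k-mer, mean level, standard deviation.
    { mean[$1] = $2; stdv[$1] = $3; k = length($1) }
    END {
      for (i = 1; i + k - 1 <= length(seq); i++) {
        kmer = substr(seq, i, k)
        if (!(kmer in mean)) { print "missing k-mer: " kmer > "/dev/stderr"; exit 1 }
        # Noisy level = model mean + scaled standard deviation * N(0, 1).
        printf "%s\t%.3f\n", kmer, mean[kmer] + scale * stdv[kmer] * gauss()
      }
    }
  ' "$1"
}
```

With `STDV_SCALE` set to 0 the output is just the model means. Raising it spreads the levels further from the means, which is the kind of effect an overestimated standard deviation would have on the simulated signal and, in turn, on basecalling accuracy.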