guppy v6

Open aafshinfard opened this issue 1 year ago • 16 comments

Just wanted to ask if there are any plans on releasing a guppy >= v6 base calling of the reads? Thanks.

aafshinfard avatar Jul 27 '22 18:07 aafshinfard

No immediate plans since we're not actively working on CHM13 and we've not found much benefit going to guppy 6+ with our hybrid assembly method.

skoren avatar Aug 01 '22 20:08 skoren

Thanks for the response @skoren

aafshinfard avatar Aug 02 '22 06:08 aafshinfard

Given that I recently downloaded the whole raw signal dataset, I am planning to rebasecall it with Guppy 6. If it succeeds (I am not sure how long it will take) and if your AWS storage can host more data, @skoren, I can provide the basecalls so they can be shared.

hasindu2008 avatar Aug 11 '22 03:08 hasindu2008

@hasindu2008 That would be awesome!

aafshinfard avatar Aug 11 '22 15:08 aafshinfard

@aafshinfard I have recently converted all the raw data to BLOW5 format and basecalled it using the Guppy 6.1.3 hac model. Given the large size of the files, I am not sure how I could share them. Any suggestions?

hasindu2008 avatar Aug 29 '22 14:08 hasindu2008

@hasindu2008 Nice to hear you did it. How large are the files?

aafshinfard avatar Aug 29 '22 17:08 aafshinfard

@hasindu2008 Would be nice if the T2T team can host this (@skoren), but another option would be Zenodo. I heard they support up to 50GB and even more in special cases... https://www.youtube.com/watch?v=S1qK_TA52e4&t=251s

aafshinfard avatar Aug 29 '22 17:08 aafshinfard

@aafshinfard how big is the total file size?

arangrhie avatar Aug 29 '22 17:08 arangrhie

@arangrhie, I opened the issue and @hasindu2008 kindly did the job; waiting for them to respond about the size of the dataset.

aafshinfard avatar Aug 31 '22 19:08 aafshinfard

@arangrhie @aafshinfard

The gzipped basecalled FASTQ files are relatively small and I think can be easily hosted:

288G hg2_merged_pass.fastq.gz
39G hg2_merged_fail.fastq.gz

The raw signal data converted to BLOW5 are 3.4 TB. I had to convert the 5 TB+ of compressed FAST5 tarballs to BLOW5; otherwise, base-calling from FAST5 would have taken a few weeks. It would be useful for the future if those BLOW5 files could be hosted, to allow direct base-calling from S3 storage mounted locally, as well as partial download of certain genomic regions when necessary (see #63). In my opinion, compressed FAST5 tarballs for a dataset this large are not easily accessible, which diminishes the value of a useful dataset like this.

hasindu2008 avatar Sep 01 '22 04:09 hasindu2008

@aafshinfard You may download the merged Guppy 6 basecalls for the whole dataset here:

https://slow5test.s3.amazonaws.com/tmp/chm13_merged_pass.fastq.gz
https://slow5test.s3.amazonaws.com/tmp/chm13_merged_fail.fastq.gz

Note that this is not free S3 storage like the bucket used for hosting CHM13, so I would be grateful if you could let me know once you have downloaded the files so that I can delete them. Otherwise, AWS keeps charging.
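Since the hosted copies are meant to be deleted once downloaded, it may be worth checking a download before confirming it. The following is only a minimal sketch: `check_fastq_gz` is a hypothetical helper name, and it assumes standard four-line FASTQ records. It tests the gzip stream for corruption or truncation and then prints the number of reads.

```shell
# Hypothetical helper (not part of any tool mentioned in this thread):
# verify that a gzipped FASTQ is intact and print how many reads it holds.
# Assumes plain 4-line FASTQ records (header, sequence, '+', qualities).
check_fastq_gz() {
    f="$1"
    # gzip -t decompresses to nowhere and fails on a truncated or corrupt file
    gzip -t "$f" || { echo "corrupt or truncated: $f" >&2; return 1; }
    # count lines in the decompressed stream; 4 lines per read
    lines=$(gzip -dc "$f" | wc -l)
    echo $((lines / 4))
}

# Example use on a downloaded file:
#   check_fastq_gz chm13_merged_pass.fastq.gz
```

On files of this size both passes decompress the whole archive, so expect the check to take a while.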

@skoren CHM13 maintainers, feel free to copy these files into your free S3 storage if you think they will be useful to anyone in the future.

Software and versions used for the basecalling: Nanopore raw signal data were downloaded, extracted and then converted to BLOW5 format using slow5tools. They were then basecalled using buttery-eel with Guppy 6.3.7 in high accuracy mode. A Qscore of 7 was used as the pass/fail cut-off.

Base-calling commands:

# basecall GridION data
buttery-eel -i min_grid.blow5 --guppy_bin /install/ont-guppy-6.3.7/bin/ --config dna_r9.4.1_450bps_hac.cfg -x cuda:all -q 7 -o reads_min_grid.fastq --port 5555 --use_tcp

# basecall PromethION data
buttery-eel -i prom.blow5 --guppy_bin /install/ont-guppy-6.3.7/bin/ --config dna_r9.4.1_450bps_hac_prom.cfg -x cuda:all -q 7 -o reads_prom.fastq --port 5556 --use_tcp
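
To illustrate what a Qscore 7 pass/fail cut-off means for an individual read, here is a rough sketch that recomputes a per-read mean Q from the FASTQ quality string. It rests on assumptions not stated in this thread: that qualities are Phred+33 encoded, that records span four lines, and that the mean Q is derived from the mean error probability rather than by averaging the Q values directly. `mean_qscore` is a hypothetical helper, not part of Guppy or buttery-eel, and its output may not match the basecaller's own classification exactly.

```shell
# Hypothetical helper: for each FASTQ record on stdin, print
#   <read_id> <mean_q> <pass|fail>
# Assumptions: 4-line records, Phred+33 qualities, and a mean Q computed
# from the mean per-base error probability. Cut-off defaults to 7.
mean_qscore() {
    awk -v cutoff="${1:-7}" '
    BEGIN {
        # awk has no ord(); build a character -> ASCII code lookup table
        for (i = 33; i <= 126; i++) code[sprintf("%c", i)] = i
    }
    NR % 4 == 1 { id = substr($1, 2) }          # header line, drop leading "@"
    NR % 4 == 0 {                               # quality line
        n = length($0)
        if (n == 0) next
        sum = 0
        for (i = 1; i <= n; i++) {
            q_base = code[substr($0, i, 1)] - 33
            sum += 10 ^ (-q_base / 10)          # Phred Q -> error probability
        }
        q = -10 * log(sum / n) / log(10)        # mean error probability -> Q
        printf "%s %.2f %s\n", id, q, (q >= cutoff ? "pass" : "fail")
    }'
}

# Example use:
#   gzip -dc chm13_merged_pass.fastq.gz | head -n 4000 | mean_qscore 7
```

Averaging error probabilities weights low-quality bases heavily, so a read with a few very poor bases can fall below the cut-off even when most of its bases have high Q values.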

hasindu2008 avatar Nov 14 '22 05:11 hasindu2008

@hasindu2008 Awesome, thank you so much!

aafshinfard avatar Nov 15 '22 04:11 aafshinfard

@hasindu2008 Just started downloading; should be done tonight. Will confirm after it has finished. Thanks again.

aafshinfard avatar Nov 23 '22 01:11 aafshinfard

@hasindu2008 Just confirming that my download was completed. Thank you so much for your help.

aafshinfard avatar Nov 28 '22 23:11 aafshinfard

@aafshinfard No problem, glad to help. If this becomes useful in your work, please consider citing BLOW5, which allowed us to do this basecalling on a very small budget; otherwise it would have cost a fortune.

hasindu2008 avatar Nov 29 '22 23:11 hasindu2008

Sure thing, thank you @hasindu2008

aafshinfard avatar Nov 29 '22 23:11 aafshinfard