Does the Bam file need to be sorted or unsorted?

Open niradsp opened this issue 4 years ago • 9 comments

Hello. The EM algorithm used by RSEM, for example, requires an unsorted transcriptome BAM file. Does Bambu have any such requirement? Also, I have 8 Nanopore cDNA files, all compressed, sorted, and indexed. We did multiple runs, so each file is roughly 20 GB in size.
I was wondering how much memory and how many threads would be required to finish within 24 hours. We will have 50 files in total coming soon. I would like to test with these 8 files first to see what kind of data is produced. We also have Illumina data, and I would like to compare.

niradsp avatar Aug 29 '21 08:08 niradsp
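
Editorial note: one way to check how a BAM file is sorted before handing it to any tool is to read the `SO` field of the `@HD` header line. The sketch below assumes the Rsamtools package is installed; `sort_order` is a hypothetical helper and the file name is a placeholder, neither comes from this thread.

```r
# Sketch only: sort_order() is a hypothetical helper, the path is a placeholder.

# Extract the sort order from the fields of an @HD header line,
# e.g. c("VN:1.6", "SO:coordinate"). Returns "unknown" if no SO field exists.
sort_order <- function(hd_fields) {
  so <- grep("^SO:", hd_fields, value = TRUE)
  if (length(so) == 0) {
    return("unknown")
  }
  sub("^SO:", "", so[1])
}

bam <- "sample1.sorted.bam"  # placeholder path
if (requireNamespace("Rsamtools", quietly = TRUE) && file.exists(bam)) {
  # scanBamHeader() returns one entry per file; $text holds the header lines
  # keyed by record type ("@HD", "@SQ", ...).
  header <- Rsamtools::scanBamHeader(bam)[[1]]
  message(bam, ": ", sort_order(header$text[["@HD"]]))
}
```

A value of `coordinate` means the file is position-sorted. The presence of a `.bai` file next to the BAM is a separate check.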

Hi @niradsp, yes, the BAM file needs to be a sorted and indexed genome BAM file, so you can run bambu on your dataset directly. As for memory and threads, my suggestion would be to set a relatively small yieldSize, say 1000000; with that, 64/128 GB of memory and 8/10 threads should be sufficient. If that does not work, I would suggest running bambu twice: the first time to process all the read class files and save them to a prespecified rcOutDir, and the second time to proceed with the rest.

If you still cannot get it to run, you may try our development branch, bambu-dev-bioc, which should be more memory- and time-efficient. The problem with this branch is that it is not finalized yet, and we are still actively working on it. We are aiming to release it in a few weeks' time. If you are concerned, you can just wait until we release the new version.

Thank you!

cying111 avatar Sep 01 '21 06:09 cying111
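
Editorial note: the two-pass run suggested above might look roughly like the sketch below. It is not the maintainers' code. The argument names (`reads`, `rcOutDir`, `ncore`, `yieldSize`, `discovery`, `quant`) are assumed from recent bambu documentation, and older releases may instead expect the cached read class files through a separate `rcFile` argument, so check `?bambu` for your installed version. `pass1_args` and `pass2_args` are hypothetical helpers, and all paths are placeholders.

```r
# Sketch only: helper names and paths are hypothetical; bambu argument names
# are assumed from recent documentation and may differ between versions.

# Pass 1: build read class files only and cache them in rc_dir.
pass1_args <- function(bam_files, annotations, genome, rc_dir,
                       ncore = 8, yield_size = 1e6) {
  list(reads = bam_files, annotations = annotations, genome = genome,
       rcOutDir = rc_dir, ncore = ncore, yieldSize = yield_size,
       discovery = FALSE, quant = FALSE)
}

# Pass 2: restart from the cached .rds read class files.
pass2_args <- function(rc_dir, annotations, genome, ncore = 8) {
  rc_files <- list.files(rc_dir, pattern = "\\.rds$", full.names = TRUE)
  list(reads = rc_files, annotations = annotations, genome = genome,
       ncore = ncore)
}

bam_files <- c("sample1.sorted.bam", "sample2.sorted.bam")  # placeholders
if (requireNamespace("bambu", quietly = TRUE) && all(file.exists(bam_files))) {
  annotations <- bambu::prepareAnnotations("annotation.gtf")
  do.call(bambu::bambu,
          pass1_args(bam_files, annotations, "genome.fa", "rc_cache"))
  se <- do.call(bambu::bambu,
                pass2_args("rc_cache", annotations, "genome.fa"))
}
```

Splitting the run this way means the memory-heavy BAM parsing happens once, and discovery and quantification can be repeated from the cache with different settings.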

Thank you @cying111 for the response. I will try it as you suggested very soon, and if that does not work, I will try out the development branch. I don't mind waiting for the new version, and I am excited to test it out. Also, have you tested it on PromethION data? It tends to be bigger, and we will be moving from MinION to PromethION very soon. Furthermore, the 50-sample dataset has been sequenced using Illumina as well. Does Bambu do splice site correction using Illumina data (like FLAIR)? StringTie now allows correction using Illumina data along with an annotation GTF file. If not, is this something that will be added in the future? Thank you so much. Nirad

niradsp avatar Sep 01 '21 11:09 niradsp

Yes, we have tested PromethION datasets on the development branch. You can try the current version first; if it is not able to run, you can switch to the development branch or wait for the new version.

We don't have splice junction correction using Illumina data for now. But yes, that is a good idea, and we definitely want to have that as a feature in bambu! Thanks for the suggestions!!

cying111 avatar Sep 02 '21 12:09 cying111

Thank you for your help. I must say that this is a very well-designed program, and it finished running within 2 hours. I have been looking at a particular gene and its isoforms. The isoforms have shown differential usage in Illumina data, and in fact, when I ran FLAIR as well as StringTie, the isoforms showed up. However, Bambu is showing a different secondary isoform. In other words, the primary isoform is the same, but the secondary one is different. In both cases, I am using Illumina data for correction (in the case of StringTie, I am using the --mix option). (Actually, even TALON is showing the same pair of isoforms, and this one does not use Illumina data.)

Are there maybe any parameters that I can tweak? The BAM files also match the other programs. Bambu is suggesting an intron retention event, which I am not seeing at all. Thanks in advance.

niradsp avatar Sep 08 '21 22:09 niradsp

Hi @niradsp, thanks for letting us know about the performance of bambu; this is very helpful. I think I know what the issue could be, and we fixed it in the development branch. I would suggest trying that branch to see if there is any improvement for that particular gene. If the issue persists, you can try changing this parameter in the opt.discovery argument by setting opt.discovery = list(min.primarySecondaryDistStartEnd2 = 100000) and see.

cying111 avatar Sep 13 '21 04:09 cying111
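
Editorial note: as a rough illustration of how that setting would be passed, the sketch below builds the `opt.discovery` list and hands it to `bambu()`. The option name and value are copied from the reply above; whether a given bambu version accepts this option is an assumption, and all file paths are placeholders.

```r
# Sketch only: paths are placeholders; the option name is taken from the
# reply above and may not exist in every bambu version.
opt_discovery <- list(min.primarySecondaryDistStartEnd2 = 100000)

bam_files <- c("sample1.sorted.bam", "sample2.sorted.bam")  # placeholders
if (requireNamespace("bambu", quietly = TRUE) && all(file.exists(bam_files))) {
  annotations <- bambu::prepareAnnotations("annotation.gtf")
  se <- bambu::bambu(reads = bam_files, annotations = annotations,
                     genome = "genome.fa", opt.discovery = opt_discovery)
}
```

Note that the argument is spelled `opt.discovery` and takes a named list, so further discovery options can be added as extra list elements.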

Hello @cying111, Thank you for the explanation. I downloaded the development version and ran it the way that you suggested. Are there any other tweaks I can make? There are a couple of other genes that are not showing up (but that do show up in the StringTie data). Also, how well has Bambu been tested on PacBio data? It seems like I will also be getting PacBio data.

Thank you in advance, Nirad

niradsp avatar Sep 27 '21 21:09 niradsp

Hi @niradsp, yes, we have tested it on PacBio data, and it worked pretty well. For those genes that are not showing up, are they novel genes? If yes, you can increase the max.txDNR parameter in the opt.discovery argument, for example to 0.2, or even 0.5, and see. This parameter controls the threshold that we use to filter novel transcripts.

cying111 avatar Oct 04 '21 01:10 cying111

Hi, sorry for the late response. We are finally getting our data. The isoform in question is not novel. It is quite detectable via Illumina, and FLAIR found it as well.
Bambu, though, is suggesting a cryptic exon (which results in nonsense-mediated decay).
We will see, but maybe with 42 samples it will show up. Should I run a program such as TranscriptClean for junction correction? I have Illumina data as well. Thanks. P.S. The difference between the two isoforms (that should show up) appears to be in the first exon. I think that even there, there is an overlap.

Edit: I added a clarification. I thought the isoform was an intron retention event due to the fact that it was marked as "Nonsense Mediated Decay". Upon further examination, it appears to be a retention of a cryptic exon.

Is there an email address at which I can reach you? I can provide you with more information. I tried turning off the subset filtering, but to no avail.

niradsp avatar Jan 25 '22 03:01 niradsp

Hi @niradsp,

Glad that you finally got your data. For the cryptic exon isoform, you may check whether it has full-length or unique read support as further evidence before more samples come in. As for junction correction, bambu already performs junction correction when processing the reads, so I would not expect much improvement from TranscriptClean, though I am not very sure about this, as we didn't try TranscriptClean ourselves.

Sure, if you'd like, we can further discuss this through email, here is my email address: [email protected]

cying111 avatar Jan 27 '22 00:01 cying111