bambu Pacbio errors creating se object

Hi!

I have downloaded my genome assembly from here : http://hgdownload.soe.ucsc.edu/goldenPath/hg38/bigZips/hg38.fa.gz

and I have tried these gtf files:

https://ftp.ebi.ac.uk/pub/databases/gencode/Gencode_human/release_39/gencode.v39.annotation.gtf.gz
https://ftp.ebi.ac.uk/pub/databases/gencode/Gencode_human/release_39/gencode.v39.chr_patch_hapl_scaff.annotation.gtf
http://ftp.ensembl.org/pub/release-105/gtf/homo_sapiens/Homo_sapiens.GRCh38.105.chr_patch_hapl_scaff.gtf

to use with the mapped output from IsoSeq3 analysis from Pacbio. I have a single mapped.bam of pooled samples and I have separate sample bam files that have been sequenced separately.

Here are the commands I used: fa.file <- "hg38.fa" test.bam <- BamFile("my.bam") bambuAnnotations <- prepareAnnotations("gencode.v39.chr_patch_hapl_scaff.annotation.gtf") *this one worked for the pooled sample.

Here is a snippet of my bam file and both bam look the same here: ´´´ @HD VN:1.5 SO:coordinate pb:3.0.1 @SQ SN:chr1 LN:248956422 @SQ SN:chr2 LN:242193529 ... @SQ SN:chrX LN:156040895 @SQ SN:chrY LN:57227415 @SQ SN:chrM LN:16569 @SQ SN:GL000008.2 LN:209709 @SQ SN:GL000009.2 LN:201709 @SQ SN:GL000194.1 LN:191469 @SQ SN:GL000195.1 LN:182896 @SQ SN:GL000205.2 LN:185591 @SQ SN:GL000208.1 LN:92689 @SQ SN:GL000221.1 LN:155397 @SQ SN:GL000224.1 LN:179693 @SQ SN:GL000225.1 LN:211173 @SQ SN:GL000226.1 LN:15008 @SQ SN:KI270302.1 LN:2274 @SQ SN:KI270303.1 LN:1942 @SQ SN:KI270304.1 LN:2165 @SQ SN:KI270305.1 LN:1472 @SQ SN:KI270310.1 LN:1201 ´´´

If I check the seqlevels(seqinfo(BamFile(test.bam))) compared to seqlevels(bambuAnnotations), I find that not all of the mapped seqlevels(seqinfo(BamFile(test.bam))) are in seqlevels(bambuAnnotations). My bam seqlevels(seqinfo(test.bam)) goes up to "KI270757.1" but the annotation goes up to "KZ559116.1"

Strangely, seqlevels(seqinfo(test.bam)) is the same for the pooled and individual bam yet when I try to run the same commands with the individual samples I get this error:

Error in validObject(object) : invalid class “GRanges” object: 'seqlevels(seqinfo(x))' and 'levels(seqnames(x))' are not identical In addition: Warning message: In calculateNDROnTranscripts(rowDataCombined) : Less than 50 TRUE or FALSE read classes for precision stabilization. Filtering by prediction score instead

Can you help identify the reason for this error and ss there a way to work around it? Or is there something I am missing here ?

Thanks!

Feb 28 '22 08:02 AlinaO90

Hi,

Since I don't have the bam file its a bit harder for me to troubleshoot, but I will do my best to try and help!

Homo_sapiens.GRCh38.105.chr_patch_hapl_scaff.gtf wouldn't work as it has different scaffold names to the fasta file.

The other gtf files do not seem to have this issue and look fine on my end. Another potential cause is if the bam file and the fasta files do not agree. Could you please check for me if there are seqlevels in the bam file that are missing in the genome file?

Below is how to get the seqlevels for the genome to compare.

genome = "hg38.fa"
genomeSequence <- checkInputSequence(genome)
seqlevels(genomeSequence)

If there are seqlevels in the bam that are not in the fasta, then you need to find the exact fasta that was used to do the mapping. If the fasta contains them all, let me know and I will try troubleshoot further.

-Andre

Mar 01 '22 09:03 andredsim

Hi! Thanks for your response. I am not able to find the function checkInputSequence() in this package and I checked a few other libraries as well. Do you have a name for the package that function is from?

In any case if I check the seqlevels in the individual bam files, they are the same as the labeled chromosomes when I just grep '>' the .fa file in the terminal.

If I check that bam seqlevels match gtf seqlevels for the one that works (below) its the same as the bam that does not work.

seqlevels(test.bam) == seqlevels(bambuAnnotations) [1] TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE [19] TRUE TRUE TRUE TRUE TRUE TRUE TRUE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE [37] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE [55] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE [73] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE [91] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE [109] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE [127] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE [145] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE [163] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE [181] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE

Thanks!

Mar 04 '22 14:03 AlinaO90

Hi,

Sorry about the confusion. checkInputSequence() is an internal function of Bambu so you can run it like so.

library(bambu) genome = "genome.fa" genomeSequence <- bambu:::checkInputSequence(genome) seqlevels(genomeSequence)

Let me know how that looks, and hopefully we can solve it!

Mar 07 '22 01:03 andredsim

No worries. I went back to double check the genome file I used originally for mapping with IsoSeq3 and it was from Ensembl [https://ftp.ebi.ac.uk/pub/databases/gencode/Gencode_human/release_39/GRCh38.primary_assembly.genome.fa.gz] so I use their gtf [gencode.v39.annotation.gtf] which works with the pooled .bam but not the individually sequenced samples.

bamflies

seqlevels(test.bam3) [1] "chr1" "chr2" "chr3" "chr4" "chr5" "chr6" "chr7" "chr8"
[9] "chr9" "chr10" "chr11" "chr12" "chr13" "chr14" "chr15" "chr16"
[17] "chr17" "chr18" "chr19" "chr20" "chr21" "chr22" "chrX" "chrY"
[25] "chrM" "GL000008.2" "GL000009.2" "GL000194.1" "GL000195.1" "GL000205.2" "GL000208.1" "GL000213.1" [33] "GL000214.1" "GL000216.2" "GL000218.1" "GL000219.1" "GL000220.1" "GL000221.1" "GL000224.1" "GL000225.1" [41] "GL000226.1" "KI270302.1" "KI270303.1" "KI270304.1" "KI270305.1" "KI270310.1" "KI270311.1" "KI270312.1" [49] "KI270315.1" "KI270316.1" "KI270317.1" "KI270320.1" "KI270322.1" "KI270329.1" "KI270330.1" "KI270333.1" [57] "KI270334.1" "KI270335.1" "KI270336.1" "KI270337.1" "KI270338.1" "KI270340.1" "KI270362.1" "KI270363.1" [65] "KI270364.1" "KI270366.1" "KI270371.1" "KI270372.1" "KI270373.1" "KI270374.1" "KI270375.1" "KI270376.1" [73] "KI270378.1" "KI270379.1" "KI270381.1" "KI270382.1" "KI270383.1" "KI270384.1" "KI270385.1" "KI270386.1" [81] "KI270387.1" "KI270388.1" "KI270389.1" "KI270390.1" "KI270391.1" "KI270392.1" "KI270393.1" "KI270394.1" [89] "KI270395.1" "KI270396.1" "KI270411.1" "KI270412.1" "KI270414.1" "KI270417.1" "KI270418.1" "KI270419.1" [97] "KI270420.1" "KI270422.1" "KI270423.1" "KI270424.1" "KI270425.1" "KI270429.1" "KI270435.1" "KI270438.1" [105] "KI270442.1" "KI270448.1" "KI270465.1" "KI270466.1" "KI270467.1" "KI270468.1" "KI270507.1" "KI270508.1" [113] "KI270509.1" "KI270510.1" "KI270511.1" "KI270512.1" "KI270515.1" "KI270516.1" "KI270517.1" "KI270518.1" [121] "KI270519.1" "KI270521.1" "KI270522.1" "KI270528.1" "KI270529.1" "KI270530.1" "KI270538.1" "KI270539.1" [129] "KI270544.1" "KI270548.1" "KI270579.1" "KI270580.1" "KI270581.1" "KI270582.1" "KI270583.1" "KI270584.1" [137] "KI270587.1" "KI270588.1" "KI270589.1" "KI270590.1" "KI270591.1" "KI270593.1" "KI270706.1" "KI270707.1" [145] "KI270708.1" "KI270709.1" "KI270710.1" "KI270711.1" "KI270712.1" "KI270713.1" "KI270714.1" "KI270715.1" [153] "KI270716.1" "KI270717.1" "KI270718.1" "KI270719.1" "KI270720.1" "KI270721.1" "KI270722.1" "KI270723.1" [161] "KI270724.1" "KI270725.1" "KI270726.1" "KI270727.1" "KI270728.1" "KI270729.1" "KI270730.1" "KI270731.1" [169] "KI270732.1" "KI270733.1" "KI270734.1" "KI270735.1" "KI270736.1" "KI270737.1" "KI270738.1" "KI270739.1" [177] "KI270740.1" "KI270741.1" "KI270742.1" "KI270743.1" "KI270744.1" "KI270745.1" "KI270746.1" "KI270747.1" [185] "KI270748.1" "KI270749.1" "KI270750.1" "KI270751.1" "KI270752.1" "KI270753.1" "KI270754.1" "KI270755.1" [193] "KI270756.1" "KI270757.1"

seqlevels(test.bam3) == seqlevels(test.bam) * 'test.bam' is the the bam of the pooled samples and works. 'test.bam3' is one of the individually sequenced samples and does not work. [1] TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE [23] TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE [45] TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE [67] TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE [89] TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE [111] TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE [133] TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE [155] TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE [177] TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE

genome fa

(Using your code)

genome <- "originalRef/GRCh38.primary_assembly.genome.fa" genomeSequence <- bambu:::checkInputSequence(genome) seqlevels(genomeSequence) [1] "chr1" "chr2" "chr3" "chr4" "chr5" "chr6" "chr7" "chr8"
[9] "chr9" "chr10" "chr11" "chr12" "chr13" "chr14" "chr15" "chr16"
[17] "chr17" "chr18" "chr19" "chr20" "chr21" "chr22" "chrX" "chrY"
[25] "chrM" "GL000008.2" "GL000009.2" "GL000194.1" "GL000195.1" "GL000205.2" "GL000208.1" "GL000213.1" [33] "GL000214.1" "GL000216.2" "GL000218.1" "GL000219.1" "GL000220.1" "GL000221.1" "GL000224.1" "GL000225.1" [41] "GL000226.1" "KI270302.1" "KI270303.1" "KI270304.1" "KI270305.1" "KI270310.1" "KI270311.1" "KI270312.1" [49] "KI270315.1" "KI270316.1" "KI270317.1" "KI270320.1" "KI270322.1" "KI270329.1" "KI270330.1" "KI270333.1" [57] "KI270334.1" "KI270335.1" "KI270336.1" "KI270337.1" "KI270338.1" "KI270340.1" "KI270362.1" "KI270363.1" [65] "KI270364.1" "KI270366.1" "KI270371.1" "KI270372.1" "KI270373.1" "KI270374.1" "KI270375.1" "KI270376.1" [73] "KI270378.1" "KI270379.1" "KI270381.1" "KI270382.1" "KI270383.1" "KI270384.1" "KI270385.1" "KI270386.1" [81] "KI270387.1" "KI270388.1" "KI270389.1" "KI270390.1" "KI270391.1" "KI270392.1" "KI270393.1" "KI270394.1" [89] "KI270395.1" "KI270396.1" "KI270411.1" "KI270412.1" "KI270414.1" "KI270417.1" "KI270418.1" "KI270419.1" [97] "KI270420.1" "KI270422.1" "KI270423.1" "KI270424.1" "KI270425.1" "KI270429.1" "KI270435.1" "KI270438.1" [105] "KI270442.1" "KI270448.1" "KI270465.1" "KI270466.1" "KI270467.1" "KI270468.1" "KI270507.1" "KI270508.1" [113] "KI270509.1" "KI270510.1" "KI270511.1" "KI270512.1" "KI270515.1" "KI270516.1" "KI270517.1" "KI270518.1" [121] "KI270519.1" "KI270521.1" "KI270522.1" "KI270528.1" "KI270529.1" "KI270530.1" "KI270538.1" "KI270539.1" [129] "KI270544.1" "KI270548.1" "KI270579.1" "KI270580.1" "KI270581.1" "KI270582.1" "KI270583.1" "KI270584.1" [137] "KI270587.1" "KI270588.1" "KI270589.1" "KI270590.1" "KI270591.1" "KI270593.1" "KI270706.1" "KI270707.1" [145] "KI270708.1" "KI270709.1" "KI270710.1" "KI270711.1" "KI270712.1" "KI270713.1" "KI270714.1" "KI270715.1" [153] "KI270716.1" "KI270717.1" "KI270718.1" "KI270719.1" "KI270720.1" "KI270721.1" "KI270722.1" "KI270723.1" [161] "KI270724.1" "KI270725.1" "KI270726.1" "KI270727.1" "KI270728.1" "KI270729.1" "KI270730.1" "KI270731.1" [169] "KI270732.1" "KI270733.1" "KI270734.1" "KI270735.1" "KI270736.1" "KI270737.1" "KI270738.1" "KI270739.1" [177] "KI270740.1" "KI270741.1" "KI270742.1" "KI270743.1" "KI270744.1" "KI270745.1" "KI270746.1" "KI270747.1" [185] "KI270748.1" "KI270749.1" "KI270750.1" "KI270751.1" "KI270752.1" "KI270753.1" "KI270754.1" "KI270755.1" [193] "KI270756.1" "KI270757.1"

So I am not sure what could be the issue. Thanks for your help!!

Mar 07 '22 15:03 AlinaO90

Hi,

Ok this all looks good and shouldn't cause the issue. I propose two options.

Could you send me test.bam3 so I could troubleshoot it from my end. You can email me at andre_sim[at]gis.a-star.edu.sg and we can discuss how to best transfer it.
Alternatively if you are unable to send the data I realized another potential cause. When rereading your first post I noticed one of the warnings "Less than 50 TRUE or FALSE read classes for precision stabilization.". Assuming you have over 1000 reads and are using the correct annotations, this should never happen. This warning only appears if either every single read is perfect, or none of the reads match any annotations. We recently discovered a bug that causes this issue, where Bambu was falsely labeling real read classes as not matching annotations for some reference annotations. The fix for this has been merged into the Master branch a few days ago. You could try rerunning your sample with the latest version from GitHub. Be sure not to load the Bioconductor version as that will not have the bug fix in it yet.

library(devtools)
load_all("path/to/bambu")

Mar 08 '22 02:03 andredsim

Unfortunately as these are patient samples and I am in the EU I think I am not able to share the bam themselves.
I reinstalled bambu from Github, is this version correct "bambu_2.1.6"? It still gives me the error below for the individual bam:

´´´ Start generating read class files | | 0%not all chromosomes present in reference annotations, annotations might be incomplete. Please compare objects on the same reference Junction correction with not enough data, precalculated model is used Not enough data points Transcript model not trained. Using pre-trained models |=======================================================================================| 100%

Finished generating read classes from genomic alignments. |=======================================================================================| 100%

|=======================================================================================| 100%

Less than 50 TRUE or FALSE read classes for precision stabilization. Filtering by prediction score instead No novel transcripts meet the given thresholds Error in validObject(object) : invalid class “GRanges” object: 'seqlevels(seqinfo(x))' and 'levels(seqnames(x))' are not identical ´´´

But I understand there is only so much you can do without the actual .bam so its fine, I can just move on if there is nothing that you can think of that might cause this. Thank you =)

Mar 10 '22 13:03 AlinaO90

Hi, no problems that you cannot send the .bam file, I understand there are often limitations on the data.

I do want to do my best to solve your issue though, especially because it might be an issue we want to fix in the code itself. Are you able to answer any of the follow questions within the constraints of your data privacy rules?

How many reads does this individual bam file have?
How does this compare to the combined bam file that works?
Are you running bambu with rcOutDir = "output/dir"? and does it produce an .rds file in this location?
Can you try running Bambu with and NDR of 1. bambu(reads = sample, annotations = annotations, genome = fa.file, NDR = 1) and let me know what it outputs. The NDR is the novel discovery rate which by default is at a conservative 0.1. I see that your previous output says "no novel transcripts meet the given thresholds" so if we set it to 1, which is the most sensitive we can be (assumes all transcript models are real), this may circumvent the issue.

Mar 11 '22 04:03 andredsim

The individual .bam has 4600 reads using ( samtools view -c test.bam3)
The combined bam file has 47096 using same command
I added it to my new command and now it is producing an .rds file
With NDR =1 it does seem to complete quantification successfully! Although there is a Warning or error output: ´´´ Less than 50 TRUE or FALSE read classes for precision stabilization. Filtering by prediction score instead ´´´ However, if I go to plot as described by one of your colleagues here: https://github.com/GoekeLab/bambu/issues/239 I can get type = "annotation" to work

 plotBambu(se1, group.variable = NULL,
          type = 'annotation',
          gene_id = 'ENSG00000101981.12', transcript_id = NULL)

but not the options 'pca' or 'heatmap'.

 plotBambu(se1, group.variable = NULL,
          type = 'pca',
          gene_id = 'ENSG00000101981.12', transcript_id = NULL)

Error in quantile.default(apply(count.data, 1, sd), 0.5) : 
  missing values and NaN's not allowed if 'na.rm' is FALSE

Really great to get to this point at least,but do you have any ideas about how to resolve that last error? Thanks a lot!

Mar 14 '22 10:03 AlinaO90

Hi, great to hear this works.

Just a few comments first. 4600 reads is very low, we never tested at this limit and I imagine there are very few transcripts which have greater than 2 read support (the minimum) which might explain the the errors encountered. I would recommend being very cautious interpreting this data.

NDR of 1 means that every observed read (with unique exon junctions) is being treated as a new transcript. Potentially you could try lowering this value to find an optimum point between 0.1 and 1 which runs and will give you more confidence, precision in the novel transcripts. If you are only trying to visualize the potential reads in this sample however you can ignore this point.

Looking at the error message from the plotBambu function it seems to be that there are na values in the estimates. I am not too sure how these could be appearing, does this occur in the combined bam too? I believe a simple work around would be removing the transcripts that have no CPM measurement. You can view this with assays(se)$CPM and then remove them with se.temp = se[!is.na(assays(se)$CPM),] or !is.nan() and then try running the the plotBambu function again with se.temp.

Hope this helps but do let me know if it doesn't solve the issue.

Mar 15 '22 02:03 andredsim

Hi again!

I tried your suggestion and sadly still not working. Using the below commands: ´´´ se1a <-bambu(reads = combined.bam, annotations = bambuAnnot, genome = fa.file, NDR=1, rcOutDir = "outFiles/", discovery = TRUE) se1b <-bambu(reads = combined.bam, annotations = bambuAnnot, genome = fa.file, NDR=1, rcOutDir = "outFiles/", discovery = TRUE, stranded=TRUE, opt.discovery = list(min.readFractionByGene = 0.05)) se2 <-bambu(reads = individual.bam, annotations = bambuAnnot, genome = fa.file, NDR=1, rcOutDir = "outFiles/", discovery = TRUE, stranded=TRUE, opt.discovery = list(min.readFractionByGene = 0.05)) ´´´

The strange thing is that when I check which are na: ´´´ which(is.na(assays(se1a)$counts)) integer(0) which(is.na(assays(se1b)$counts)) integer(0) which(is.na(assays(se2)$counts)) integer(0) ´´´

But I looked a bit further into the R code it itself. The problem seems to be with this part of the code (in plotBambu.R): ´´´ count.data <- count.data[apply(count.data, 1, sd) > quantile(apply(count.data, 1, sd), 0.50), ] ´´´

When I try this code with both the test data that is written in the main gitHub page here and with my own data, I get the same error as I wrote about above. ´´´ Error in quantile.default(apply(count.data, 1, sd), 0.5) : missing values and NaN's not allowed if 'na.rm' is FALSE ´´´

I will see if there is some other way for me to transform the data but, thank you again for helping me troubleshoot =)

Mar 23 '22 08:03 AlinaO90

Hi, sorry for the late reply I was away the last week. I think I have realized the problem. The pca and heatmap methods for plotBambu are only for multi sample analysis (comparing the samples to each other). Each of your runs is only using one bam file (1 sample per bam), so it can't make a PCA plot with only 1 column. To do multiple sample analysis you need to pass multiple bam files to the reads command like below. se <-bambu(reads = c(1.bam, 2.bam), annotations = bambuAnnot, genome = fa.file, NDR=1, rcOutDir = "outFiles/", discovery = TRUE) You could try this with all your individual bams, but the caveat remains regardingly their low sequencing depth. I hope this helps!

Mar 30 '22 03:03 andredsim

Hi,

Good news. I was able to get a dataset that produced your original error 'seqlevels(seqinfo(x))' and 'levels(seqnames(x))' are not identical and was able to track down what was causing it and I am implemented a bug fix which you can access on the no_novel_annotations_crash branch. We will work to get this merged into the master branch as soon as possible, but I believe with this change you should be able to run your individual pacbio samples again. Let me know if this works for you so I can close this issue. I hope this helps!

(If you are still having issue with the plotting please open a new issue for that)

Apr 06 '22 02:04 andredsim