goseq icon indicating copy to clipboard operation
goseq copied to clipboard

Inconsistent nullp and goseq enrichment results across operating systems

Open daniellembecker opened this issue 8 months ago • 1 comments

Hello,

We are trying to find the root issue of why we are having differing final output results from using the same datasets, parameters, functions, R (4.4.1), RStudio ("2024.04.2+764"), and goseq (1.56.0) package versions. Using the code below, I have tried to replicate the results with the exact same cloned repository on multiple computers and operating systems while checking that all steps are identical when running. However, with this code I am receiving 31 significantly enriched terms, while when someone else clones the repository or the code is used on different operating systems we see there are no significantly enriched terms. This is clearly a major issue and we have tried multiple steps to solve the issue as updating all versions to be the same, trying on multiple systems, cloning and recloning the repo, updating the repo and confirming it has been pushed to Github correctly but we have run out of troubleshooting options. Any ideas or help would be much appreciated!

Generate vector with names of all genes detected in our dataset

ALL.vector <- c(filtered_Pverr.annot$gene_id)

Generate length vector for all genes

LENGTH.vector <- as.integer(filtered_Pverr.annot$length)

Generate vector with names in just the contrast we are analyzing

ID.vector.down <- DEG.res %>% filter(direction=="Downregulated") %>% pull(gene_id)

##Get a list of GO Terms for all genes detected in our dataset GO.terms <- filtered_Pverr.annot%>% dplyr::select(gene_id, GO.ID)

##Format to have one goterm per row with gene ID repeated split <- strsplit(as.character(GO.terms$GO.ID), ";") split2 <- data.frame(v1 = rep.int(GO.terms$gene_id, sapply(split, length)), v2 = unlist(split)) colnames(split2) <- c("gene_id", "GO.ID") GO.terms<-split2

##Construct list of genes with 1 for genes in contrast and 0 for genes not in the contrast gene.vector.down=as.integer(ALL.vector %in% ID.vector.down) names(gene.vector.down)<-ALL.vector#set names

#weight gene vector by bias for length of gene: load in the gene.vector(a list of all genes with 1 indicating it is a DEG in the group of interest and 0 meaning it is not a DEG) and bias.data (list of lengths for all genes) pwf.down<-nullp(gene.vector.down, bias.data=LENGTH.vector)

data(genes)

pwf <- nullp(genes, 'hg19', 'ensGene')

#run goseq using Wallenius method for all categories of GO terms: load in the pwf object (gene name, a 1 or 0 to indicate DEG, the length, and the pwf function; generated from pwf above), and the list of GO terms for all genes GO.wall.down<-goseq(pwf.down, gene2cat=GO.terms, test.cats=c("GO:BP", "GO:MF", "GO:CC"), method="Wallenius", use_genes_without_cat=TRUE)

GO.down <- GO.wall.down[order(GO.wall.down$over_represented_pvalue),]
colnames(GO.down)[1] <- "GOterm"

#adjust p-values 
GO.down$bh_adjust <- p.adjust(GO.down$over_represented_pvalue, method="BH") #add adjusted p-values

#Filtering for p < 0.05
GO.down <- GO.down %>%
    dplyr::filter(bh_adjust<0.05) %>%
    dplyr::arrange(., ontology, bh_adjust)

Output of data from MacOS Monterey version 12.6.6:

Screen Shot 2024-06-21 at 2 44 44 PM Screen Shot 2024-06-21 at 2 51 22 PM

Output of data from macOS Sonoma 14.5:

Screenshot 2024-06-21 at 11 47 53 Screenshot 2024-06-21 at 11 42 10

We also both ran a test dataset: data(genes) pwf <- nullp(genes, ‘hg19’, ‘ensGene’)

Which gave us both this same output: Screenshot 2024-06-21 at 11 58 56

daniellembecker avatar Jun 21 '24 20:06 daniellembecker