Immune-Subtype-Clustering Different result between iAtlas Tools and notebook

Hi,

I detected the immune subtypes using iAtlas Tools and immune subtypes notebook for same data(ebpp_test1_1to20.tsv). But the finalResults are different . How can it be?

#####result of notebook###### 1 2 3 4 5 6 Call XY1 0.0000000 0.0000000 0.0312500 0.7265625 0.1562500 0.0859375 4 XY2 0.0000000 0.0000000 0.9375000 0.0000000 0.0234375 0.0390625 3 XY3 0.0078125 0.0000000 0.5781250 0.2578125 0.0312500 0.1250000 3 XY4 0.0078125 0.0156250 0.0000000 0.7500000 0.0390625 0.1875000 4 XY5 0.0000000 0.0078125 0.0625000 0.7109375 0.0937500 0.1250000 4 XY6 0.0078125 0.0234375 0.0000000 0.7500000 0.0000000 0.2187500 4 XY7 0.3046875 0.0625000 0.0000000 0.2109375 0.0000000 0.4218750 6 XY8 0.0234375 0.0000000 0.7968750 0.0156250 0.0390625 0.1250000 3 XY9 0.0156250 0.0234375 0.0000000 0.7812500 0.0234375 0.1562500 4 XY10 0.0156250 0.8984375 0.0000000 0.0156250 0.0000000 0.0703125 2 XY11 0.0078125 0.0156250 0.0000000 0.7500000 0.0078125 0.2187500 4 XY12 0.0000000 0.0000000 0.5000000 0.3828125 0.0781250 0.0390625 3 XY13 0.0000000 0.0000000 0.0000000 0.9531250 0.0078125 0.0390625 4 XY14 0.0078125 0.0156250 0.0000000 0.7656250 0.0156250 0.1953125 4 XY15 0.0000000 0.0000000 0.0000000 0.9453125 0.0156250 0.0390625 4 XY16 0.0078125 0.0000000 0.9453125 0.0000000 0.0078125 0.0390625 3 XY17 0.0390625 0.1875000 0.0000000 0.4687500 0.0000000 0.3046875 4 XY18 0.0156250 0.1171875 0.6250000 0.0546875 0.0000000 0.1875000 3 XY19 0.0078125 0.0156250 0.0000000 0.7812500 0.0390625 0.1562500 4 XY20 0.0468750 0.0234375 0.0000000 0.7031250 0.0000000 0.2265625 4

#####result of tools ######

Apr 17 '19 02:04 Tim0thy1

Hi there, Thank you for letting me know.

You ask "How can it be?" Oh, anything is possible. Is this your own data? Is it something I can download and try?

During the classification, the ensemble of classifiers makes a subtype call for each sample, then there's a step where all the calls must be merged. I think there's some randomness there, if a sample is on the border between clusters, it can bounce back and forth. I used the clue R package for that part.

Or it's a bug. I'll check it out. -dave

On Tue, Apr 16, 2019 at 7:43 PM Tim0thy1 [email protected] wrote:

Hi,

I detected the immune subtypes using iAtlas Tools and immune subtypes notebook. But the finalResults are different . How can it be?

#####result of notebook###### 1 2 3 4 5 6 Call XY1 0.0000000 0.0000000 0.0312500 0.7265625 0.1562500 0.0859375 4 XY2 0.0000000 0.0000000 0.9375000 0.0000000 0.0234375 0.0390625 3 XY3 0.0078125 0.0000000 0.5781250 0.2578125 0.0312500 0.1250000 3 XY4 0.0078125 0.0156250 0.0000000 0.7500000 0.0390625 0.1875000 4 XY5 0.0000000 0.0078125 0.0625000 0.7109375 0.0937500 0.1250000 4 XY6 0.0078125 0.0234375 0.0000000 0.7500000 0.0000000 0.2187500 4 XY7 0.3046875 0.0625000 0.0000000 0.2109375 0.0000000 0.4218750 6 XY8 0.0234375 0.0000000 0.7968750 0.0156250 0.0390625 0.1250000 3 XY9 0.0156250 0.0234375 0.0000000 0.7812500 0.0234375 0.1562500 4 XY10 0.0156250 0.8984375 0.0000000 0.0156250 0.0000000 0.0703125 2 XY11 0.0078125 0.0156250 0.0000000 0.7500000 0.0078125 0.2187500 4 XY12 0.0000000 0.0000000 0.5000000 0.3828125 0.0781250 0.0390625 3 XY13 0.0000000 0.0000000 0.0000000 0.9531250 0.0078125 0.0390625 4 XY14 0.0078125 0.0156250 0.0000000 0.7656250 0.0156250 0.1953125 4 XY15 0.0000000 0.0000000 0.0000000 0.9453125 0.0156250 0.0390625 4 XY16 0.0078125 0.0000000 0.9453125 0.0000000 0.0078125 0.0390625 3 XY17 0.0390625 0.1875000 0.0000000 0.4687500 0.0000000 0.3046875 4 XY18 0.0156250 0.1171875 0.6250000 0.0546875 0.0000000 0.1875000 3 XY19 0.0078125 0.0156250 0.0000000 0.7812500 0.0390625 0.1562500 4 XY20 0.0468750 0.0234375 0.0000000 0.7031250 0.0000000 0.2265625 4

— You are receiving this because you are subscribed to this thread. Reply to this email directly, view it on GitHub https://github.com/Gibbsdavidl/Immune-Subtype-Clustering/issues/3, or mute the thread https://github.com/notifications/unsubscribe-auth/AAiSyQOyKUNRDmtEiTLY4ic5DRYHXdcUks5vhopLgaJpZM4c0E_M .

Apr 17 '19 22:04 Gibbsdavidl

Hi, No, this is example data. ( https://isb-cgc.shinyapps.io/shiny-iatlas/ ) or https://github.com/Gibbsdavidl/Immune-Subtype-Clustering/blob/master/ExtraData/ebpp_test1_1to20.tsv

Apr 18 '19 01:04 Tim0thy1

I upload my R script there. I copy that code from there

############################R code#########################

library(ggplot2) library(clue) library(mclust) library(parallel) library(stringr)

tcgaSubset <- read.table('/home/data/ebppSubset.tsv.bz2', header = T, sep = '\t', stringsAsFactors = F)

#newdata

newDat <- read.table("/home/data/Immune-Subtype-Clustering-master/ExtraData/ebpp_test1_1to20.tsv", sep='\t', header=T, stringsAsFactors = F)

reportedScores <- read.table('~/data/five_signature_mclust_ensemble_results.tsv.gz', sep='\t', header=T, stringsAsFactors = F) rownames(reportedScores) <- reportedScores$AliquotBarcode

load('/home/data/comparative_immuneSigs_geneLists4.rda') source('/home/data/ImmuneSigs68_function.R') source('/home/data/signature_mclust_ensemble.R') load('/home/data/wolf_set_slim1.rda')

zscore.cols2<-function(x){ return((apply(x, 2, function(x) (x - median(na.omit(x)))/sd(na.omit(x))))) }

####Data prep ### #tcga data is already upper quantile normalized tcgaSubset <- log2(tcgaSubset + 1)

##here we'll drop any duplicate genes, and make the rownames ##be gene symbols rownames(newDat) <- newDat[,1] didx <- !duplicated(as.character(rownames(newDat))) dat <- newDat[didx,-1]

##TCGA data was upper quantile normalized (at the 75%) and multiplied by 1000 data.quantileExpressed <- apply(dat, 2, function(x){quantile(x[x>0], 0.75)}) datnorm <- as.data.frame(t( t(dat) / data.quantileExpressed ) ) * 1000

datlog2 <- log2(datnorm+1)

####joining data sets ### sharedGenes <- intersect(rownames(tcgaSubset), rownames(datlog2))

##first median scale each data set newDatSub <- datlog2[sharedGenes,]

##getting things in the right order tcgaSubsetSub <- tcgaSubset[sharedGenes,]

##then join the new data and tcga data at the genes joinDat <- cbind(newDatSub, tcgaSubsetSub)

##pre-batch corrected table, for plotting newdatSamples <- colnames(newDatSub) dat2idx <- 1:(ncol(newDatSub)-1) tcgaidx <- setdiff( (1: (ncol(joinDat)-1)), dat2idx)

sampleIdx <- c(dat2idx, sample(tcgaidx, size=200, replace = F)) preCombat <- joinDat[,sampleIdx] preCombatMelt <- reshape2::melt(preCombat) preCombatMelt$SampleSource <- ifelse(test = preCombatMelt$variable %in% newdatSamples, yes = "New Data", no="TCGA Data")

##batch correction #

library(sva) combatflag <- 0

if (combatflag) { # then batch correction between scores... batch <- c(rep(1,ncol(newDatSub)), rep(2,ncol(tcgaSubsetSub))) modcombat = model.matrix(~1, data=as.data.frame(t(joinDat))) combat_edata = ComBat(dat=joinDat, batch=batch, mod=modcombat, par.prior=TRUE, prior.plots=FALSE, ref.batch = 2) } else { combat_edata = joinDat
}

##then look at the distribution of the data after batch correction postCombat <- combat_edata[,sampleIdx] postCombatMelt <- reshape2::melt(postCombat) postCombatMelt$SampleSource <- ifelse(test = postCombatMelt$variable %in% newdatSamples, yes = "New Data", no="TCGA Data")

##getting the medians of each gene joinMeds <- apply(combat_edata, 1, median, na.rm=T) ##here we median center each gene join_med_scaled <- sweep(combat_edata,1,joinMeds,'-')

ls() rm(dat, dat2idx, data.quantileExpressed, datlog2, datnorm, didx, newDat, newdatSamples, postCombat, postCombatMelt, preCombat, preCombatMelt, sampleIdx, sharedGenes, tcgaidx, tcgaSubsetSub) gc()

library(pryr)

scores <- ImmuneSigs_function(join_med_scaled, sigs1_2_eg2,sigs12_weighted_means, sigs12_module_weights,sigs1_2_names2,sigs1_2_type2)

idx <- c("LIexpression_score", "CSF1_response", "TGFB_score_21050467", "Module3_IFN_score", "CHANG_CORE_SERUM_RESPONSE_UP") scores <- t(scores[idx,]) zscores <- zscore.cols2(scores)

##make cluster calls using the models. calls <- consensusEnsemble(mods2, zscores, 4, 128)

##get the top scoring cluster for each sample maxcalls <- apply(calls$.Data, 1, function(a) which(a == max(a))[1]) names(maxcalls) <- rownames(scores)

##then we'll look at the new vs. old cluster calls for TCGA samples sharedIDs <- intersect(reportedScores$AliquotBarcode, rownames(scores)) t1 <-table(Reported=as.numeric(reportedScores[sharedIDs, 'ClusterModel1']), NewCalls=as.numeric(maxcalls[sharedIDs]))

##then we can align the new calls to calls from the manuscript. reported <- 1:6 optcalls <- 1:6

for (i in reported) {

##for subtype i, where did most of the samples end up?
j <- which(as.numeric(t1[i,]) == max(as.numeric(t1[i,])))
##rename maxcall j <- i
optcalls[i] <- j

}

##these are the re-mapped calls alignedCalls <- sapply(maxcalls, function(a) which(a == optcalls)[1])

##make sure it works on the re-called TCGA data t2 <-table(Reported=as.numeric(reportedScores[sharedIDs, 'ClusterModel1']), NewCalls=as.numeric(alignedCalls[sharedIDs]))

##assemble the results jdx <- match(table=rownames(scores), x=colnames(newDatSub)) # index to new data scores pcalls <- calls$.Data[jdx,] # get that table rownames(pcalls) <- colnames(newDatSub) # name it from the new data pcalls <- pcalls[,optcalls]

pcalls <- cbind(pcalls, data.frame(Call=alignedCalls[jdx])) # bring in the aligned calls pcalls <- cbind(pcalls, zscores[jdx,]) # and the scores

finalResults <- list(AlignedCalls=alignedCalls[jdx], Table=t2, ProbCalls=pcalls)

##In the finalResults, ##$AlignedCalls are the immune subtypes - aligned with the TCGA manuscript ##$Table is the TCGA reported immune subtypes (on TCGA samples) compared to what we just computed ##$ProbCalls shows the probability of being in each subtype and signature scores, Call is the aligned call

load('/home/data/colnames_1to20_ebpp.rda') rownames(reportedScores) <- str_replace_all(reportedScores$AliquotBarcode, pattern = '\.', replacement = '-')

manuscriptCalls <- reportedScores[cx, 'ClusterModel1']

newCalls <- finalResults$AlignedCalls

save(finalResults,file="tmp.Robj")

Apr 18 '19 01:04 Tim0thy1

Hi there, Thank you for letting me know. You ask "How can it be?" Oh, anything is possible. Is this your own data? Is it something I can download and try? During the classification, the ensemble of classifiers makes a subtype call for each sample, then there's a step where all the calls must be merged. I think there's some randomness there, if a sample is on the border between clusters, it can bounce back and forth. I used the clue R package for that part. Or it's a bug. I'll check it out. -dave … On Tue, Apr 16, 2019 at 7:43 PM Tim0thy1 @.***> wrote: Hi, I detected the immune subtypes using iAtlas Tools and immune subtypes notebook. But the finalResults are different . How can it be? #####result of notebook###### 1 2 3 4 5 6 Call XY1 0.0000000 0.0000000 0.0312500 0.7265625 0.1562500 0.0859375 4 XY2 0.0000000 0.0000000 0.9375000 0.0000000 0.0234375 0.0390625 3 XY3 0.0078125 0.0000000 0.5781250 0.2578125 0.0312500 0.1250000 3 XY4 0.0078125 0.0156250 0.0000000 0.7500000 0.0390625 0.1875000 4 XY5 0.0000000 0.0078125 0.0625000 0.7109375 0.0937500 0.1250000 4 XY6 0.0078125 0.0234375 0.0000000 0.7500000 0.0000000 0.2187500 4 XY7 0.3046875 0.0625000 0.0000000 0.2109375 0.0000000 0.4218750 6 XY8 0.0234375 0.0000000 0.7968750 0.0156250 0.0390625 0.1250000 3 XY9 0.0156250 0.0234375 0.0000000 0.7812500 0.0234375 0.1562500 4 XY10 0.0156250 0.8984375 0.0000000 0.0156250 0.0000000 0.0703125 2 XY11 0.0078125 0.0156250 0.0000000 0.7500000 0.0078125 0.2187500 4 XY12 0.0000000 0.0000000 0.5000000 0.3828125 0.0781250 0.0390625 3 XY13 0.0000000 0.0000000 0.0000000 0.9531250 0.0078125 0.0390625 4 XY14 0.0078125 0.0156250 0.0000000 0.7656250 0.0156250 0.1953125 4 XY15 0.0000000 0.0000000 0.0000000 0.9453125 0.0156250 0.0390625 4 XY16 0.0078125 0.0000000 0.9453125 0.0000000 0.0078125 0.0390625 3 XY17 0.0390625 0.1875000 0.0000000 0.4687500 0.0000000 0.3046875 4 XY18 0.0156250 0.1171875 0.6250000 0.0546875 0.0000000 0.1875000 3 XY19 0.0078125 0.0156250 0.0000000 0.7812500 0.0390625 0.1562500 4 XY20 0.0468750 0.0234375 0.0000000 0.7031250 0.0000000 0.2265625 4 — You are receiving this because you are subscribed to this thread. Reply to this email directly, view it on GitHub <#3>, or mute the thread https://github.com/notifications/unsubscribe-auth/AAiSyQOyKUNRDmtEiTLY4ic5DRYHXdcUks5vhopLgaJpZM4c0E_M .

Sorry, my English is not very good. I must have mistaken 'How can it be' meaning.

Apr 18 '19 01:04 Tim0thy1

Your english is good. I read your question as “how is that possible?” ... which is fine. They should give mostly the same answer.

But this project has been very difficult for me.

:-)

On Wed, Apr 17, 2019 at 6:33 PM Tim0thy1 [email protected] wrote:

Hi there, Thank you for letting me know. You ask "How can it be?" Oh, anything is possible. Is this your own data? Is it something I can download and try? During the classification, the ensemble of classifiers makes a subtype call for each sample, then there's a step where all the calls must be merged. I think there's some randomness there, if a sample is on the border between clusters, it can bounce back and forth. I used the clue R package for that part. Or it's a bug. I'll check it out. -dave … <#m_-4400710527188761948_> On Tue, Apr 16, 2019 at 7:43 PM Tim0thy1 @.***> wrote: Hi, I detected the immune subtypes using iAtlas Tools and immune subtypes notebook. But the finalResults are different . How can it be? #####result of notebook###### 1 2 3 4 5 6 Call XY1 0.0000000 0.0000000 0.0312500 0.7265625 0.1562500 0.0859375 4 XY2 0.0000000 0.0000000 0.9375000 0.0000000 0.0234375 0.0390625 3 XY3 0.0078125 0.0000000 0.5781250 0.2578125 0.0312500 0.1250000 3 XY4 0.0078125 0.0156250 0.0000000 0.7500000 0.0390625 0.1875000 4 XY5 0.0000000 0.0078125 0.0625000 0.7109375 0.0937500 0.1250000 4 XY6 0.0078125 0.0234375 0.0000000 0.7500000 0.0000000 0.2187500 4 XY7 0.3046875 0.0625000 0.0000000 0.2109375 0.0000000 0.4218750 6 XY8 0.0234375 0.0000000 0.7968750 0.0156250 0.0390625 0.1250000 3 XY9 0.0156250 0.0234375 0.0000000 0.7812500 0.0234375 0.1562500 4 XY10 0.0156250 0.8984375 0.0000000 0.0156250 0.0000000 0.0703125 2 XY11 0.0078125 0.0156250 0.0000000 0.7500000 0.0078125 0.2187500 4 XY12 0.0000000 0.0000000 0.5000000 0.3828125 0.0781250 0.0390625 3 XY13 0.0000000 0.0000000 0.0000000 0.9531250 0.0078125 0.0390625 4 XY14 0.0078125 0.0156250 0.0000000 0.7656250 0.0156250 0.1953125 4 XY15 0.0000000 0.0000000 0.0000000 0.9453125 0.0156250 0.0390625 4 XY16 0.0078125 0.0000000 0.9453125 0.0000000 0.0078125 0.0390625 3 XY17 0.0390625 0.1875000 0.0000000 0.4687500 0.0000000 0.3046875 4 XY18 0.0156250 0.1171875 0.6250000 0.0546875 0.0000000 0.1875000 3 XY19 0.0078125 0.0156250 0.0000000 0.7812500 0.0390625 0.1562500 4 XY20 0.0468750 0.0234375 0.0000000 0.7031250 0.0000000 0.2265625 4 — You are receiving this because you are subscribed to this thread. Reply to this email directly, view it on GitHub <#3 https://github.com/Gibbsdavidl/Immune-Subtype-Clustering/issues/3>, or mute the thread https://github.com/notifications/unsubscribe-auth/AAiSyQOyKUNRDmtEiTLY4ic5DRYHXdcUks5vhopLgaJpZM4c0E_M .

Sorry, my English is not very good. I must have mistaken 'How can it be' meaning.

— You are receiving this because you commented.

Reply to this email directly, view it on GitHub https://github.com/Gibbsdavidl/Immune-Subtype-Clustering/issues/3#issuecomment-484321767, or mute the thread https://github.com/notifications/unsubscribe-auth/AAEJFSOFE6TZ7XHQRJYCSCTPQ7FWHANCNFSM4HGQJ7GA .

Apr 18 '19 15:04 Gibbsdavidl

@Gibbsdavidl Thanks for your replay. The website tools and code on R notebook.
Which one do you suggest and why?

Apr 22 '19 02:04 Tim0thy1

I just tried it and got only one difference in cluster label between the notebook and webapp (XY20).

One place that can cause differences is in the selection of 'ensemble size'. The clustering is done using a 'model based method'. And in this case, it was an ensemble of models. That can change the results.

They use the same code... shouldn't be many differences except those due to the random nature of combining across the ensemble.

** If you want to run it many times, running it locally would probably be faster.**

Thanks!

Apr 22 '19 22:04 Gibbsdavidl

@Gibbsdavidl Can you send your code to me. I want to campare your codes to my. By the way, I can't connect the webapp server all day...

Apr 23 '19 07:04 Tim0thy1

@Gibbsdavidl Can you get my message?

Apr 29 '19 00:04 Tim0thy1

Hi there, Sorry about the delay. I just used the notebook in the repo with no changes... "Call_immune_subtypes_on_new_data.ipynb https://github.com/Gibbsdavidl/Immune-Subtype-Clustering/blob/master/Notebooks/Call_immune_subtypes_on_new_data.ipynb "

-dave

On Sun, Apr 28, 2019 at 5:59 PM Tim0thy1 [email protected] wrote:

@Gibbsdavidl https://github.com/Gibbsdavidl Can you get my message?

— You are receiving this because you were mentioned. Reply to this email directly, view it on GitHub https://github.com/Gibbsdavidl/Immune-Subtype-Clustering/issues/3#issuecomment-487430435, or mute the thread https://github.com/notifications/unsubscribe-auth/AAEJFSK3YEX6TAOHCEIGBA3PSZB53ANCNFSM4HGQJ7GA .

Apr 29 '19 16:04 Gibbsdavidl

Webapp seems to be working OK today. Thanks for letting us know. -dave

On Tue, Apr 23, 2019 at 12:18 AM Tim0thy1 [email protected] wrote:

@Gibbsdavidl https://github.com/Gibbsdavidl Can you send your code to me. I want to campare your codes to my. By the way, I can't connect the webapp server all day...

— You are receiving this because you were mentioned. Reply to this email directly, view it on GitHub https://github.com/Gibbsdavidl/Immune-Subtype-Clustering/issues/3#issuecomment-485674280, or mute the thread https://github.com/notifications/unsubscribe-auth/AAEJFSP5SZ6IKWCFUXJVPIDPR2Z3BANCNFSM4HGQJ7GA .

Apr 29 '19 16:04 Gibbsdavidl

That is very strange. I will try the code on notebook again.
We can communicate with E-mail next time.

Apr 30 '19 01:04 Tim0thy1

Immune-Subtype-Clustering Immune-Subtype-Clustering copied to clipboard

Different result between iAtlas Tools and notebook

Immune-Subtype-Clustering
Immune-Subtype-Clustering copied to clipboard