QFeatures
QFeatures copied to clipboard
Aggregate features optimisation fix
Hello,
Two weeks ago, we merged PR #224 into master. The goal of this PR was to introduce an optimization trick to QFeatures::aggregateFeatures similar to the optimisation applied to scp::aggregateFeaturesOverAssays (cf. scp#80) that led to a significant increase in performance.
At the time, I didn’t rerun the benchmarks with the updated aggregateFeatures method in QFeatures to check whether the performance gains were comparable to those observed with scp::aggregateFeaturesOverAssays.
However, last week, while benchmarking the optimizations I had made, I realized that the performance boost was much less noticeable with the new implementation of aggregateFeatures. In some cases, it even performed worse than the original implementation.
I then looked into the differences between scp::aggregateFeaturesOverAssays and QFeatures::aggregateFeatures. The main difference is that scp::aggregateFeaturesOverAssays extracts all assays into an ExperimentList object, processes them, and then creates a new QFeatures object. On the other hand, QFeatures::aggregateFeatures operates directly on the QFeatures object.
I re-implemented the QFeatures::aggregateFeatures method to follow the same approach as scp::aggregateFeaturesOverAssays. Benchmarks of this new implementation show that it achieves the same performance boost as the optimized scp::aggregateFeaturesOverAssays.
In fact, the results are even better than expected: when I subset the rowData before aggregation (subsetRowData), the new implementation shows improved performance compared to the original version, even though the optimization trick has no impact in this case. Here aggregation2 is the fixed optimisation while aggregation is the current optimisation present in master.
While re-implementing the method, I came across a specific case where it wasn’t working. When using an adjacencyMatrix, the corresponding column from the rowData is removed, which prevents .get_hits() from establishing any links. I fixed the issue by catching the error and returning an empty instance of Hits when the operation fails. However, I’m not sure this is the best solution, what do you think?
Also, do you think we should display a more informative message during aggregation, similar to what’s done in scp::aggregateFeaturesOverAssays? Something like this:
> library("QFeatures")
> data(feat2)
> aggregateFeatures(feat2, i = 1:3, fcol = "Prot", name = c("1", "2", "3"))
Aggregated: 1/3
Aggregated: 2/3
Aggregated: 3/3
@lgatto @cvanderaa
Hello, I saw you requested my review, but I have very little time right now... Do you want this merged for next release? If so, I'll accept but I won't have read it.
Hello, I saw you requested my review, but I have very little time right now... Do you want this merged for next release? If so, I'll accept but I won't have read it.
Hello, I don't think that we absolutely need this PR for the next release. Let's take some time to review this later.
@cvanderaa @lgatto Here is an reproducible example to show the issue with adjacency matrix in the new implementation.
I will investigate how to fix it (maybe by using a .get_Hits_adjMatrix function).
if (!require("remotes", quietly = TRUE)){
install.packages("remotes")
}
remotes::install_github("leopoldguyot/VerR",
build_manual = TRUE,
build_vignettes = TRUE
)
library(VerR)
# new implementation presented in this PR
envCreate("newQFeat", "leopoldguyot/QFeatures@aggregateFeatures_optimisation")
# current QFeatures implementation
envCreate("baseQFeat", "rformassspectrometry/QFeatures@1b2c1a6cd9500202c969ac30bb422fee89b4ed81")
res <- runInEnv({
library(QFeatures)
data(feat1)
feat1 <- aggregateFeatures(feat1, "psms", "Sequence", name = "peptides")
se <- feat1[["peptides"]]
rowData(se)$Protein[3] <- c("ProtA;ProtB")
adj <- matrix(0, nrow = 3, ncol = 2,
dimnames = list(rownames(se),
c("ProtA", "ProtB")))
adj[1, 1] <- adj[2, 2] <- adj[3, 1:2] <- 1
adjacencyMatrix(se) <- adj
ft <- QFeatures(list(peps = se))
ft <- aggregateFeatures(ft, "peps", "adjacencyMatrix", name = "protsByMat",
fun = MsCoreUtils::colMeansMat)
ft
})
res$baseQFeat@assayLinks[[2]]
AssayLink for assay <protsByMat>
[from:peps|fcol:adjacencyMatrix|hits:4]
res$newQFeat@assayLinks[[2]]
AssayLink for assay <protsByMat>
[from:peps|fcol:adjacencyMatrix|hits:0]
res$newQFeat["ProtA", , ]
Error in .valid_QFeatures_indices(object) : Assay links names are wrong.
Hello @lgatto could you restart the check? The issue with BiocCheck seems to be fixed.
I will recreate a new branch and a PR to facilitate the review process :sweat_smile:.
Fix introduced with #243 PR.
Is it intended that you closed the PR but did not merge it?
Is it intended that you closed the PR but did not merge it?
Yes, I created #243 because this PR became too messy for the review. So all the wanted changes were merged into master with #243.