Contributions icon indicating copy to clipboard operation
Contributions copied to clipboard

demuxmix

Open huklein opened this issue 3 years ago • 21 comments

Update the following URL to point to the GitHub repository of the package you wish to submit to Bioconductor

  • Repository: https://github.com/huklein/demuxmix

Confirm the following by editing each check box to '[x]'

  • [x] I understand that by submitting my package to Bioconductor, the package source and all review commentary are visible to the general public.

  • [x] I have read the Bioconductor Package Submission instructions. My package is consistent with the Bioconductor Package Guidelines.

  • [x] I understand Bioconductor Package Naming Policy and acknowledge Bioconductor may retain use of package name.

  • [x] I understand that a minimum requirement for package acceptance is to pass R CMD check and R CMD BiocCheck with no ERROR or WARNINGS. Passing these checks does not result in automatic acceptance. The package will then undergo a formal review and recommendations for acceptance regarding other Bioconductor standards will be addressed.

  • [x] My package addresses statistical or bioinformatic issues related to the analysis and comprehension of high throughput genomic data.

  • [x] I am committed to the long-term maintenance of my package. This includes monitoring the support site for issues that users may have, subscribing to the bioc-devel mailing list to stay aware of developments in the Bioconductor community, responding promptly to requests for updates from the Core team in response to changes in R or underlying software.

  • [x] I am familiar with the Bioconductor code of conduct and agree to abide by it.

I am familiar with the essential aspects of Bioconductor software management, including:

  • [x] The 'devel' branch for new packages and features.
  • [x] The stable 'release' branch, made available every six months, for bug fixes.
  • [x] Bioconductor version control using Git (optionally via GitHub).

For questions/help about the submission process, including questions about the output of the automatic reports generated by the SPB (Single Package Builder), please use the #package-submission channel of our Community Slack. Follow the link on the home page of the Bioconductor website to sign up.

huklein avatar Jul 19 '22 23:07 huklein

Hi @huklein

Thanks for submitting your package. We are taking a quick look at it and you will hear back from us soon.

The DESCRIPTION file for this package is:

Package: demuxmix
Title: Demultiplexing oligo-barcoded scRNA-seq data using regression mixture models
Version: 0.99.0
Date: 2022-07-19
Description: A package for demultiplexing single-cell sequencing experiments
    of pooled cells that were labeled with barcode oligonucleotides. The
    package provides methods to fit mixture models for a probabilistic
    classification of the cells and to assess the quality and error rates of the
    demultiplexing.
Authors@R: person("Hans-Ulrich", "Klein",
    email = "[email protected]",
    role = c("aut", "cre"),
    comment = c(ORCID = "0000-0002-6382-9428"))
License: Artistic-2.0
Depends: R (>= 3.5)
Imports: stats, MASS, Matrix, ggplot2, gridExtra, methods
Suggests: BiocStyle, knitr, rmarkdown, cowplot
VignetteBuilder: knitr
biocViews: SingleCell, Preprocessing, Classification
BiocType: Software
URL: https://github.com/huklein/demuxmix
Encoding: UTF-8
LazyData: true
RoxygenNote: 7.2.0

bioc-issue-bot avatar Jul 19 '22 23:07 bioc-issue-bot

A reviewer has been assigned to your package. Learn what to expect during the review process.

IMPORTANT: Please read this documentation for setting up remotes to push to git.bioconductor.org. It is required to push a version bump to git.bioconductor.org to trigger a new build.

Bioconductor utilized your github ssh-keys for git.bioconductor.org access. To manage keys and future access you may want to active your Bioconductor Git Credentials Account

bioc-issue-bot avatar Jul 22 '22 18:07 bioc-issue-bot

Dear Package contributor,

This is the automated single package builder at bioconductor.org.

Your package has been built on Linux, Mac, and Windows.

On one or more platforms, the build results were: "ERROR". This may mean there is a problem with the package that you need to fix. Or it may mean that there is a problem with the build system itself.

Please see the build report for more details. This link will be active for 21 days.

Remember: if you submitted your package after July 7th, 2020, when making changes to your repository push to [email protected]:packages/demuxmix to trigger a new build. A quick tutorial for setting up remotes and pushing to upstream can be found here.

bioc-issue-bot avatar Jul 22 '22 18:07 bioc-issue-bot

Received a valid push on git.bioconductor.org; starting a build for commit id: 7ede439d2c3b6c6caa2e9073f90e9ae86392b435

bioc-issue-bot avatar Jul 22 '22 19:07 bioc-issue-bot

Dear Package contributor,

This is the automated single package builder at bioconductor.org.

Your package has been built on Linux, Mac, and Windows.

Congratulations! The package built without errors or warnings on all platforms.

Please see the build report for more details. This link will be active for 21 days.

Remember: if you submitted your package after July 7th, 2020, when making changes to your repository push to [email protected]:packages/demuxmix to trigger a new build. A quick tutorial for setting up remotes and pushing to upstream can be found here.

bioc-issue-bot avatar Jul 22 '22 19:07 bioc-issue-bot

Received a valid push on git.bioconductor.org; starting a build for commit id: e6f1f9ae8c73d878648cdf8fc020a5d2af064f3a

bioc-issue-bot avatar Jul 24 '22 18:07 bioc-issue-bot

Dear Package contributor,

This is the automated single package builder at bioconductor.org.

Your package has been built on Linux, Mac, and Windows.

Congratulations! The package built without errors or warnings on all platforms.

Please see the build report for more details. This link will be active for 21 days.

Remember: if you submitted your package after July 7th, 2020, when making changes to your repository push to [email protected]:packages/demuxmix to trigger a new build. A quick tutorial for setting up remotes and pushing to upstream can be found here.

bioc-issue-bot avatar Jul 24 '22 18:07 bioc-issue-bot

Received a valid push on git.bioconductor.org; starting a build for commit id: a3897aac8fe0f942ac70470edfb44ebd35c423ca

bioc-issue-bot avatar Jul 29 '22 03:07 bioc-issue-bot

Dear Package contributor,

This is the automated single package builder at bioconductor.org.

Your package has been built on Linux, Mac, and Windows.

Congratulations! The package built without errors or warnings on all platforms.

Please see the build report for more details. This link will be active for 21 days.

Remember: if you submitted your package after July 7th, 2020, when making changes to your repository push to [email protected]:packages/demuxmix to trigger a new build. A quick tutorial for setting up remotes and pushing to upstream can be found here.

bioc-issue-bot avatar Jul 29 '22 03:07 bioc-issue-bot

@huklein I'm sorry for the delay in reviewing demuxmix. I've started reviewing the package today and should have a review for you later this week.

PeteHaitch avatar Aug 09 '22 08:08 PeteHaitch

Hi @huklein,

Thank you for submitting demuxmix to Bioconductor. The package is already in great shape and provides an interesting solution for a key step of contemporary scRNA-seq analysis workflows.

For acceptance into Bioconductor, there are a number of Required points, as well as Recommended points, that I would ask you to first please address. Would you please provide line-by-line comments to my initial review so that I know what changes I'm looking for in my re-review.

Cheers, Pete

Required

  • [ ] Using the Bioconductor build system, have you tried building the package with the 'real data' example evaluated and included in the vignette? Did it pass the Bioconductor build system or timeout? I think its inclusion would make a great addition to the vignette.
  • [ ] In the vignette, please add a comparison to DropletUtils::hashedDrops(). This is an (the?) existing Bioconductor-based solution to demultiplexing HTOs. In particular, it is the featured solution in the 'Demultiplexing cell hashes' section of 'Orchestrating Single Cell Analysis with Bioconductor' book. Something along the lines of the following may be suitable, with some discussion and comparison of the results:
suppressPackageStartupMessages(library(demuxmix))
suppressPackageStartupMessages(library(scRNAseq))
suppressPackageStartupMessages(library(DropletUtils))
suppressPackageStartupMessages(library(pheatmap))

# Prepare data -----------------------------------------------------------------

htoExp <- StoeckiusHashingData(type = "mixed")
#> snapshotDate(): 2022-07-22
#> see ?scRNAseq and browseVignettes('scRNAseq') for documentation
#> loading from cache
#> snapshotDate(): 2022-07-22
#> see ?scRNAseq and browseVignettes('scRNAseq') for documentation
#> loading from cache
set.seed(666)
eDrops <- emptyDrops(counts(htoExp))
htoExp <- htoExp[, which(eDrops$FDR <= 0.001)]

rna <- colSums(counts(htoExp) > 0)
hto <- assay(altExp(htoExp))

# demuxmix() -------------------------------------------------------------------

dmm <- demuxmix(hto, rna = rna)
classLabels <- dmmClassify(dmm)

# hashedDrops() ----------------------------------------------------------------

hash.stats <- hashedDrops(hto)

# Comparison -------------------------------------------------------------------

demuxmix <- ifelse(
  classLabels$Type == "singlet", classLabels$HTO, classLabels$Type)
hashedDrops <- ifelse(
  hash.stats$Confident,
  rownames(hto)[hash.stats$Best],
  ifelse(hash.stats$Doublet, "doublet", "unknown"))
  
addmargins(table(demuxmix, hashedDrops), margin = c(1, 2))
#>            hashedDrops
#> demuxmix    doublet HEK_A HEK_B HEK_C K562_A K562_B K562_C KG1_A KG1_B KG1_C
#>   HEK_A           1   569     0     0      0      0      0     0     0     0
#>   HEK_B           1     0   611     0      0      0      0     0     0     0
#>   HEK_C           5     0     0   545      0      0      0     0     0     0
#>   K562_A          0     0     0     0    377      0      0     0     0     0
#>   K562_B          1     0     0     0      0    594      0     0     0     0
#>   K562_C          2     0     0     0      0      0    424     0     0     0
#>   KG1_A           3     0     0     0      0      0      0   620     0     0
#>   KG1_B           0     0     0     0      0      0      0     0   538     0
#>   KG1_C           1     0     0     0      0      0      0     0     0   601
#>   multiplet     378    11     8    15      1     11      3    17    12     3
#>   negative        0     0     0     0      0      0      0     3     7     4
#>   THP1_A          3     0     0     0      0      0      0     0     0     0
#>   THP1_B          0     0     0     0      0      0      0     0     0     0
#>   THP1_C          1     0     0     0      0      0      0     0     0     0
#>   uncertain       0     0     0     0      1      0      0     0     0     0
#>   Sum           396   580   619   560    379    605    427   640   557   608
#>            hashedDrops
#> demuxmix    THP1_A THP1_B THP1_C unknown  Sum
#>   HEK_A          0      0      0      31  601
#>   HEK_B          0      0      0      69  681
#>   HEK_C          0      0      0      13  563
#>   K562_A         0      0      0     156  533
#>   K562_B         0      0      0      20  615
#>   K562_C         0      0      0     140  566
#>   KG1_A          0      0      0      23  646
#>   KG1_B          0      0      0      33  571
#>   KG1_C          0      0      0      69  671
#>   multiplet     18     15     13      70  575
#>   negative       0      0      0     245  259
#>   THP1_A       505      0      0       3  511
#>   THP1_B         0    539      0      23  562
#>   THP1_C         0      0    560      40  601
#>   uncertain      0      0      0       0    1
#>   Sum          523    554    573     935 7956

pheatmap(
  table(demuxmix, hashedDrops),
  color = viridisLite::inferno(101),
  scale = "none",
  cluster_rows = TRUE,
  cluster_cols = TRUE)

  • [ ] Please add a BugReports field to the DESCRIPTION. This usually links to the package's GitHub issues page.
  • [ ] Please add an 'Installation' section to the vignette. Typically, this section would also go in the README.md (if using).

Recommended

  • [ ] Perhaps have a final proofread and run a spellcheck over the vignette and other documentation (e.g., 'divers' should be 'diverse' in vignette).
  • [ ] Consider adding captions to vignette figures.
  • [ ] In the vignette, please set.seed() before emptyDrops() (and prior to any other functions that rely on simulation) to ensure reproducibility of results.
  • [ ] A more computationally efficient method to get the number of expressed genes per droplet is to use colSums(counts(htoExp) > 0); see reprex below.
suppressPackageStartupMessages(library(demuxmix))
suppressPackageStartupMessages(library(scRNAseq))
suppressPackageStartupMessages(library(DropletUtils))

htoExp <- StoeckiusHashingData(type = "mixed")
#> snapshotDate(): 2022-07-22
#> see ?scRNAseq and browseVignettes('scRNAseq') for documentation
#> loading from cache
#> snapshotDate(): 2022-07-22
#> see ?scRNAseq and browseVignettes('scRNAseq') for documentation
#> loading from cache
eDrops <- emptyDrops(counts(htoExp))
htoExp <- htoExp[, which(eDrops$FDR <= 0.001)]

system.time(a <- apply(assay(htoExp) > 0, 2, sum))
#>    user  system elapsed 
#>   7.947   2.496  16.945
system.time(b <- colSums(counts(htoExp) > 0))
#>    user  system elapsed 
#>   0.291   0.134   0.576

identical(a, b)
#> [1] TRUE
  • [ ] When working through the vignette, I opted to use the 'real data' example instead of the simulated data example. In doing so, I found that the output of plotDmmHistogram(dmm) applied to the real data looked rather different to that when applied to the simulated data (see below). Perhaps providing some commentary on interpreting the real data example would be helpful?
suppressPackageStartupMessages(library(demuxmix))
suppressPackageStartupMessages(library(scRNAseq))
suppressPackageStartupMessages(library(DropletUtils))

# Simulated data ---------------------------------------------------------------

set.seed(563425)
class <- rbind(
  c(rep( TRUE, 200), rep(FALSE, 200), rep(FALSE, 200), rep( TRUE, 50)),
  c(rep(FALSE, 200), rep( TRUE, 200), rep(FALSE, 200), rep( TRUE, 50)),
  c(rep(FALSE, 200), rep(FALSE, 200), rep( TRUE, 200), rep(FALSE, 50)))
simdata <- dmmSimulateHto(class)
hto <- simdata$hto
rna <- simdata$rna

dmm <- demuxmix(hto, rna = rna)
plotDmmHistogram(dmm)


# Real data  -------------------------------------------------------------------

htoExp <- StoeckiusHashingData(type = "mixed")
#> snapshotDate(): 2022-07-22
#> see ?scRNAseq and browseVignettes('scRNAseq') for documentation
#> loading from cache
#> snapshotDate(): 2022-07-22
#> see ?scRNAseq and browseVignettes('scRNAseq') for documentation
#> loading from cache
eDrops <- emptyDrops(counts(htoExp))
htoExp <- htoExp[, which(eDrops$FDR <= 0.001)]

rna <- colSums(counts(htoExp) > 0)
hto <- assay(altExp(htoExp))

dmm <- demuxmix(hto, rna = rna)
plotDmmHistogram(dmm)

Created on 2022-08-10 by the reprex package (v2.0.1)

  • [ ] From the vignette, running dmmOverlap(dmm) < 0.03 on the real data suggests some bad or poor performing HTOs (KG1_B and KG1_C), correct? I'm really just clarifying I've correctly understood this function and its output.
#> dmmOverlap(dmm) < 0.03
 HEK_A  HEK_B  HEK_C THP1_A THP1_B THP1_C K562_A K562_B K562_C  KG1_A  KG1_B  KG1_C 
  TRUE   TRUE   TRUE   TRUE   TRUE   TRUE   TRUE   TRUE   TRUE   TRUE  FALSE  FALSE 
  • [ ] The vignette section 'Special usecase: pooling unlabeled with labeled cells' describes an interesting experimental design that I'd not encountered. I wondered if demuxmix could be extended to demultiplex some other niche(?) designs I've seen in my work:
    • Label each sample with a unique combination of k > 1 HTOs. This design can be analysed using the combinations argument of DropletUtils::hashedDrops().
    • Label n samples each with 1 unique HTO and the n+1 sample with all n HTOs. I'm not aware of an existing method designed for demultiplexing this design.
  • [ ] Personally, I found reading the source code a little trick at time due to the formatting. Running BiocCheck::BiocCheck() on the package points to some of the issues and potential solutions, e.g.,
    • Consider shorter line lengths to improve readibility of source code.
    • Consider running styler to get a more consistent formatting of source code.
  • [ ] FYI when constructing a matrix, you can specify row and column names in the constructor via the dimnames argument rather than setting them using rownames<-()/colnames<-() afterwards.
  • [ ] Consider adding a top-level README.md.
  • [ ] For consistency with other functions in demuxmix, consider changing p.acpt()/p.acpt<-() to use snakeCase().
  • [ ] Recommend adding a inst/CITATION file.
  • [ ] Consider adding a package-level man page.
  • [ ] Thank you for adding unit tests. Please take a look at covr::report() to identify lines that aren't being tested and perhaps aim to ensure that critical funcionality is well-tested.

PeteHaitch avatar Aug 10 '22 03:08 PeteHaitch

Hi @PeteHaitch ,

Thanks so much for the detailed review! I will need a couple of days to implement the improvements. Before I start, I have one question regarding the vignette. I agree that the real dataset is a better use case for the vignette. There are two reasons why I still decided to use the simulated data. One argument was the runtime, which you addressed. My other thought was that I did not want to force users to download the relatively large dataset just to install my package. I assume the dataset wouldn't be downloaded on the BioC build system because of caching. Still, it would further slow down the installation process for users and trigger the download of a larger dataset. I am happy to rewrite the vignette, including the comparison with hashedDrops (nice idea!). I just want to ensure it is BioC policy to prioritize more realistic vignettes over lean packages with little network/disk space usage. I wish there were another option to add this use case without automatically running it at the installation.

Best, Hans

huklein avatar Aug 10 '22 21:08 huklein

My understanding is that the data used in the vignette will only be downloaded from ExperimentHub (via scRNAseq) if a user is installing the package from source (and building/installing the vignette as part of that process). Generally, only users on Linux will be installing from source. Users on macOS and Windows (the majority of users) will almost always be installing the binary packages that Bioconductor makes available on those platforms. The binary installation includes the vignette in its rendered form (i.e. HTML in your case), so the user won't have to download the dataset simply to read the vignette. Of course, if they want to run the code in the vignette themselves then they will have to download the data, but that's a one-time cost due to the caching functionality implemented in ExperimentHub.

I'll ask someone from @Bioconductor/core to confirm my understanding is correct and to comment on their preference re the tradeoff of "... a more realistic vignettes [vs.] lean packages with little network/disk space usage". Since the example in your vignette is great at re-using existing BioC data, I would hope that tips the balance towards supporting a more realistic vignette.

PeteHaitch avatar Aug 10 '22 22:08 PeteHaitch

Received a valid push on git.bioconductor.org; starting a build for commit id: ab716dee9c8be4dc06abd1ca32937080a92c336c

bioc-issue-bot avatar Aug 22 '22 02:08 bioc-issue-bot

Dear Package contributor,

This is the automated single package builder at bioconductor.org.

Your package has been built on Linux, Mac, and Windows.

On one or more platforms, the build results were: "WARNINGS". This may mean there is a problem with the package that you need to fix. Or it may mean that there is a problem with the build system itself.

Please see the build report for more details. This link will be active for 21 days.

Remember: if you submitted your package after July 7th, 2020, when making changes to your repository push to [email protected]:packages/demuxmix to trigger a new build. A quick tutorial for setting up remotes and pushing to upstream can be found here.

bioc-issue-bot avatar Aug 22 '22 02:08 bioc-issue-bot

Received a valid push on git.bioconductor.org; starting a build for commit id: 4238d7cca746ee083eec7f3d30486c727af20b5f

bioc-issue-bot avatar Aug 22 '22 02:08 bioc-issue-bot

Dear Package contributor,

This is the automated single package builder at bioconductor.org.

Your package has been built on Linux, Mac, and Windows.

Congratulations! The package built without errors or warnings on all platforms.

Please see the build report for more details. This link will be active for 21 days.

Remember: if you submitted your package after July 7th, 2020, when making changes to your repository push to [email protected]:packages/demuxmix to trigger a new build. A quick tutorial for setting up remotes and pushing to upstream can be found here.

bioc-issue-bot avatar Aug 22 '22 02:08 bioc-issue-bot

Hi @huklein,

Thank you for submitting demuxmix to Bioconductor. The package is already in great shape and provides an interesting solution for a key step of contemporary scRNA-seq analysis workflows.

For acceptance into Bioconductor, there are a number of Required points, as well as Recommended points, that I would ask you to first please address. Would you please provide line-by-line comments to my initial review so that I know what changes I'm looking for in my re-review.

Cheers, Pete

Required

  • [ ] Using the Bioconductor build system, have you tried building the package with the 'real data' example evaluated and included in the vignette? Did it pass the Bioconductor build system or timeout? I think its inclusion would make a great addition to the vignette.
  • [ ] In the vignette, please add a comparison to DropletUtils::hashedDrops(). This is an (the?) existing Bioconductor-based solution to demultiplexing HTOs. In particular, it is the featured solution in the 'Demultiplexing cell hashes' section of 'Orchestrating Single Cell Analysis with Bioconductor' book. Something along the lines of the following may be suitable, with some discussion and comparison of the results:
suppressPackageStartupMessages(library(demuxmix))
suppressPackageStartupMessages(library(scRNAseq))
suppressPackageStartupMessages(library(DropletUtils))
suppressPackageStartupMessages(library(pheatmap))

# Prepare data -----------------------------------------------------------------

htoExp <- StoeckiusHashingData(type = "mixed")
#> snapshotDate(): 2022-07-22
#> see ?scRNAseq and browseVignettes('scRNAseq') for documentation
#> loading from cache
#> snapshotDate(): 2022-07-22
#> see ?scRNAseq and browseVignettes('scRNAseq') for documentation
#> loading from cache
set.seed(666)
eDrops <- emptyDrops(counts(htoExp))
htoExp <- htoExp[, which(eDrops$FDR <= 0.001)]

rna <- colSums(counts(htoExp) > 0)
hto <- assay(altExp(htoExp))

# demuxmix() -------------------------------------------------------------------

dmm <- demuxmix(hto, rna = rna)
classLabels <- dmmClassify(dmm)

# hashedDrops() ----------------------------------------------------------------

hash.stats <- hashedDrops(hto)

# Comparison -------------------------------------------------------------------

demuxmix <- ifelse(
  classLabels$Type == "singlet", classLabels$HTO, classLabels$Type)
hashedDrops <- ifelse(
  hash.stats$Confident,
  rownames(hto)[hash.stats$Best],
  ifelse(hash.stats$Doublet, "doublet", "unknown"))
  
addmargins(table(demuxmix, hashedDrops), margin = c(1, 2))
#>            hashedDrops
#> demuxmix    doublet HEK_A HEK_B HEK_C K562_A K562_B K562_C KG1_A KG1_B KG1_C
#>   HEK_A           1   569     0     0      0      0      0     0     0     0
#>   HEK_B           1     0   611     0      0      0      0     0     0     0
#>   HEK_C           5     0     0   545      0      0      0     0     0     0
#>   K562_A          0     0     0     0    377      0      0     0     0     0
#>   K562_B          1     0     0     0      0    594      0     0     0     0
#>   K562_C          2     0     0     0      0      0    424     0     0     0
#>   KG1_A           3     0     0     0      0      0      0   620     0     0
#>   KG1_B           0     0     0     0      0      0      0     0   538     0
#>   KG1_C           1     0     0     0      0      0      0     0     0   601
#>   multiplet     378    11     8    15      1     11      3    17    12     3
#>   negative        0     0     0     0      0      0      0     3     7     4
#>   THP1_A          3     0     0     0      0      0      0     0     0     0
#>   THP1_B          0     0     0     0      0      0      0     0     0     0
#>   THP1_C          1     0     0     0      0      0      0     0     0     0
#>   uncertain       0     0     0     0      1      0      0     0     0     0
#>   Sum           396   580   619   560    379    605    427   640   557   608
#>            hashedDrops
#> demuxmix    THP1_A THP1_B THP1_C unknown  Sum
#>   HEK_A          0      0      0      31  601
#>   HEK_B          0      0      0      69  681
#>   HEK_C          0      0      0      13  563
#>   K562_A         0      0      0     156  533
#>   K562_B         0      0      0      20  615
#>   K562_C         0      0      0     140  566
#>   KG1_A          0      0      0      23  646
#>   KG1_B          0      0      0      33  571
#>   KG1_C          0      0      0      69  671
#>   multiplet     18     15     13      70  575
#>   negative       0      0      0     245  259
#>   THP1_A       505      0      0       3  511
#>   THP1_B         0    539      0      23  562
#>   THP1_C         0      0    560      40  601
#>   uncertain      0      0      0       0    1
#>   Sum          523    554    573     935 7956

pheatmap(
  table(demuxmix, hashedDrops),
  color = viridisLite::inferno(101),
  scale = "none",
  cluster_rows = TRUE,
  cluster_cols = TRUE)

  • [ ] Please add a BugReports field to the DESCRIPTION. This usually links to the package's GitHub issues page.
  • [ ] Please add an 'Installation' section to the vignette. Typically, this section would also go in the README.md (if using).

Recommended

  • [ ] Perhaps have a final proofread and run a spellcheck over the vignette and other documentation (e.g., 'divers' should be 'diverse' in vignette).
  • [ ] Consider adding captions to vignette figures.
  • [ ] In the vignette, please set.seed() before emptyDrops() (and prior to any other functions that rely on simulation) to ensure reproducibility of results.
  • [ ] A more computationally efficient method to get the number of expressed genes per droplet is to use colSums(counts(htoExp) > 0); see reprex below.
suppressPackageStartupMessages(library(demuxmix))
suppressPackageStartupMessages(library(scRNAseq))
suppressPackageStartupMessages(library(DropletUtils))

htoExp <- StoeckiusHashingData(type = "mixed")
#> snapshotDate(): 2022-07-22
#> see ?scRNAseq and browseVignettes('scRNAseq') for documentation
#> loading from cache
#> snapshotDate(): 2022-07-22
#> see ?scRNAseq and browseVignettes('scRNAseq') for documentation
#> loading from cache
eDrops <- emptyDrops(counts(htoExp))
htoExp <- htoExp[, which(eDrops$FDR <= 0.001)]

system.time(a <- apply(assay(htoExp) > 0, 2, sum))
#>    user  system elapsed 
#>   7.947   2.496  16.945
system.time(b <- colSums(counts(htoExp) > 0))
#>    user  system elapsed 
#>   0.291   0.134   0.576

identical(a, b)
#> [1] TRUE
  • [ ] When working through the vignette, I opted to use the 'real data' example instead of the simulated data example. In doing so, I found that the output of plotDmmHistogram(dmm) applied to the real data looked rather different to that when applied to the simulated data (see below). Perhaps providing some commentary on interpreting the real data example would be helpful?
suppressPackageStartupMessages(library(demuxmix))
suppressPackageStartupMessages(library(scRNAseq))
suppressPackageStartupMessages(library(DropletUtils))

# Simulated data ---------------------------------------------------------------

set.seed(563425)
class <- rbind(
  c(rep( TRUE, 200), rep(FALSE, 200), rep(FALSE, 200), rep( TRUE, 50)),
  c(rep(FALSE, 200), rep( TRUE, 200), rep(FALSE, 200), rep( TRUE, 50)),
  c(rep(FALSE, 200), rep(FALSE, 200), rep( TRUE, 200), rep(FALSE, 50)))
simdata <- dmmSimulateHto(class)
hto <- simdata$hto
rna <- simdata$rna

dmm <- demuxmix(hto, rna = rna)
plotDmmHistogram(dmm)

# Real data  -------------------------------------------------------------------

htoExp <- StoeckiusHashingData(type = "mixed")
#> snapshotDate(): 2022-07-22
#> see ?scRNAseq and browseVignettes('scRNAseq') for documentation
#> loading from cache
#> snapshotDate(): 2022-07-22
#> see ?scRNAseq and browseVignettes('scRNAseq') for documentation
#> loading from cache
eDrops <- emptyDrops(counts(htoExp))
htoExp <- htoExp[, which(eDrops$FDR <= 0.001)]

rna <- colSums(counts(htoExp) > 0)
hto <- assay(altExp(htoExp))

dmm <- demuxmix(hto, rna = rna)
plotDmmHistogram(dmm)

Created on 2022-08-10 by the reprex package (v2.0.1)

  • [ ] From the vignette, running dmmOverlap(dmm) < 0.03 on the real data suggests some bad or poor performing HTOs (KG1_B and KG1_C), correct? I'm really just clarifying I've correctly understood this function and its output.
#> dmmOverlap(dmm) < 0.03
 HEK_A  HEK_B  HEK_C THP1_A THP1_B THP1_C K562_A K562_B K562_C  KG1_A  KG1_B  KG1_C 
  TRUE   TRUE   TRUE   TRUE   TRUE   TRUE   TRUE   TRUE   TRUE   TRUE  FALSE  FALSE 
  • [ ] The vignette section 'Special usecase: pooling unlabeled with labeled cells' describes an interesting experimental design that I'd not encountered. I wondered if demuxmix could be extended to demultiplex some other niche(?) designs I've seen in my work:

    • Label each sample with a unique combination of k > 1 HTOs. This design can be analysed using the combinations argument of DropletUtils::hashedDrops().
    • Label n samples each with 1 unique HTO and the n+1 sample with all n HTOs. I'm not aware of an existing method designed for demultiplexing this design.
  • [ ] Personally, I found reading the source code a little trick at time due to the formatting. Running BiocCheck::BiocCheck() on the package points to some of the issues and potential solutions, e.g.,

    • Consider shorter line lengths to improve readibility of source code.
    • Consider running styler to get a more consistent formatting of source code.
  • [ ] FYI when constructing a matrix, you can specify row and column names in the constructor via the dimnames argument rather than setting them using rownames<-()/colnames<-() afterwards.

  • [ ] Consider adding a top-level README.md.

  • [ ] For consistency with other functions in demuxmix, consider changing p.acpt()/p.acpt<-() to use snakeCase().

  • [ ] Recommend adding a inst/CITATION file.

  • [ ] Consider adding a package-level man page.

  • [ ] Thank you for adding unit tests. Please take a look at covr::report() to identify lines that aren't being tested and perhaps aim to ensure that critical funcionality is well-tested.

huklein avatar Aug 22 '22 03:08 huklein

Dear @huklein ,

We have reopened the issue to continue the review process. Please remember to push a version bump to git.bioconductor.org to trigger a new build.

bioc-issue-bot avatar Aug 22 '22 03:08 bioc-issue-bot

Hi @PeteHaitch,

Thanks again for the helpful feedback! Please see my response to the comments below.

Required

  • [ ] Using the Bioconductor build system, have you tried building the package with the 'real data' example evaluated and included in the vignette? Did it pass the Bioconductor build system or timeout? I think its inclusion would make a great addition to the vignette.

I was thinking about this point a lot. Building the vignette with the real data takes about 2min on my computer. The package shouldn't time out on the build system. In the end, I still prefer to stick with the simulated data as the main dataset in the vignette for the following reasons:

  1. Resources: As a Linux user, I might be more sensitive to long package building times than others, which I experience after every new Bioconductor release. More annoying is that I would have to download and save a relatively large dataset on my PC and on the cluster I work on, even though the dataset would never be used on the latter. I looked at other packages like DropletUtils, and as far as I could see, they use small simulated datasets, too.
  2. Dataset design: The cell line mixture dataset might be better than a simulated dataset, but it is not ideal for showcasing demultiplexing either. The number of samples is untypically large (12 pooled samples). We usually don't pool more than four. Further, the cells are very homogenous within samples (same cell line) and very heterogenous between samples, which is uncommon in most "real" datasets. Thus, it is not a "typical" dataset.
  3. Improved simulated data: I tweaked the simulation a bit to make the simulated data more similar to the real data. Specifically, the third HTO has a lower quality now, similar to the last HTO in the cell line mixture dataset.

I revised the vignette and believe it is very informative now. There is still a small real dataset (CSF dataset) included, and it is still fairly easy to walk through the vignette using the cell line mixture instead of the simulated dataset if a user wants to. Please take a look and let me know what you think. It could quickly switch to the cell line mixture dataset; I am just not convinced it is the best course of action.

  • [ ] In the vignette, please add a comparison to DropletUtils::hashedDrops(). This is an (the?) existing Bioconductor-based solution to demultiplexing HTOs. In particular, it is the featured solution in the 'Demultiplexing cell hashes' section of 'Orchestrating Single Cell Analysis with Bioconductor' book. Something along the lines of the following may be suitable, with some discussion and comparison of the results [...]

Thanks for the suggestion and the code example! I have added a comparison to hashedDrops(). Yes, I believe hashedDrops() is currently the only method for demultiplexing in Bioconductor. There are other R packages, though.

  • [ ] Please add a BugReports field to the DESCRIPTION. This usually links to the package's GitHub issues page.

Done.

  • [ ] Please add an 'Installation' section to the vignette. Typically, this section would also go in the README.md (if using).

Done.

Recommended

  • [ ] Perhaps have a final proofread and run a spellcheck over the vignette and other documentation (e.g., 'divers' should be 'diverse' in vignette).

I re-wrote some sections for better readability and did final proofreading; I hope I found most typos.

  • [ ] Consider adding captions to vignette figures.

Done.

  • [ ] In the vignette, please set.seed() before emptyDrops() (and prior to any other functions that rely on simulation) to ensure reproducibility of results.

Done.

  • [ ] A more computationally efficient method to get the number of expressed genes per droplet is to use colSums(counts(htoExp) > 0); see reprex below [...]

Thanks! I didn't know this. I replaced apply() with colSums() and rowSums() whenever possible throughout the package.

  • [ ] When working through the vignette, I opted to use the 'real data' example instead of the simulated data example. In doing so, I found that the output of plotDmmHistogram(dmm) applied to the real data looked rather different to that when applied to the simulated data (see below). Perhaps providing some commentary on interpreting the real data example would be helpful?

This is a good point. Unfortunately, the histograms can look quite different between real (high-quality) datasets. I added a paragraph explaining the reasons and describing what users should look for. I also added a code block showing how to zoom into the critical part of the histogram.

  • [ ] From the vignette, running dmmOverlap(dmm) < 0.03 on the real data suggests some bad or poor performing HTOs (KG1_B and KG1_C), correct? I'm really just clarifying I've correctly understood this function and its output.

Thanks for the feedback. I downtoned the statement in the revised vignette. From my experience, an overlap < 0.03 suggests high-quality HTO data and will usually result in very few "uncertain" droplets. Of course, it is just a rule of thumb, and I have seen and used datasets with larger overlaps. When I see values exceeding 0.03, I usually give feedback to the wet lab, and we tried to optimize antibody concentrations, washing, etc., but the dataset is still useable (after removing the uncertain droplets). I am usually concerned about values above 0.1. However, such a low overlap (<0.03) may be hard to achieve if working with low-quality tissue. Regarding the cell line mixture dataset, I think the last HTO (HG1_C) is not of optimal quality.

  • [ ] The vignette section 'Special usecase: pooling unlabeled with labeled cells' describes an interesting experimental design that I'd not encountered. I wondered if demuxmix could be extended to demultiplex some other niche(?) designs I've seen in my work:

    • Label each sample with a unique combination of k > 1 HTOs. This design can be analysed using the combinations argument of DropletUtils::hashedDrops().
    • Label n samples each with 1 unique HTO and the n+1 sample with all n HTOs. I'm not aware of an existing method designed for demultiplexing this design.

Interesting. I looked into the combinations argument of hashedDrops(). This could be implemented into the dmmClassify() and summary() methods without any changes to the actual demultiplexing method. The main difference would be the output of summary(), which currently assumes each droplet with more than one HTO is a multiplet. However, demumix can already be used for these special designs: The first column of the data frame returned by dmmClassify() specifies the exact type of multiplet detected, and one can extract any combination of HTOs of interest. (demuxmix simply returns the most likely class - whether it is a singlet or a multiplet.) Therefore, I have not added an additional argument (like combinations), but if these special designs are more common than I think they are, I will add it later.

  • [ ] Personally, I found reading the source code a little trick at time due to the formatting. Running BiocCheck::BiocCheck() on the package points to some of the issues and potential solutions, e.g.,

    • Consider shorter line lengths to improve readibility of source code.
    • Consider running styler to get a more consistent formatting of source code.

I played with the styler package - it has some neat functions. I run it on all my code and hope the style is more consistent with the Bioconductor style now. I also broke up most long lines into shorter ones.

  • [ ] FYI when constructing a matrix, you can specify row and column names in the constructor via the dimnames argument rather than setting them using rownames<-()/colnames<-() afterwards.

Thanks! I use the dimnames argument in the revised code.

Done.

  • [ ] For consistency with other functions in demuxmix, consider changing p.acpt()/p.acpt<-() to use snakeCase().

Done.

The standard citation referencing the R package is fine for now. I am writing up a brief manuscript and will update the citation file when it is ready.

Done.

  • [ ] Thank you for adding unit tests. Please take a look at covr::report() to identify lines that aren't being tested and perhaps aim to ensure that critical funcionality is well-tested.

This is a weak spot of my package. Unfortunately, I started adding unit tests after the package was almost final. I am now familiar with unit tests and see how helpful they can be. For any future package change, I will add a respective unit test.

huklein avatar Aug 22 '22 17:08 huklein

Thanks for responding to the review, @huklein. I'm not sure if I'll have time this week to go through your responses in detail, but you can expect a response in 1-2 weeks.

PeteHaitch avatar Aug 22 '22 23:08 PeteHaitch

Thanks for making the requested changes, @huklein. I'm pleased to accept demuxmix into Bioconductor!

One minor typo to fix:

  • [ ] Missing parenthesis in vignette? "Still, the overlap with the positive component (mean of 4977.8 is ..." should I think be "Still, the overlap with the positive component (mean of 4977.8) is".

PeteHaitch avatar Sep 04 '22 23:09 PeteHaitch

Your package has been accepted. It will be added to the Bioconductor nightly builds.

Thank you for contributing to Bioconductor!

Reviewers for Bioconductor packages are volunteers from the Bioconductor community. If you are interested in becoming a Bioconductor package reviewer, please see Reviewers Expectations.

bioc-issue-bot avatar Sep 04 '22 23:09 bioc-issue-bot

BTW I have been trialling demuxmix on some of my own datasets and have some queries and discussion points that are beyond the scope of this review. I will follow-up these with you on https://github.com/huklein/demuxmix/issues

PeteHaitch avatar Sep 04 '22 23:09 PeteHaitch

Thank you for the helpful package review! I fixed the typos in the vignette and will reply to your additional comments on GitHub later this week.

Cheers, Hans

huklein avatar Sep 07 '22 00:09 huklein

The master branch of your GitHub repository has been added to Bioconductor's git repository.

To use the git.bioconductor.org repository, we need an 'ssh' key to associate with your github user name. If your GitHub account already has ssh public keys (https://github.com/huklein.keys is not empty), then no further steps are required. Otherwise, do the following:

  1. Add an SSH key to your github account
  2. Submit your SSH key to Bioconductor

See further instructions at

https://bioconductor.org/developers/how-to/git/

for working with this repository. See especially

https://bioconductor.org/developers/how-to/git/new-package-workflow/ https://bioconductor.org/developers/how-to/git/sync-existing-repositories/

to keep your GitHub and Bioconductor repositories in sync.

Your package will be included in the next nigthly 'devel' build (check-out from git at about 6 pm Eastern; build completion around 2pm Eastern the next day) at

https://bioconductor.org/checkResults/

(Builds sometimes fail, so ensure that the date stamps on the main landing page are consistent with the addition of your package). Once the package builds successfully, you package will be available for download in the 'Devel' version of Bioconductor using BiocManager::install("demuxmix"). The package 'landing page' will be created at

https://bioconductor.org/packages/demuxmix

If you have any questions, please contact the bioc-devel mailing list (https://stat.ethz.ch/mailman/listinfo/bioc-devel); this issue will not be monitored further.

lshep avatar Sep 09 '22 11:09 lshep