Contributions icon indicating copy to clipboard operation
Contributions copied to clipboard

KMCUT

Open ibkstore opened this issue 2 years ago • 38 comments

Update the following URL to point to the GitHub repository of the package you wish to submit to Bioconductor

  • Repository: https://github.com/ibkstore/kmcut

Confirm the following by editing each check box to '[x]'

  • [ x] I understand that by submitting my package to Bioconductor, the package source and all review commentary are visible to the general public.

  • [ x] I have read the Bioconductor Package Submission instructions. My package is consistent with the Bioconductor Package Guidelines.

  • [ x] I understand Bioconductor Package Naming Policy and acknowledge Bioconductor may retain use of package name.

  • [ x] I understand that a minimum requirement for package acceptance is to pass R CMD check and R CMD BiocCheck with no ERROR or WARNINGS. Passing these checks does not result in automatic acceptance. The package will then undergo a formal review and recommendations for acceptance regarding other Bioconductor standards will be addressed.

  • [ x] My package addresses statistical or bioinformatic issues related to the analysis and comprehension of high throughput genomic data.

  • [x ] I am committed to the long-term maintenance of my package. This includes monitoring the support site for issues that users may have, subscribing to the bioc-devel mailing list to stay aware of developments in the Bioconductor community, responding promptly to requests for updates from the Core team in response to changes in R or underlying software.

  • [ x] I am familiar with the Bioconductor code of conduct and agree to abide by it.

I am familiar with the essential aspects of Bioconductor software management, including:

  • [ x] The 'devel' branch for new packages and features.
  • [x ] The stable 'release' branch, made available every six months, for bug fixes.
  • [ x] Bioconductor version control using Git (optionally via GitHub).

For questions/help about the submission process, including questions about the output of the automatic reports generated by the SPB (Single Package Builder), please use the #package-submission channel of our Community Slack. Follow the link on the home page of the Bioconductor website to sign up.

ibkstore avatar Apr 20 '23 18:04 ibkstore

Hi @ibkstore

Thanks for submitting your package. We are taking a quick look at it and you will hear back from us soon.

The DESCRIPTION file for this package is:

Package: kmcut
Type: Package
Title: Optimized Kaplan Meier analysis and identification and validation of prognostic biomarkers
Version: 0.99.0
Authors@R: c(person(given = "Igor", family =  "Kuznetsov", role = c("aut", "cre"), email = "[email protected]"), person(given = "Javed", family = "Khan", role = "aut", email = "[email protected]")) 
Description: The purpose of the package is to identify prognostic biomarkers and an optimal numeric cutoff for each biomarker that can be used to stratify a group of test subjects (samples) into two sub-groups with significantly different survival (better vs. worse). The package was developed for the analysis of gene expression data, such as RNA-seq. However, it can be used with any quantitative variable that has a sufficiently large proportion of unique values.
License: Artistic-2.0
Encoding: UTF-8
LazyData: false
Imports: 
    data.table,
    stats,
    stringr,
    survival,
    tools,
    pracma, 
    grDevices, 
    graphics, 
    utils,
    doParallel,
    foreach, 
    parallel
RoxygenNote: 7.2.3
Suggests: 
    knitr,
    rmarkdown,
    testthat (>= 3.0.0)
VignetteBuilder: knitr
biocViews: Software, StatisticalMethod, GeneExpression, Survival
Config/testthat/edition: 3

bioc-issue-bot avatar Apr 20 '23 18:04 bioc-issue-bot

Very nice work, and a good vignette. I see data requirements:

A file with right-censored survival data and

A file with gene expression-like features. The sample identifiers in both files must be exactly the same. The package contains built-in example files with survival data and RNA-seq gene expression data that describe 295 neuroblastoma tumor samples (Zhang et al, 2015).

The file with survival data must contain at least three columns labeled ‘sample_id’, ‘stime’, and ‘scens’. Column ‘sample_id’ contains a unique identifier of each sample (test subject) and must be the first column in the file. Column ‘stime’ contains the survival time for each sample. Column ‘scens’ contains the censoring variable for each sample (0 or 1). If other columns are present in the file, they will be ignored. ‘stime’ and ‘scens’ can be in any column in the file, except the first. An example file with survival data is provided with the package, its content can be printed as follows (the output of the code is not provided because it is too long):

Please use SummarizedExperiment to manage your data, it will help coordinate the samples as needed. You can have survival tkime objects in the colData. Unless there are strong arguments against using SummarizedExperiment or a derivate of that class, we really like to see them used, to foster interoperability and take advantage of familiarity to a large user base. See contributions.bioconductor.org for guidelines.

vjcitn avatar May 20 '23 17:05 vjcitn

Very nice work, and a good vignette. I see data requirements:

A file with right-censored survival data and

A file with gene expression-like features. The sample identifiers in both files must be exactly the same. The package contains built-in example files with survival data and RNA-seq gene expression data that describe 295 neuroblastoma tumor samples (Zhang et al, 2015).

The file with survival data must contain at least three columns labeled ‘sample_id’, ‘stime’, and ‘scens’. Column ‘sample_id’ contains a unique identifier of each sample (test subject) and must be the first column in the file. Column ‘stime’ contains the survival time for each sample. Column ‘scens’ contains the censoring variable for each sample (0 or 1). If other columns are present in the file, they will be ignored. ‘stime’ and ‘scens’ can be in any column in the file, except the first. An example file with survival data is provided with the package, its content can be printed as follows (the output of the code is not provided because it is too long):

Please use SummarizedExperiment to manage your data, it will help coordinate the samples as needed. You can have survival tkime objects in the colData. Unless there are strong arguments against using SummarizedExperiment or a derivate of that class, we really like to see them used, to foster interoperability and take advantage of familiarity to a large user base. See contributions.bioconductor.org for guidelines.

Hi Vince,

Thank you for your comments. I apologize for late reply – I was traveling extensively from the end of May until the very end of June and did not check my GitHub thread (I did not receive an email notification about your post). Regarding your comment about the SummarizedExperiment object. The data requirement you are referring to is about two simple input tab-delimited text files. These two files are read into a standard matrix object. We believe that in the context of our package the use of complex SummarizedExperiment objects is not justified because of the simple nature of input files in our package. Below is a more detailed explanation.

Our package was written for researchers engaged in cancer genomics studies with a substantial clinical trials component, such as clinical oncologists, as a major user group. This group of scientists frequently do not have strong programming skills and rely on GUI-based software, such as GraphPad, for survival analysis. They also use Excel for basic manipulation of spreadsheets, such as a simple gene expression table that does not contain any metadata, or a table with key clinical data, such as survival data. Therefore, we designed our package to be as simple as possible to allow “copy-and-paste” approach – the users would copy and paste the example code and just change the input file names to match theirs. The input text file with expression data is very basic and does not contain any metadata, only gene and sample IDs. The input text file with survival data is also very basic. Both files can be easily created and altered in Excel, thus allowing even a beginner to use our package. The formats of these two files also mimic formats used by clinical oncologists. We believe that the SummarizedExperiment object is unnecessarily complex and confusing for such users. Besides, our package does not use any metadata. The expression and survival data are read into simple matrix objects from base R, which are then utilized by the functions.

Best regards, Igor

ibkstore avatar Jul 11 '23 04:07 ibkstore

SummarizedExperiment object is unnecessarily complex and confusing for such users. Besides, our package does not use any metadata. The expression and survival data are read into simple matrix objects from base R. Sorry for delay in responding. Part of the rationale for self-describing classes in Bioconductor is to reduce risk of error arising from spreadsheet manipulations, incomplete labeling, loss of provenance information. Bioconductor practices can help reduce risk of problems like these, though they can't eliminate the risk completely. Current package contributors follow the contribution guidelines to implement these and related practices. Your user base may not find SummarizedExperiments valuable, but other developers/analysts who might want to reuse your open software to enhance their analytic environments will be at a disadvantage if you don't even try to use conventions that have helped the project and its users for two decades. I will pass this into review but you may expect some encouragement to use Bioconductor infrastructure and practices, even if you don't want to expose these to your user base.

vjcitn avatar Jul 28 '23 15:07 vjcitn

I will also add that in general it is normally not overly complicated to incorporate use of SummarizedExperiment in functions, for example, you could continue to support matrix like objects AND SummarizedExperiment objects by checking for the class of the data presented -- and then within functions subset or manipulate objects for the remainder of your code . AND as @vjcitn pointed out there are a lot of benefits to using formalized classes to prevent mistakes and actually make management of data results easier

lshep avatar Jul 28 '23 15:07 lshep

A reviewer has been assigned to your package. Learn what to expect during the review process.

IMPORTANT: Please read this documentation for setting up remotes to push to git.bioconductor.org. It is required to push a version bump to git.bioconductor.org to trigger a new build.

Bioconductor utilized your github ssh-keys for git.bioconductor.org access. To manage keys and future access you may want to active your Bioconductor Git Credentials Account

bioc-issue-bot avatar Aug 04 '23 14:08 bioc-issue-bot

Dear Package contributor,

This is the automated single package builder at bioconductor.org.

Your package has been built on the Bioconductor Build System.

Congratulations! The package built without errors or warnings on all platforms.

Please see the build report for more details.

The following are build products from R CMD build on the Bioconductor Build System: macOS 12.6.5 Monterey: kmcut_0.99.0.tar.gz Linux (Ubuntu 22.04.2 LTS): kmcut_0.99.0.tar.gz

Links above active for 21 days.

Remember: if you submitted your package after July 7th, 2020, when making changes to your repository push to [email protected]:packages/kmcut to trigger a new build. A quick tutorial for setting up remotes and pushing to upstream can be found here.

bioc-issue-bot avatar Aug 04 '23 14:08 bioc-issue-bot

Alright. Let me take a look at kmcut. Sorry for the slow response.

hpages avatar Sep 06 '23 17:09 hpages

Hi @ibkstore,

Sorry again for the delay. Here is some feedback:

  1. The vignette contains the following line:

    Introduction to the KMCUT package
    

    but the name of the package is kmcut (all lower-case). R in general, and package names in particular, are case-sensitive.

  2. Man pages have very long titles, each of them made of several sentences that repeat exactly the content of the Description section. Good man page titles should be short (a few words only).

  3. You seem to misunderstand the purpose of importing other packages in your package. The examples in most man pages start with many library() calls e.g. in ?kmoptpermcut:

    library(survival)
    library(stringr)
    library(data.table)
    library(tools)
    library(pracma)
    library(foreach)
    library(doParallel)
    library(parallel)
    library(kmcut)
    

    Please note that packages that are imported are automatically available to the code inside your package so there's no need to put the extra burden to manually load each of them on the user. For loading the kmcut package itself, a widely accepted convention is to not pollute the examples in the man pages of a given package (say foo) with explicit calls to library(foo). The reason for this convention is that it is considered reasonable to assume that the package is already loaded when the user is looking at its man pages.

  4. Please minimize the dependencies of your package and only import what you actually need to import. For example, the package does not seem to use anything from the stringr package so that dependency is not needed. Please re-evaluate all the other dependencies and make sure to drop those that you don't need.

  5. Also please re-evaluate the need for all the library() calls at the beginning of the vignette (although keeping library(kmcut) is fine there and is actually needed).

  6. Bioconductor recommends against the use of flatcase style for your function names (e.g. kmoptpermcut, transposetable, kmqcut, kmucut, etc...), as it makes the names hard to read and/or remember. Please use snake_case or camelCase instead.

  7. About the file with gene expression-like features. The vignette says: "The first column must contain sample identifiers and and the first row must contain gene identifiers." Isn't it the otherway around?

  8. The kmoptpermcut() function is extremely slow, to a point that seems to make it almost unsuitable for real-world data. For example, calling it on the example_genes_295.txt file provided in the package takes 89 sec on my laptop when n_iter is set to 1000. However, the file only contains 2 rows (i.e. 2 genes) and the man page for kmoptpermcut() recommends to set n_iter to 10,000. This means that using the recommended n_iter on a file with 100 genes will take more than 10 hours!

  9. How useful is the kmoptpermcut() function when called with wpdf=FALSE? In this case, the function will display a series of plot, one after the other, the next one replacing the previous one at a possibly fast pace, with barely enough time for the user to look at them!

  10. If the goal is to make it as simple as possible for clinical oncologists or other scientists without strong programming skills to use the package, then some improvements on how errors are handled would be welcome. For example, I doubt that the targeted audience will be able to make sense of an error message like this one:

    > kmoptscut(fname="survival_data_295.txt", sfname="example_genes_295b.txt")
    Error in order(sdat$sample_id) : argument 1 is not a vector
    
  11. Documenting a function argument like this:

    fname: character vector that specifies the name of the file etc...
    

    without specifying that the character vector must be of length 1 suggests that the character vector could be of arbitrary length. Note that many functions in R that work on character data are vectorized and can handle a character vector of arbitrary length (e.g. tolower() or grepl()) but that doesn't seem to be the case here. Please disambiguate.

  12. Finally note that coordinating right-censored survival data and gene expression-like features, and keeping them aligned when subsetting the data, is exactly what the SummarizedExperiment container is for. In your case, the SummarizedExperiment object would have the feature ids as rownames and the sample ids as colnames, and the right-censored survival time data would be stored in the colData component of the object. All you would need to do is implement a function that loads the input data in a SummarizedExperiment object e.g. load_kmcut_input(). The function would essentially do something like this:

    load_kmcut_input <- function(epath, spath)
    {
        edat <- read.delim(epath, header = TRUE, stringsAsFactors = FALSE)
        sdat <- read.delim(spath, header = TRUE, row.names = 1)
        ...
        ... some sanity checks about 'edat' and 'sdat'
        ...
        assay <- as.matrix(edat[ , -1])
        rownames(assay) <- edat[ , 1]
        coldata <- DataFrame(sdat)
        SummarizedExperiment(list(expr=assay), colData=coldata)
    }
    

    Typical use:

    epath <- system.file("extdata", "example_genes_295.txt", package = "kmcut")
    spath <- system.file("extdata", "survival_data_295.txt", package = "kmcut")
    kmcut_se <- load_kmcut_input(epath, spath)
    kmcut_se
    # class: SummarizedExperiment 
    # dim: 2 295 
    # metadata(0):
    # assays(1): expr
    # rownames(2): MYCN MYH2
    # rowData names(0):
    # colnames(295): SEQC_NB001 SEQC_NB002 ... SEQC_NB491 SEQC_NB492
    # colData names(2): stime scens
    

    Then all functions in the package would take such object as input instead of fname and sfname, making them slightly easier to use and avoiding the need to load the same data from disk again and again.

    Furthermore, the user can now subset the data at will by doing things like kmcut_se["MYH2", c("SEQC_NB002", "SEQC_NB003")] with the guarantee that the right-censored survival data and the gene expression-like data remain aligned at all time, hence avoiding the need of utility functions like extractrows() or extractcolumns(). Every Bioconductor user knows how to perform such basic subsetting operation. This is the true Bioconductor way and it has proven to be a very valuable approach to software design.

Thanks, H.

hpages avatar Sep 12 '23 00:09 hpages

Hi @ibkstore, do you intend to follow up on this submission?

Best, H.

hpages avatar Oct 03 '23 23:10 hpages

Hi Herve, I do intend to follow-up on this submission. However, I am not sure how to proceed – this is my first R package. I am used to feedback from journal article reviewers, when comments are usually divided into major (the ones that must be addressed) and minor (optional). If the submission is rejected, then it is stated explicitly. There is also a specific time limit for revisions. Could you please clarify how the Bioconductor review system works? For instance, what is the time limit? Revision will take a considerable amount of time given the fact that I have heavy teaching and service load now. Can your comments be classified into “mandatory” and “recommended, but optional”? Regarding the kmoptpermcut function: yes, it is very slow and, unfortunately, cannot be made much faster. It estimates the unknown sampling distribution of the test statistics by applying the permutation test, which is very slow. It is recommended that kmoptpermcut be run on at least 16 CPUs (this can be achieved on a low/midrange desktop or high-end laptop). This function is meant to be applied to a relatively small subset of candidate genes identified by the kmoptscut function. Also, a typical number of samples in a dataset supplied to this function is expected to be much smaller than in the example dataset included into the package.

Thank you, Igor

ibkstore avatar Oct 09 '23 04:10 ibkstore

Hi @ibkstore,

Thanks for getting back to me.

I do intend to follow-up on this submission.

Great!

I am used to feedback from journal article reviewers

The review process for Bioconductor software packages is slightly different than with articles submitted to peer-review journals.

If the submission is rejected, then it is stated explicitly.

This submission is still labelled with "review in progress", which means that the package was neither rejected or accepted yet.

Could you please clarify how the Bioconductor review system works?

See https://contributions.bioconductor.org/bioconductor-package-submissions.html#whattoexpect

what is the time limit?

There's no time limit. However we do ask contributors to respond to comment within 2-3 weeks (see above link). Even if they didn't get a chance to address the feedback, acknowledgement that they've received it and/or questions about it are welcome. Submission with no visible activity for a prolonged period of time (e.g. > 1 month) will be considered abandonned and might get closed for inactivity. Note that an issue closed for inactivity doesn't mean that the submission got rejected as the issue can still be re-opened at a later time by the submitter or reviewer.

FWIW this is actually the reason why last week I asked about your intentions. I was considering closing this issue for inactivity but thought I would first ask if you intended to follow up on it.

Can your comments be classified into “mandatory” and “recommended, but optional”?

All the items are mandatory and easy to fix, except items 8 and 12.

About item 8: Trying to speed up kmoptpermcut() would be nice though, if the function is meant to be used on real-world data in reasonable time. However, and to my surprise, it sounds like my naive assumption that real-world data will typically be bigger than the included dataset was wrong (see below). If that's the case, then maybe the speed of kmoptpermcut() is acceptable.

About item 12: @vjcitn and @lshep already pointed the benefits of using a SummarizedExperiment object to present your data to the end user, and to operate on such representation. Item 12 is a follow-up to this discussion. It provides some details about how to do that and what some of the concrete benefits would be. I also wanted to make sure that you're not dismissing this approach too quickly or for the wrong reasons (e.g. "this is too difficult"). Anyways, I will leave this to your consideration. I suppose that adding support for SummarizedExperiment could always be added to kmcut in the future, if there's demand for it.

Also, a typical number of samples in a dataset supplied to this function is expected to be much smaller than in the example dataset included into the package.

Why use an example dataset much bigger than the typical real-world dataset? This makes the code in your examples and vignette unnecessarily slower than it needs to be. Package authors usually do the opposite i.e. they include examples datasets that are smaller than the typical real-world dataset. This approach seems much more sensible!

Best, H.

hpages avatar Oct 11 '23 01:10 hpages

@ibkstore Do you plan on revising the package to the suggestions from @hpages , @vjcitn and myself?

lshep avatar Nov 20 '23 14:11 lshep

@ibkstore Do you plan on revising the package to the suggestions from @hpages , @vjcitn and myself?

Yes, I do plan to revise. I will finish the revisions after this semester ends and I have free time to work on the package.

ibkstore avatar Nov 25 '23 02:11 ibkstore

We will close this issue for now. please comment back here when you are ready to continue and we will reopen the issue.

lshep avatar Dec 27 '23 14:12 lshep

This issue is being closed because there has been no progress for an extended period of time. You may reopen the issue when you have the time to actively participate in the review / submission process. Please also keep in mind that a package accepted to Bioconductor requires a commitment on your part to ongoing maintenance.

Thank you for your interest in Bioconductor.

bioc-issue-bot avatar Dec 27 '23 14:12 bioc-issue-bot

Hello,

I would like to reopen this issue. I created an updated version of the package that addresses all the reviewers' comments. Please advise how to proceed with the upload of the new version and the responses to comments. I am not sure what to do with the new and old versions and with the responses. Do I need to delete the old version and keep the new one and add a file with the responses to comments?

Thank you, Igor

ibkstore avatar Apr 23 '24 03:04 ibkstore

Do the histories of the repositories not match? Did you not work in the same repository as before?

lshep avatar Apr 23 '24 11:04 lshep

I am using the same repository. Does it mean that I can just copy the new version into that repository and reviewers will be able to use the history to access the previous version? I am new to GitHub and do not know how repository history functions and how to check it.

ibkstore avatar Apr 23 '24 20:04 ibkstore

Dear @ibkstore ,

We have reopened the issue to continue the review process. Please remember to push a version bump to git.bioconductor.org to trigger a new build.

bioc-issue-bot avatar Apr 24 '24 11:04 bioc-issue-bot

A reviewer has been assigned to your package for an indepth review. Please respond accordingly to any further comments from the reviewer.

bioc-issue-bot avatar Apr 24 '24 11:04 bioc-issue-bot

You should familiarize yourself with git and git commands. Also this previous comment links to a git workflow for pushing to git.bioconductor.org and indicates that you should set up your BiocCredentials account to control ssh key access.

lshep avatar Apr 24 '24 11:04 lshep

My Bioconductor Git Credentials are set up. I used the sequence of Git commands from the Bioconductor guidelines to push the updated package to git.bioconductor.org I used the following sequence:

git clone [email protected]:packages/kmcut cd kmcut

Copy updated/new package files/folders to the local folder 'kmcut', then

git add * git commit * git push

My responses to reviewer's comments are included in the commit message file. Does this look like a correct sequence? Do I need to perform any additional tasks at this point?

ibkstore avatar Apr 25 '24 04:04 ibkstore

@ibkstore Welcome back!

Does this look like a correct sequence?

No, this is not how you are supposed to resync the kmcut repo at git.bioconductor.org with the repo on GitHub. You seem to have completely missed the link that @lshep gave to you. The link was sending you to this important comment at the beginning of this issue:

IMPORTANT: Please read this documentation for setting up remotes to push to git.bioconductor.org. It is required to push a version bump to git.bioconductor.org to trigger a new build.

Unfortunately the problem with the method you used is that the 2 repos now have divergent histories. As @lshep mentioned, you really need to familiarize yourself with git and git commands. This is an essential skill for being a Bioconductor package maintainer.

Also the git commit message is not really the place to put your responses to my earlier comments. Please put them in this issue. In the future, use the git commit message only to document the changes you are making to your package.

Finally, please bump the version of the package and push in order to trigger a new build.

Thanks, H.

hpages avatar Apr 25 '24 08:04 hpages

Hi Herve, Sorry about the mess. I was confused by the chapter header "21.5 Subversion to Git Transition". Chapter "21 Git Version Control" seemed more appropriate and I used the info from there. Is there anything I can do to "undo" my mistakes before I proceed with Chapter 21.5 steps? Igor

ibkstore avatar Apr 30 '24 02:04 ibkstore

Received a valid push on git.bioconductor.org; starting a build for commit id: d9cd8cf2539e4ab64b4097ac786dfbeb739d7c4c

bioc-issue-bot avatar May 09 '24 21:05 bioc-issue-bot

Dear Package contributor,

This is the automated single package builder at bioconductor.org.

Your package has been built on the Bioconductor Single Package Builder.

On one or more platforms, the build results were: "ERROR". This may mean there is a problem with the package that you need to fix. Or it may mean that there is a problem with the build system itself.

Please see the build report for more details.

The following are build products from R CMD build on the Single Package Builder: Linux (Ubuntu 22.04.3 LTS): kmcut_0.99.1.tar.gz

Links above active for 21 days.

Remember: if you submitted your package after July 7th, 2020, when making changes to your repository push to [email protected]:packages/kmcut to trigger a new build. A quick tutorial for setting up remotes and pushing to upstream can be found here.

bioc-issue-bot avatar May 09 '24 21:05 bioc-issue-bot

Received a valid push on git.bioconductor.org; starting a build for commit id: 0123ac513f8adf257a5491cabd66f720564713ff

bioc-issue-bot avatar May 10 '24 02:05 bioc-issue-bot

Dear Package contributor,

This is the automated single package builder at bioconductor.org.

Your package has been built on the Bioconductor Single Package Builder.

On one or more platforms, the build results were: "ERROR". This may mean there is a problem with the package that you need to fix. Or it may mean that there is a problem with the build system itself.

Please see the build report for more details.

The following are build products from R CMD build on the Single Package Builder: Linux (Ubuntu 22.04.3 LTS): kmcut_0.99.2.tar.gz

Links above active for 21 days.

Remember: if you submitted your package after July 7th, 2020, when making changes to your repository push to [email protected]:packages/kmcut to trigger a new build. A quick tutorial for setting up remotes and pushing to upstream can be found here.

bioc-issue-bot avatar May 10 '24 02:05 bioc-issue-bot

Received a valid push on git.bioconductor.org; starting a build for commit id: afdda1a804e2c4ac211797536645c63bb18d1f71

bioc-issue-bot avatar May 12 '24 01:05 bioc-issue-bot