joss-reviews [REVIEW]: MetaGenePipe: An Automated, Portable Pipeline for Contig-based Functional and Taxonomic Analysis

Submitting author: @ParkvilleData (Babak Shaban)
Repository: https://github.com/ParkvilleData/MetaGenePipe/
Branch with paper.md (empty if default branch):
Version: v.1.0.0
Editor: @jmschrei
Reviewers: @Ebedthan, @rjorton
Archive: Pending

Status

Status badge code:

HTML: <a href="https://joss.theoj.org/papers/c9c52942084258507eeb1693b83153ba"><img src="https://joss.theoj.org/papers/c9c52942084258507eeb1693b83153ba/status.svg"></a>
Markdown: [![status](https://joss.theoj.org/papers/c9c52942084258507eeb1693b83153ba/status.svg)](https://joss.theoj.org/papers/c9c52942084258507eeb1693b83153ba)

Reviewers and authors:

Please avoid lengthy details of difficulties in the review thread. Instead, please create a new issue in the target repository and link to those issues (especially acceptance-blockers) by leaving comments in the review thread below. (For completists: if the target issue tracker is also on GitHub, linking the review thread in the issue or vice versa will create corresponding breadcrumb trails in the link target.)

Reviewer instructions & questions

@Ebedthan & @rjorton, your review will be checklist based. Each of you will have a separate checklist that you should update when carrying out your review. First of all you need to run this command in a separate comment to create the checklist:

@editorialbot generate my checklist

The reviewer guidelines are available here: https://joss.readthedocs.io/en/latest/reviewer_guidelines.html. Any questions/concerns please let @jmschrei know.

✨ Please start on your review when you are able, and be sure to complete your review in the next six weeks, at the very latest ✨

Checklists

📝 Checklist for @Ebedthan

Oct 13 '22 16:10 editorialbot

Hello humans, I'm @editorialbot, a robot that can help you with some common editorial tasks.

For a list of things I can do to help you, just type:

@editorialbot commands

For example, to regenerate the paper pdf after making changes in the paper's md or bib files, type:

@editorialbot generate pdf

Oct 13 '22 16:10 editorialbot

Software report:

github.com/AlDanial/cloc v 1.88  T=0.30 s (195.3 files/s, 256489.6 lines/s)
---------------------------------------------------------------------------------------
Language                             files          blank        comment           code
---------------------------------------------------------------------------------------
JSON                                     3             39              0          65992
Python                                  17            750           1028           2209
TeX                                      2             87              0           1090
Perl                                     5            184            230            564
Markdown                                 5            134              0            314
Jupyter Notebook                         4              0           2719            171
Windows Module Definition                1             20              0            126
reStructuredText                         6             86             60            126
YAML                                     3              5              5             63
Bourne Shell                             8             16             17             43
DOS Batch                                1              8              1             26
TOML                                     1              3              0             21
make                                     1              4              7              9
SVG                                      1              0              1              3
---------------------------------------------------------------------------------------
SUM:                                    58           1336           4068          70757
---------------------------------------------------------------------------------------


gitinspector failed to run statistical information for the repository

Oct 13 '22 16:10 editorialbot

Wordcount for paper.md is 1579

Oct 13 '22 16:10 editorialbot

Reference check summary (note 'MISSING' DOIs are suggestions that need verification):

OK DOIs

- 10.1002/jmv.24839 is OK
- 10.1016/j.cell.2018.08.013 is OK
- 10.1038/nmeth.1923 is OK
- 10.1093/bioinformatics/btr507 is OK
- 10.1371/journal.pone.0017288 is OK
- 10.1186/1471-2105-11-119 is OK
- 10.1038/nmeth.3176 is OK
- 10.1093/bioinformatics/btp698 is OK
- 10.1186/1471-2105-10-421 is OK
- 10.1093/nar/25.17.3389 is OK
- 10.3233/WOR-2012-0507-2643 is OK
- 10.3233/wor-2012-0508-2656 is OK
- 10.3233/wor-2012-1032-2661 is OK
- 10.1016/S0022-2836(05)80360-2 is OK
- 10.1093/bioinformatics/btz859 is OK
- 10.1093/nargab/lqaa026 is OK
- 10.1007/978-1-4939-9173-0_6 is OK
- 10.1093/bioinformatics/btv033 is OK
- 10.1093/bioinformatics/bts174 is OK
- 10.1093/nar/28.1.27 is OK
- 10.1002/pro.3715 is OK
- 10.1093/nar/gkaa970 is OK
- 10.1186/gb-2014-15-3-r46 is OK
- 10.5281/zenodo.5127899 is OK
- 10.1093/bioinformatics/btu170 is OK
- 10.1007/978-1-59745-535-0_4 is OK
- 10.1093/bioinformatics/btr174 is OK
- 10.1093/bioinformatics/btp352 is OK
- 10.1093/bioinformatics/btw354 is OK
- 10.1093/bioinformatics/btab184 is OK
- 10.1007/978-1-4939-3369-3_13 is OK
- 10.1101/2021.08.29.458094 is OK
- 10.1186/s12859-020-03585-4 is OK
- 10.1371/journal.pcbi.1008716 is OK
- 10.12688/f1000research.29032.1 is OK
- 10.1038/nbt.3820 is OK
- 10.1371/journal.pone.0177459 is OK
- 10.1093/bioinformatics/btab184 is OK
- 10.7490/f1000research.1114634.1 is OK
- 10.1093/bioinformatics/btz859 is OK
- 10.1093/nar/gky092 is OK
- 10.1038/s41592-021-01101-x is OK
- 10.1093/bioinformatics/bts174 is OK
- 10.1093/bioinformatics/btv033 is OK
- 10.1186/1471-2105-11-119 is OK
- 10.1038/s41598-020-67416-5 is OK
- 10.1038/s41598-020-67416-5 is OK

MISSING DOIs

- None

INVALID DOIs

- None

Oct 13 '22 16:10 editorialbot

Howdy @Ebedthan and @rjorton!

Thanks for agreeing to review this submission.

The process for conducting a review is outlined above. Please run the command shown above to have @editorialbot generate your checklist, which will give a step-by-step process for conducting your review. Please check the boxes during your review to keep track, as well as make comments in this thread or open issues in the repository itself to point out issues you encounter. Keep in mind that our aim is to improve the submission to the point where it is of high enough quality to be accepted, rather than to provide a yes/no decision, and so having a conversation with the authors is encouraged rather than providing a single review post at the end of the process.

Here are the review guidelines: https://joss.readthedocs.io/en/latest/reviewer_guidelines.html And here is a checklist, similar to above: https://joss.readthedocs.io/en/latest/review_checklist.html

Please let me know if you encounter any issues or need any help during the review process, and thanks for contributing your time to JOSS and the open source community!

Oct 13 '22 16:10 jmschrei

:point_right::page_facing_up: Download article proof :page_facing_up: View article proof on GitHub :page_facing_up: :point_left:

Oct 13 '22 16:10 editorialbot

Review checklist for @Ebedthan

Conflict of interest

[x] I confirm that I have read the JOSS conflict of interest (COI) policy and that: I have no COIs with reviewing this work or that any perceived COIs have been waived by JOSS for the purpose of this review.

Code of Conduct

[x] I confirm that I read and will adhere to the JOSS code of conduct.

General checks

[x] Repository: Is the source code for this software available at the https://github.com/ParkvilleData/MetaGenePipe/?
[x] License: Does the repository contain a plain-text LICENSE file with the contents of an OSI approved software license?
[x] Contribution and authorship: Has the submitting author (@ParkvilleData) made major contributions to the software? Does the full list of paper authors seem appropriate and complete?
[x] Substantial scholarly effort: Does this submission meet the scope eligibility described in the JOSS guidelines
[x] Data sharing: If the paper contains original data, data are accessible to the reviewers. If the paper contains no original data, please check this item.
[x] Reproducibility: If the paper contains original results, results are entirely reproducible by reviewers. If the paper contains no original results, please check this item.
[x] Human and animal research: If the paper contains original data research on humans subjects or animals, does it comply with JOSS's human participants research policy and/or animal research policy? If the paper contains no such data, please check this item.

Functionality

[x] Installation: Does installation proceed as outlined in the documentation?
[x] Functionality: Have the functional claims of the software been confirmed?
[x] Performance: If there are any performance claims of the software, have they been confirmed? (If there are no claims, please check off this item.)

Documentation

[x] A statement of need: Do the authors clearly state what problems the software is designed to solve and who the target audience is?
[x] Installation instructions: Is there a clearly-stated list of dependencies? Ideally these should be handled with an automated package management solution.
[x] Example usage: Do the authors include examples of how to use the software (ideally to solve real-world analysis problems).
[x] Functionality documentation: Is the core functionality of the software documented to a satisfactory level (e.g., API method documentation)?
[x] Automated tests: Are there automated tests or manual steps described so that the functionality of the software can be verified?
[x] Community guidelines: Are there clear guidelines for third parties wishing to 1) Contribute to the software 2) Report issues or problems with the software 3) Seek support

Software paper

[x] Summary: Has a clear description of the high-level functionality and purpose of the software for a diverse, non-specialist audience been provided?
[x] A statement of need: Does the paper have a section titled 'Statement of need' that clearly states what problems the software is designed to solve, who the target audience is, and its relation to other work?
[x] State of the field: Do the authors describe how this software compares to other commonly-used packages?
[x] Quality of writing: Is the paper well written (i.e., it does not require editing for structure, language, or writing quality)?
[x] References: Is the list of references complete, and is everything cited appropriately that should be cited (e.g., papers, datasets, software)? Do references in the text use the proper citation syntax?

Oct 15 '22 14:10 Ebedthan

Summary and overall critique

We thank the authors and @ParkvilleData for this article, a valuable study proposing a useful tool for bioinformaticians and computational biologists. I particularly like that the design and implementation account for running the pipeline on servers. I also appreciated the software website. However, the article and the associated code repository have some correctable shortcomings.

Major points

Statement of need: I would suggest the statement of need be shortened. The full statement could then be found on the software website or in the article.
Installation instructions: Although the software website contains all the installation information, I believe the repository README also needs an installation section. This would let a user who is already convinced install the software quickly.
Installation instructions: Java and Singularity, which are required dependencies for MGP, do not have a minimum supported version specified.
Example usage: As with the installation instructions, the repository README should contain an example usage section.
Community guidelines: The community guidelines would benefit from a section in the repository README.

Minor points

Line 13: in its default form -> in its default format
Line 14: which uses -> that uses
Line 16: is freely available and is distributed -> is freely available and distributed
Line 22: which downloads -> that downloads
Line 26: infrastructure -> infrastructures
Line 33: A missing comma: Currently -> Currently,
Line 34: function -> functions
Line 50: best practice -> best practices
Line 52: Unnecessary comma: Singularity containers, and -> Singularity containers and
Line 52 - 53: increases flexibility -> increases the flexibility
Line 78: to facilitate co-assembly -> to facilitate the co-assembly
Line 102: produces an fa file -> produces a Fasta file
Line 102: aminoacid -> amino acid
Line 113: SPARTAN high performance -> SPARTAN high-performance
Add the software website link to the repository's About section so that users can find it easily

Oct 26 '22 20:10 Ebedthan

@Ebedthan thank you for your helpful suggestions. We have incorporated them all into the paper. Would you please be able to check the shortened statement of need to make sure it's clear?

Oct 28 '22 00:10 mariadelmarq

@Ebedthan We've also addressed the other issues you've raised

Oct 28 '22 01:10 rbturnbull

@editorialbot generate pdf

Oct 28 '22 01:10 rbturnbull

:point_right::page_facing_up: Download article proof :page_facing_up: View article proof on GitHub :page_facing_up: :point_left:

Oct 28 '22 01:10 editorialbot

It's a nicely written paper. Below are some comments on the paper; I'm going to download and try the tool next. The points are minor and mostly centre on the "output" of the tool, which isn't described in much detail either in the paper or on the GitHub repo. I think a brief expansion on the main taxonomic output files of the tool would be useful. Specific points:

13 - "output that is useful in its default format" - what is the default format and why is it useful

31-33 - clarify - so MetaGenePipe does not do any nucleotide to nucleotide comparison then - and would potentially miss non-coding sequence classification? Later on (line 74) the BLAST NT db is mentioned

35 - Think it would be good to give a one line explanation on each of these - what are Kegg Brite and KoalaFarm profiles (this could come under explanation of output - see later)

49 - do you have any guidance on how it could be applied to viruses?

75 - the output of MetaGenePipe is not really described in much detail anywhere. Given BLAST output is quite structured - what parsing is happening here? How does this result in more easily searchable data? Are the "no hits" in a separate file?

79-87 - for quantifying relative abundance - how can this be done here? Is a file created that has a list of contigs, their taxonomic assignment and the number of reads mapping? i.e. is the output here the relative abundance of the contig alone - or can it easily be used to calculate abundance at taxon levels

88-96 - so what is output at this step? A contig could have multiple ORFs - is each one evaluated separately - can you link back to the contig - is it just Kegg/Koala IDs or are there taxon name assignments also included - are these specific taxa (i.e. species/strain) or are broader taxonomic paths also included

Table 1 - is the ordering of the table correct - the map reads step is near the end - but the description places it earlier

GitHub Repo - the main output is the Taxon output, which is described very briefly on the GitHub repo as Level 1/2/3 Kegg Brite Hierarchical count (no mention of Koala) - can the specific output format be described in more detail - perhaps example output file(s) from a given metagenomics set could be provided

Oct 28 '22 15:10 rjorton

Hi @rjorton, how are your attempts to download and use the tool going? Are the authors responding to the issues you pointed out above?

Nov 13 '22 23:11 jmschrei

Hi @jmschrei and @rjorton: I apologize for the delay in responding to these very helpful questions and comments. Upon investigating the answers we have found a few bugs in the workflow, and I am in the process of fixing them. It will take a bit longer than usual given that the first author sadly passed away a few weeks ago. I really appreciate your patience as we work through this!

Nov 14 '22 03:11 mariadelmarq

@mariadelmarq I am the AEiC on this track. I just want to offer our condolences on the loss of your colleague. Please let us know how much time you need (or indeed if pursuing publication with JOSS is still possible). I'll pause this submission for now but we can resume whenever you are ready.

Dec 14 '22 14:12 Kevin-Mattheus-Moerman

@Kevin-Mattheus-Moerman @jmschrei Thank you so much for your understanding. We're hoping to reply addressing all of @rjorton's comments before the end of this week.

Dec 18 '22 22:12 mariadelmarq

@editorialbot generate pdf

Dec 23 '22 05:12 mariadelmarq

:point_right::page_facing_up: Download article proof :page_facing_up: View article proof on GitHub :page_facing_up: :point_left:

Dec 23 '22 05:12 editorialbot

@rjorton Thank you once again for your extremely helpful comments and questions! Responses to comments are below. Please let us know if you run into any problems downloading and testing the workflow. Happy holidays to you all!

1. 13 - "output that is useful in its default format" - what is the default format and why is it useful

We have rephrased this part of the summary. We did not consider it advisable to give lots of details on output formats in the summary but have included a whole section on this in the main document and there is an output tree available in the repository readme, which we now refer to in the main text.

2. 31-33 - clarify - so MetaGenePipe does not do any nucleotide to nucleotide comparison then - and would potentially miss non-coding sequence classification? Later on (line 74) the BLAST NT db is mentioned

The BLAST step in the Assembly workflow is indeed at the nucleotide level, but these BLAST outputs are not used for the final taxonomic or functional classifications, which take place within the Gene Prediction subworkflow. For those classifications, only protein-coding sequences are used. We have rephrased the summary to clarify that BLASTn happens at a different stage in the workflow.

3. 35 - Think it would be good to give a one line explanation on each of these - what are Kegg Brite and KoalaFarm profiles (this could come under explanation of output - see later)

The summary has been shortened, but we have addressed this in the main paper.

4. 49 - do you have any guidance on how it could be applied to viruses?

This is a complicated question, as viral coding sequence prediction depends on the host and on whether the virus has an RNA or DNA genome. We now cite this recent comparative analysis of techniques that helps users evaluate what suits their needs best: https://www.biorxiv.org/content/10.1101/2021.12.11.472104v1

5. 75 - the output of MetaGenePipe is not really described in much detail anywhere. Given BLAST output is quite structured - what parsing is happening here? How does this result in more easily searchable data? Are the "no hits" in a separate file?

We have included a new section describing the main outputs and an exhaustive listing in the docs.

6. 79-87 - for quantifying relative abundance - how can this be done here? Is a file created that has a list of contigs, their taxonomic assignment and the number of reads mapping? i.e. is the output here the relative abundance of the contig alone - or can it easily be used to calculate abundance at taxon levels

The mapping step produces SAM/BAM files for each pair of read files. These can be used for downstream metagenome binning applications, or users can run the jgi_summarize_bam_contig_depths tool on them to obtain a list of contigs with read depth metrics.

7. 88-96 - so what is output at this step? A contig could have multiple ORFs - is each one evaluated separately - can you link back to the contig - is it just Kegg/Koala IDs or are there taxon name assignments also included - are these specific taxa (i.e. species/strain) or are broader taxonomic paths also included

The output tables are simply counts of ORFs matching to KEGG IDs (for the functional table) or taxa (for the OTU table). The new output section of the manuscript now specifies this. The taxonomic table does include the broader taxonomic paths.

8. Table 1 - is the ordering of the table correct - the map reads step is near the end - but the description places it earlier

Fixed

9. GitHub Repo - the main output is the Taxon output, which is described very briefly on the GitHub repo as Level 1/2/3 Kegg Brite Hierarchical count (no mention of Koala) - can the specific output format be described in more detail - perhaps example output file(s) from a given metagenomics set could be provided

We have added substantially more information about the outputs in the manuscript and the documentation. See this link for the documentation: https://parkvilledata.github.io/MetaGenePipe/workflow.html#output
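
To make responses 6 and 7 concrete, here is a minimal sketch of how the described outputs could be summarised downstream. It is an illustration only, not code from MetaGenePipe: the function names, the ORF and contig identifiers, and the in-memory dictionaries are all hypothetical, and it assumes the user has already extracted per-contig mean depths (for example from a depth table computed from the BAM files) and per-contig taxon labels. Summing mean depth per taxon is a deliberate simplification that ignores contig length.

```python
from collections import Counter, defaultdict


def count_orfs(assignments):
    """Count ORFs per label (a KEGG ID or a taxon).

    `assignments` maps a hypothetical ORF ID to its label. ORFs with no
    hit are expected to map to None and are skipped.
    """
    return Counter(label for label in assignments.values() if label is not None)


def taxon_relative_abundance(contig_depth, contig_taxon):
    """Sum mean read depth per taxon and normalise to fractions.

    `contig_depth` maps contig name to mean depth; `contig_taxon` maps
    contig name to a taxon label. Contigs without a taxon are ignored.
    This is a simplification: it does not weight by contig length.
    """
    totals = defaultdict(float)
    for contig, depth in contig_depth.items():
        taxon = contig_taxon.get(contig)
        if taxon is not None:
            totals[taxon] += depth
    grand_total = sum(totals.values())
    if grand_total == 0:
        return {}
    return {taxon: depth / grand_total for taxon, depth in totals.items()}


if __name__ == "__main__":
    # Hypothetical ORF-to-KEGG assignments; IDs are made up for illustration.
    orf_to_kegg = {
        "contig1_1": "K00001",
        "contig1_2": "K00002",
        "contig2_1": "K00001",
        "contig2_2": None,  # no hit
    }
    print(count_orfs(orf_to_kegg))

    # Hypothetical per-contig mean depths and taxon paths.
    depths = {"contig1": 30.0, "contig2": 10.0, "contig3": 5.0}
    taxa = {"contig1": "Bacteria;Proteobacteria", "contig2": "Bacteria;Firmicutes"}
    print(taxon_relative_abundance(depths, taxa))
```

Because the taxon labels are kept as full paths in this sketch, the same function could be applied after truncating each path to a chosen rank to get abundance at broader taxonomic levels.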

Dec 23 '22 05:12 mariadelmarq

Hi @Kevin-Mattheus-Moerman, should we take off the 'paused' label now that @mariadelmarq has responded to @rjorton's points and updated the paper? I think everything is ready to go now.

Jan 29 '23 11:01 rbturnbull

@rbturnbull yes the editor @jmschrei can help do that if needed.

Jan 29 '23 12:01 Kevin-Mattheus-Moerman

@Ebedthan do you still have concerns about the paper? If not, would you mind checking the remaining boxes on your review? Thank you!

Jan 30 '23 17:01 jmschrei

@rjorton would you mind generating and filling out the review checklist? Instructions are in the first post.

Jan 30 '23 17:01 jmschrei

Hi @jmschrei, I'm done and everything is good for me. Thank you.

Feb 01 '23 10:02 Ebedthan

@editorialbot check references

Feb 06 '23 17:02 jmschrei

Reference check summary (note 'MISSING' DOIs are suggestions that need verification):

OK DOIs

- 10.1002/jmv.24839 is OK
- 10.1016/j.cell.2018.08.013 is OK
- 10.1038/nmeth.1923 is OK
- 10.1093/bioinformatics/btr507 is OK
- 10.1371/journal.pone.0017288 is OK
- 10.1186/1471-2105-11-119 is OK
- 10.1038/nmeth.3176 is OK
- 10.1093/bioinformatics/btp698 is OK
- 10.1186/1471-2105-10-421 is OK
- 10.1093/nar/25.17.3389 is OK
- 10.3233/WOR-2012-0507-2643 is OK
- 10.3233/wor-2012-0508-2656 is OK
- 10.3233/wor-2012-1032-2661 is OK
- 10.1016/S0022-2836(05)80360-2 is OK
- 10.1093/bioinformatics/btz859 is OK
- 10.1093/nargab/lqaa026 is OK
- 10.1007/978-1-4939-9173-0_6 is OK
- 10.1093/bioinformatics/btv033 is OK
- 10.1093/bioinformatics/bts174 is OK
- 10.1093/nar/28.1.27 is OK
- 10.1002/pro.3715 is OK
- 10.1093/nar/gkaa970 is OK
- 10.1186/gb-2014-15-3-r46 is OK
- 10.5281/zenodo.5127899 is OK
- 10.1093/bioinformatics/btu170 is OK
- 10.1007/978-1-59745-535-0_4 is OK
- 10.1093/bioinformatics/btr174 is OK
- 10.1093/bioinformatics/btp352 is OK
- 10.1093/bioinformatics/btw354 is OK
- 10.1093/bioinformatics/btab184 is OK
- 10.1007/978-1-4939-3369-3_13 is OK
- 10.1093/nargab/lqac007 is OK
- 10.1186/s12859-020-03585-4 is OK
- 10.1371/journal.pcbi.1008716 is OK
- 10.12688/f1000research.29032.1 is OK
- 10.1038/nbt.3820 is OK
- 10.1371/journal.pone.0177459 is OK
- 10.1093/bioinformatics/btab184 is OK
- 10.7490/f1000research.1114634.1 is OK
- 10.1093/bioinformatics/btz859 is OK
- 10.1093/nar/gky092 is OK
- 10.1038/s41592-021-01101-x is OK
- 10.1093/bioinformatics/bts174 is OK
- 10.1186/1471-2105-11-119 is OK
- 10.1038/s41598-020-67416-5 is OK
- 10.1038/s41598-020-67416-5 is OK
- 10.7717/peerj.7359 is OK
- 10.1101/2021.12.11.472104 is OK
- 10.1093/bioinformatics/btv383 is OK

MISSING DOIs

- None

INVALID DOIs

- None

Feb 06 '23 17:02 editorialbot

@editorialbot generate pdf

Feb 06 '23 17:02 jmschrei

:point_right::page_facing_up: Download article proof :page_facing_up: View article proof on GitHub :page_facing_up: :point_left:

Feb 06 '23 17:02 editorialbot

@mariadelmarq can you please provide a version for the software corresponding to this submission, and a DOI for an archive containing the submission and code, e.g. on Zenodo?

Feb 06 '23 17:02 jmschrei
