
`sourmash` submission


Submitting Author: Tessa Pierce-Ward (@bluegenes)
All current maintainers: @ctb, @luizirber, @bluegenes
Package Name: sourmash
One-Line Description of Package: sourmash is a command line tool and Python library for sketching collections of DNA, RNA, and amino acid k-mers for biological sequence search, comparison, and analysis.
Repository Link: https://github.com/sourmash-bio/sourmash
Version submitted: 4.8
Editor: @snacktavish Emily Jane McTavish
Reviewer 1: @LilyAnderssonLee Lili Andersson-Li
Reviewer 2: @elais Elais Player
Archive: TBD
Version accepted: TBD
Date accepted (month/day/year): TBD


Code of Conduct & Commitment to Maintain Package

Description

  • Include a brief paragraph describing what your package does:

Large collections of genomes, transcriptomes, and raw sequencing data sets are readily available in biology. With the scale of data now available, the field needs lightweight computational methods for searching and summarizing the content of both public and private collections. sourmash implements FracMinHash sketching, a lightweight technique that supports accurate estimation of overlap and containment between two sequencing data sets. sourmash provides a flexible set of programmatic functionality for sequence search and comparison, together with a robust and well-tested command-line interface.
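The core idea is simple enough to sketch in a few lines of Python. Below is a minimal illustration of FracMinHash as described above, not sourmash's actual implementation (sourmash hashes k-mers with MurmurHash64 in its Rust core and canonicalizes each k-mer against its reverse complement, both omitted here); the hash function and parameter values are illustrative.

import hashlib

MAX_HASH = 2**64  # size of the 64-bit hash space

def frac_minhash(sequence, ksize=21, scaled=1000):
    """Keep the ~1/scaled fraction of k-mer hashes that fall below a fixed threshold."""
    threshold = MAX_HASH / scaled
    sketch = set()
    for i in range(len(sequence) - ksize + 1):
        kmer = sequence[i:i + ksize].encode()
        h = int.from_bytes(hashlib.blake2b(kmer, digest_size=8).digest(), "big")
        if h < threshold:
            sketch.add(h)
    return sketch

def containment(a, b):
    """Estimate the fraction of sketch a's k-mers that also appear in sketch b."""
    return len(a & b) / len(a) if a else 0.0

Because every sketch keeps hashes below the same fixed threshold, the overlap between two sketches estimates the overlap between the full k-mer sets, which is what makes containment queries between data sets of very different sizes possible.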

Scope

  • Please indicate which category or categories. Check out our package scope page to learn more about our scope. (If you are unsure of which category you fit, we suggest you make a pre-submission inquiry):

    • [ ] Data retrieval
    • [ ] Data extraction
    • [x] Data processing/munging
    • [ ] Data deposition
    • [ ] Data validation and testing
    • [ ] Data visualization[^1]
    • [ ] Workflow automation
    • [ ] Citation management and bibliometrics
    • [ ] Scientific software wrappers
    • [ ] Database interoperability

Domain Specific & Community Partnerships

  • [ ] Geospatial
  • [ ] Education
  • [ ] Pangeo

Community Partnerships

If your package is associated with an existing community, please check below:

[^1]: Please fill out a pre-submission inquiry before submitting a data visualization package.

  • For all submissions, explain how and why the package falls under the categories you indicated above. In your explanation, please address the following points (briefly, 1-2 sentences for each):

    • Who is the target audience and what are scientific applications of this package?

The target audience for sourmash is biologists looking to compare biological sequencing datasets of any kind. sourmash's FracMinHash sketching produces a small, irreversible representation of each dataset, allowing users to search and compare without compromising private data. We provide a user-friendly CLI with mature functionality, as well as a Python API and an experimental Rust API for computational biologists with advanced use cases.

  • Are there other Python packages that accomplish the same thing? If so, how does yours differ?

There are a number of sketching tools for genomic data (e.g., Mash, kmindex, Dashing), each of which implements a slightly different sketching technique. sourmash's FracMinHash enables accurate comparisons of sets of very different sizes (unlike the standard MinHash implemented in, e.g., Mash), supporting analysis of genomes contained within metagenomes (Irber et al., 2022), taxonomic profiling (Portik et al., 2022), and content-based search of publicly available metagenomes. sourmash additionally supports estimation of Average Nucleotide Identity (ANI) from FracMinHash sketches (Rahman Hera et al., 2023). The sourmash team has focused on maintaining a robust user interface and continually improving functionality and user experience, with roughly monthly software releases.
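To make the programmatic functionality concrete, here is a minimal sketch using the sourmash Python API (MinHash, add_sequence, similarity, contained_by, and containment_ani, per the sourmash documentation; containment_ani requires sourmash >= 4.4). The toy sequences and the scaled=1 setting are illustrative only; real analyses typically use scaled=1000 or higher.

import random
import sourmash  # pip install sourmash

random.seed(42)
seq = "".join(random.choice("ACGT") for _ in range(5000))
variant = seq[:2500] + "".join(random.choice("ACGT") for _ in range(2500))  # shares its first half with seq

# n=0 plus a scaled value selects FracMinHash sketching; scaled=1 keeps every k-mer (toy only)
mh1 = sourmash.MinHash(n=0, ksize=21, scaled=1)
mh2 = sourmash.MinHash(n=0, ksize=21, scaled=1)
mh1.add_sequence(seq)
mh2.add_sequence(variant)

print("Jaccard similarity:", mh1.similarity(mh2))
print("containment of mh1 in mh2:", mh1.contained_by(mh2))
print("estimated ANI from containment:", mh1.containment_ani(mh2).ani)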

  • If you made a pre-submission enquiry, please paste the link to the corresponding issue, forum post, or other discussion, or @tag the editor you contacted:

N/A

Technical checks

For details about the pyOpenSci packaging requirements, see our packaging guide. Confirm each of the following by checking the box. This package:

  • [x] does not violate the Terms of Service of any service it interacts with.
  • [x] uses an OSI approved license.
  • [x] contains a README with instructions for installing the development version.

Note: the development-version installation instructions are in 'developer notes', not the main README.

  • [x] includes documentation with examples for all functions.
  • [x] contains a tutorial with examples of its essential functions and uses.
  • [x] has a test suite.
  • [x] has continuous integration set up, such as GitHub Actions, CircleCI, and/or others.

Publication Options

JOSS Checks
  • [x] The package has an obvious research application according to JOSS's definition in their submission requirements. Be aware that completing the pyOpenSci review process does not guarantee acceptance to JOSS. Be sure to read their submission requirements (linked above) if you are interested in submitting to JOSS.
  • [x] The package is not a "minor utility" as defined by JOSS's submission requirements: "Minor ‘utility’ packages, including ‘thin’ API clients, are not acceptable." pyOpenSci welcomes these packages under "Data Retrieval", but JOSS has slightly different criteria.
  • [x] The package contains a paper.md matching JOSS's requirements with a high-level description in the package root or in inst/.
  • [x] The package is deposited in a long-term repository with the DOI: 10.21105/joss.00027

Note: JOSS accepts our review as theirs. You will NOT need to go through another full review. JOSS will only review your paper.md file. Be sure to link to this pyOpenSci issue when a JOSS issue is opened for your package. Also be sure to tell the JOSS editor that this is a pyOpenSci reviewed package once you reach this step.

Are you OK with Reviewers Submitting Issues and/or pull requests to your Repo Directly?

This option will allow reviewers to open smaller issues that can then be linked to PRs, rather than submitting a denser text-based review. It will also allow you to demonstrate that you have addressed each issue via PR links.

  • [x] Yes I am OK with reviewers submitting requested changes as issues to my repo. Reviewers will then link to the issues in their submitted review.

Confirm each of the following by checking the box.

  • [x] I have read the author guide.
  • [x] I expect to maintain this package for at least 2 years and can help find a replacement for the maintainer (team) if needed.

Please fill out our survey

P.S. Have feedback/comments about our review process? Leave a comment here

Editor and Review Templates

The editor template can be found here.

The review template can be found here.

bluegenes avatar Aug 14 '23 21:08 bluegenes

Thank you @bluegenes for this detailed submission! I am adding initial checks here:

  • [x] Installation The package can be installed from a community repository such as PyPI (preferred), and/or a community channel on conda (e.g. conda-forge, bioconda).
    • [x] The package imports properly into a standard Python environment import package-name.
  • [x] Fit The package meets criteria for fit and overlap.
  • [x] Documentation The package has sufficient online documentation to allow us to evaluate package function and scope without installing the package. This includes:
    • [x] User-facing documentation that overviews how to install and start using the package.
    • [x] Short tutorials that help a user understand how to use the package and what it can do for them.
    • [x] API documentation (documentation for your code's functions, classes, methods and attributes): this includes clearly written docstrings with variables defined using a standard docstring format.
  • [x] Core GitHub repository Files
    • [x] README The package has a README.md file with clear explanation of what the package does, instructions on how to install it, and a link to development instructions.
    • [x] Contributing File The package has a CONTRIBUTING.md file that details how to install and contribute to the package.
    • [x] Code of Conduct The package has a Code of Conduct file.
    • [x] License The package has an OSI approved license. NOTE: We prefer that you have development instructions in your documentation too.
  • [x] Issue Submission Documentation All of the information is filled out in the YAML header of the issue (located at the top of the issue template).
  • [x] Automated tests Package has a testing suite and is tested via a Continuous Integration service.
  • [x] Repository The repository link resolves correctly.
  • [x] Package overlap The package doesn't entirely overlap with the functionality of other packages that have already been submitted to pyOpenSci.
  • [x] Archive (JOSS only, may be post-review): The repository DOI resolves correctly.
  • [x] Version (JOSS only, may be post-review): Does the release version given match the GitHub release (v1.0.0)?

  • [ ] Initial onboarding survey was filled out We appreciate each maintainer of the package filling out this survey individually. :raised_hands: Thank you authors in advance for setting aside five to ten minutes to do this. It truly helps our organization. :raised_hands:

NickleDave avatar Aug 15 '23 19:08 NickleDave

@bluegenes @ctb @luizirber when you have a chance, can all three of you please fill out the pre-review survey linked above?

In the meantime I will proceed with finding an editor 🙂

NickleDave avatar Aug 15 '23 19:08 NickleDave

Sorry for the delay! I am signed on as a guest editor and will proceed with finding reviewers. Looking forward to checking out your package.

snacktavish avatar Oct 02 '23 22:10 snacktavish

I have contacted some reviewers! Hope to have reviewers onboard soon.

snacktavish avatar Oct 03 '23 23:10 snacktavish

Awesome, thank you for the updates @snacktavish, we really appreciate you acting as editor for this one. Ok I'll let you take over now :zipper_mouth_face:

NickleDave avatar Oct 04 '23 00:10 NickleDave

Waiting on responses from reviewers - got some No's, but also some recommendations of possible reviewers. Asking more folks and hoping for some yes's!

snacktavish avatar Oct 17 '23 19:10 snacktavish

Emailing some more potential reviewers today! @bluegenes @ctb @luizirber do you have any reviewer suggestions?

snacktavish avatar Oct 25 '23 20:10 snacktavish

@snacktavish would it be helpful for us to post about this specific review on social? And then we also have a connection with the single cell community, if that might be helpful. Finally, we have a reviewer spreadsheet that might be helpful - people sign up to review for us. If you join us in our pyOpenSci Slack we can chat and provide some additional resources that may help in this search for reviewers. You can email me at leah at pyopensci.org ✨

lwasser avatar Oct 27 '23 01:10 lwasser

Great! We have @LilyAnderssonLee signed on as a reviewer, and she will get started next week. I also have a few more emails out to other folks, hopefully will have a second on-boarded shortly as well!

snacktavish avatar Nov 06 '23 16:11 snacktavish

Editor comments

:wave: Hi @LilyAnderssonLee and @elais! Thank you for volunteering to review for pyOpenSci!

Please fill out our pre-review survey

Before beginning your review, please fill out our pre-review survey. This helps us improve all aspects of our review and better understand our community. No personal data will be shared from this survey - it will only be used in an aggregated format by our Executive Director to improve our processes and programs.

  • [x] @LilyAnderssonLee survey completed.
  • [x] @elais survey completed.

The following resources will help you complete your review:

  1. Here is the reviewers guide. This guide contains all of the steps and information needed to complete your review.
  2. Here is the review template that you will need to fill out and submit here as a comment, once your review is complete.

Please get in touch with any questions or concerns! Your review is due: Friday, December 8

snacktavish avatar Nov 15 '23 05:11 snacktavish

Hi @LilyAnderssonLee and @elais - we would appreciate getting this review at the end of next week! Thank you,

snacktavish avatar Dec 01 '23 22:12 snacktavish

Package Review

Please check off boxes as applicable, and elaborate in comments below. Your review is not limited to these topics, as described in the reviewer guide

  • [x] As the reviewer I confirm that there are no conflicts of interest for me to review this work (If you are unsure whether you are in conflict, please speak to your editor before starting your review).

Documentation

The package includes all the following forms of documentation:

  • [x] A statement of need clearly stating problems the software is designed to solve and its target audience in README.
  • [x] Installation instructions: for the development version of the package and any non-standard dependencies in README.
  • [x] Vignette(s) demonstrating major functionality that runs successfully locally.
  • [x] Function Documentation: for all user-facing functions.
  • [x] Examples for all user-facing functions.
  • [x] Community guidelines including contribution guidelines in the README or CONTRIBUTING.
  • [x] Metadata including author(s), author e-mail(s), a url, and any other relevant metadata e.g., in a pyproject.toml file or elsewhere.

Readme file requirements

The package meets the readme requirements below:

  • [x] Package has a README.md file in the root directory.

The README should include, from top to bottom:

  • [x] The package name
  • [ ] Badges for:
    • [x] Continuous integration and test coverage,
    • [x] Docs building (if you have a documentation website),
    • [ ] A repostatus.org badge,
    • [ ] Python versions supported,
    • [x] Current package version (on PyPI / Conda).

NOTE: If the README has many more badges, you might want to consider using a table for badges: see this example. Such a table should be wider than it is high. (Note that a badge for pyOpenSci peer review will be provided upon acceptance.)

  • [x] Short description of package goals.
  • [x] Package installation instructions
  • [x] Any additional setup required to use the package (authentication tokens, etc.)
  • [x] Descriptive links to all vignettes. If the package is small, there may only be a need for one vignette which could be placed in the README.md file.
    • [x] Brief demonstration of package usage (as it makes sense - links to vignettes could also suffice here if package description is clear)
  • [x] Link to your documentation website.
  • [ ] If applicable, how the package compares to other similar packages and/or how it relates to other packages in the scientific ecosystem.
  • [x] Citation information

Usability

Reviewers are encouraged to submit suggestions (or pull requests) that will improve the usability of the package as a whole. Package structure should follow general community best-practices. In general please consider whether:

  • [x] Package documentation is clear and easy to find and use.
  • [x] The need for the package is clear
  • [x] All functions have documentation and associated examples for use
  • [x] The package is easy to install

Functionality

  • [x] Installation: Installation succeeds as documented.
  • [x] Functionality: Any functional claims of the software have been confirmed.
  • [x] Performance: Any performance claims of the software have been confirmed.
  • [x] Automated tests:
    • [x] All tests pass on the reviewer's local machine for the package version submitted by the author. Ideally this should be a tagged version making it easy for reviewers to install.
    • [x] Tests cover essential functions of the package and a reasonable range of inputs and conditions.
  • [x] Continuous Integration: Has continuous integration setup (We suggest using Github actions but any CI platform is acceptable for review)
  • [ ] Packaging guidelines: The package conforms to the pyOpenSci packaging guidelines. A few notable highlights to look at:
    • [x] Package supports modern versions of Python and not End of life versions.
    • [ ] Code format is standard throughout package and follows PEP 8 guidelines (CI tests for linting pass)

For packages also submitting to JOSS

  • [x] The package has an obvious research application according to JOSS's definition in their submission requirements.

Note: Be sure to check this carefully, as JOSS's submission requirements and scope differ from pyOpenSci's in terms of what types of packages are accepted.

The package contains a paper.md matching JOSS's requirements with:

  • [x] A short summary describing the high-level functionality of the software
  • [x] Authors: A list of authors with their affiliations
  • [x] A statement of need clearly stating problems the software is designed to solve and its target audience.
  • [ ] References: With DOIs for all those that have one (e.g. papers, datasets, software).

Final approval (post-review)

  • [ ] The author has responded to my review and made changes to my satisfaction. I recommend approving this package.

Estimated hours spent reviewing:


Review Comments

@bluegenes Outstanding work with sourmash! Your commitment to creating a package that's both easily maintainable and well-documented truly shines through. The code is impressively organized, accompanied by clear comments explaining each section, making it easy to comprehend the purpose of each file and function. Furthermore, the documentation for high-level files and directories is exceptionally clear — I intend to adopt this clarity in my own work. I've noted a few minor comments for your consideration.

    1. mamba was not callable after installing miniforge3 on my Mac OS.
    2. Is it possible to include arguments to allow users to adjust the legends of both matrix and dendrogram plots in the function mentioned in the example: sourmash plot --pdf --labels ecoli_cmp?
    3. Regarding benchmarks/benchmarks.py:
        1. Unused libraries: os & Path
        2. Magic numbers: There are some magic numbers in the code (e.g., 500, 21, 3000, 10000, 1000, 20). Is it possible to define these numbers as constants with meaningful names, or add comments to make the code more self-explanatory?
    4. The download link doesn't work in the section #Building your own LCA database. The correct link should be curl -L https://osf.io/bw8d7/download -o delmont-subsample-sigs.tar.gz
    5. Regarding test/test_sourmash.py: refactor duplicate code into helper functions to avoid redundancy and improve maintainability. For instance:
# assumes the test module's existing imports: os, glob, numpy, sourmash, and the 'utils' helpers
def load_signatures(testsigs, ksize=21, moltype='dna'):
    """Load signatures from files."""
    sigs = []
    for fn in testsigs:
        sigs.append(sourmash.load_one_signature(fn, ksize=ksize, select_moltype=moltype))
    return sigs

def test_compare_serial(runtmp):
    """Test compare command serially."""
    c = runtmp

    testsigs = utils.get_test_data('genome-s1*.sig')
    testsigs = glob.glob(testsigs)

    c.run_sourmash('compare', '-o', 'cmp', '-k', '21', '--dna', *testsigs)

    cmp_outfile = c.output('cmp')
    assert os.path.exists(cmp_outfile)
    cmp_out = numpy.load(cmp_outfile)

    sigs = load_signatures(testsigs)

    cmp_calc = numpy.zeros([len(sigs), len(sigs)])
    for i, si in enumerate(sigs):
        for j, sj in enumerate(sigs):
            cmp_calc[i][j] = si.similarity(sj)

    assert (cmp_out == cmp_calc).all()

    6. I may have overlooked it in your documentation. It would be helpful if you could provide some recommendations for the selection of k-mer sizes and scaled values.
    7. What are the consequences of using different scaled values? Do you recommend using the same scaled value to generate the signatures for genomes and databases? (A sketch of the comparison I mean follows this list.)
sourmash gather ecoli-genome.sig inter/genbank-k31.lca.json.gz 
WARNING: final scaled was 10000, vs query scaled of 1000
    8. Is it possible to add explanations and default values for --threshold in the help message of sourmash lca classify -h?
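To make question 7 concrete, here is the kind of comparison I mean, sketched with the Python API (assuming the MinHash.downsample method as described in the sourmash documentation; the sequences and parameter values are illustrative):

import random
import sourmash

random.seed(0)
genome = "".join(random.choice("ACGT") for _ in range(100_000))

query = sourmash.MinHash(n=0, ksize=31, scaled=1000)
db = sourmash.MinHash(n=0, ksize=31, scaled=10000)
query.add_sequence(genome)
db.add_sequence(genome)

# comparisons need matching scaled values (or explicit downsampling),
# so downsample the finer scaled=1000 sketch to the coarser scaled=10000:
query_ds = query.downsample(scaled=10000)
print("containment at scaled=10000:", query_ds.contained_by(db))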

LilyAnderssonLee avatar Dec 04 '23 10:12 LilyAnderssonLee

thanks for this review! We'll get back to you shortly ;)

ctb avatar Dec 06 '23 14:12 ctb

Thanks @LilyAnderssonLee!

snacktavish avatar Dec 11 '23 23:12 snacktavish

Package Review

Please check off boxes as applicable, and elaborate in comments below. Your review is not limited to these topics, as described in the reviewer guide

  • [X] As the reviewer I confirm that there are no conflicts of interest for me to review this work (If you are unsure whether you are in conflict, please speak to your editor before starting your review).

Documentation

The package includes all the following forms of documentation:

  • [X] A statement of need clearly stating problems the software is designed to solve and its target audience in README.
  • [X] Installation instructions: for the development version of the package and any non-standard dependencies in README.
  • [X] Vignette(s) demonstrating major functionality that runs successfully locally.
  • [X] Function Documentation: for all user-facing functions.
  • [X] Examples for all user-facing functions.
  • [X] Community guidelines including contribution guidelines in the README or CONTRIBUTING.
  • [X] Metadata including author(s), author e-mail(s), a url, and any other relevant metadata e.g., in a pyproject.toml file or elsewhere.

Readme file requirements

The package meets the readme requirements below:

  • [X] Package has a README.md file in the root directory.

The README should include, from top to bottom:

  • [X] The package name
  • [ ] Badges for:
    • [X] Continuous integration and test coverage,
    • [X] Docs building (if you have a documentation website),
    • [ ] A repostatus.org badge,
    • [ ] Python versions supported,
    • [X] Current package version (on PyPI / Conda).

NOTE: If the README has many more badges, you might want to consider using a table for badges: see this example. Such a table should be wider than it is high. (Note that a badge for pyOpenSci peer review will be provided upon acceptance.)

  • [X] Short description of package goals.
  • [X] Package installation instructions
  • [X] Any additional setup required to use the package (authentication tokens, etc.)
  • [X] Descriptive links to all vignettes. If the package is small, there may only be a need for one vignette which could be placed in the README.md file.
    • [X] Brief demonstration of package usage (as it makes sense - links to vignettes could also suffice here if package description is clear)
  • [X] Link to your documentation website.
  • [ ] If applicable, how the package compares to other similar packages and/or how it relates to other packages in the scientific ecosystem.
  • [X] Citation information

Usability

Reviewers are encouraged to submit suggestions (or pull requests) that will improve the usability of the package as a whole. Package structure should follow general community best-practices. In general please consider whether:

  • [X] Package documentation is clear and easy to find and use.
  • [X] The need for the package is clear
  • [X] All functions have documentation and associated examples for use
  • [X] The package is easy to install

Functionality

  • [X] Installation: Installation succeeds as documented.
  • [X] Functionality: Any functional claims of the software have been confirmed.
  • [X] Performance: Any performance claims of the software have been confirmed.
  • [X] Automated tests:
    • [X] All tests pass on the reviewer's local machine for the package version submitted by the author. Ideally this should be a tagged version making it easy for reviewers to install.
    • [X] Tests cover essential functions of the package and a reasonable range of inputs and conditions.
  • [X] Continuous Integration: Has continuous integration setup (We suggest using Github actions but any CI platform is acceptable for review)
  • [ ] Packaging guidelines: The package conforms to the pyOpenSci packaging guidelines. A few notable highlights to look at:
    • [X] Package supports modern versions of Python and not End of life versions.
    • [ ] Code format is standard throughout package and follows PEP 8 guidelines (CI tests for linting pass)

For packages also submitting to JOSS

  • [X] The package has an obvious research application according to JOSS's definition in their submission requirements.

Note: Be sure to check this carefully, as JOSS's submission requirements and scope differ from pyOpenSci's in terms of what types of packages are accepted.

The package contains a paper.md matching JOSS's requirements with:

  • [X] A short summary describing the high-level functionality of the software
  • [X] Authors: A list of authors with their affiliations
  • [X] A statement of need clearly stating problems the software is designed to solve and its target audience.
  • [ ] References: With DOIs for all those that have one (e.g. papers, datasets, software).

Final approval (post-review)

  • [ ] The author has responded to my review and made changes to my satisfaction. I recommend approving this package.

Estimated hours spent reviewing:


Review Comments

@bluegenes This is one of the best maintained packages I've seen. The development environment is easy to set up and work in, each commit has a meaningful message attached, and the documentation is extensive and up to date. I plan on showing this repository to my own team as an example of how projects should look when maintained well. I ran into a few hiccups along the way, however:

  • The nix flake and envrc need to be updated to make sure the entire environment is available with a simple call to direnv allow.

  • There are unused imports in benchmarks/benchmarks.py

  • Not all tests ran out of the box in the dev environment via the nix flake, but they were easily made to work and seem to pass in the GitHub Actions runs.

  • Flake8 revealed a lot of errors; it might not be the worst idea to run the code through a linter and fix everything that can be automatically fixed. This may help with the unused imports as well. These are non-critical, but it helps.

elais avatar Dec 27 '23 19:12 elais

Thank you reviewers! I’ll look over these when I’m back at work next week.

snacktavish avatar Jan 04 '24 23:01 snacktavish

OK! I looked over the reviews, which were both very positive. Congratulations @bluegenes on your very nice package! Post here when the reviews have been addressed and the reviewers will look over the changes. Thanks @elais and @LilyAnderssonLee for your thoughtful reviews!

snacktavish avatar Jan 12 '24 17:01 snacktavish

a tracking update - not yet a response ;) - because life does not proceed without GitHub issues!

@elais review issues:

  • The nix flake and envrc need to be updated to make sure the entire environment is available with a simple call to direnv allow.
  • https://github.com/sourmash-bio/sourmash/issues/2905
  • There are unused imports in benchmarks/benchmarks.py
  • https://github.com/sourmash-bio/sourmash/issues/2906
  • Not all tests ran out of the box in the dev environment via the nix flake, but they were easily made to work and seem to pass in the GitHub Actions runs.
  • https://github.com/sourmash-bio/sourmash/issues/2907
  • Flake8 revealed a lot of errors; it might not be the worst idea to run the code through a linter and fix everything that can be automatically fixed. This may help with the unused imports as well. These are non-critical, but it helps.
  • [ ] Code format is standard throughout package and follows PEP 8 guidelines (CI tests for linting pass)
  • https://github.com/sourmash-bio/sourmash/issues/2908
  • https://github.com/sourmash-bio/sourmash/issues/2909
  • [ ] Python versions supported,
  • https://github.com/sourmash-bio/sourmash/issues/2910

  • [ ] If applicable, how the package compares to other similar packages and/or how it relates to other packages in the scientific ecosystem.

  • https://github.com/sourmash-bio/sourmash/issues/2911
  • [ ] References: With DOIs for all those that have one (e.g. papers, datasets, software).
  • https://github.com/sourmash-bio/sourmash/issues/2912

ctb avatar Jan 13 '24 17:01 ctb

tracking links for @LilyAnderssonLee's review -

  • [ ] Badges for:
    • [ ] A repostatus.org badge,
    • [ ] Python versions supported,

NOTE: If the README has many more badges, you might want to consider using a table for badges: see this example. Such a table should be wider than it is high. (Note that a badge for pyOpenSci peer review will be provided upon acceptance.)

  • https://github.com/sourmash-bio/sourmash/issues/2909
  • https://github.com/sourmash-bio/sourmash/issues/2910
  • [ ] If applicable, how the package compares to other similar packages and/or how it relates to other packages in the scientific ecosystem.
  • https://github.com/sourmash-bio/sourmash/issues/2911
  • [ ] Packaging guidelines: The package conforms to the pyOpenSci packaging guidelines (https://www.pyopensci.org/python-package-guide).
    • [ ] Code format is standard throughout package and follows PEP 8 guidelines (CI tests for linting pass)
  • https://github.com/sourmash-bio/sourmash/issues/2908

For packages also submitting to JOSS

  • [ ] References: With DOIs for all those that have one (e.g. papers, datasets, software).
  • https://github.com/sourmash-bio/sourmash/issues/2912
  • 1. mamba was not callable after installing miniforge3 on my Mac OS.
  • https://github.com/sourmash-bio/sourmash/issues/2916
  • 2. Is it possible to include arguments to allow users to adjust the legends of both matrix and dendrogram plots in the function mentioned in the example: sourmash plot --pdf --labels ecoli_cmp?
  • https://github.com/sourmash-bio/sourmash/issues/2915
  • 3. Regarding benchmarks/benchmarks.py:
    • 1. Unused libraries: os & Path
    • 2. Magic numbers: There are some magic numbers in the code (e.g., 500, 21, 3000, 10000, 1000, 20). Is it possible to define these numbers as constants with meaningful names, or add comments to make the code more self-explanatory?
  • https://github.com/sourmash-bio/sourmash/issues/2906
  • https://github.com/sourmash-bio/sourmash/issues/2913
  • 4. The download link doesn't work in the section #Building your own LCA database (https://sourmash.readthedocs.io/en/latest/tutorials-lca.html). The correct link should be curl -L https://osf.io/bw8d7/download -o delmont-subsample-sigs.tar.gz
  • https://github.com/sourmash-bio/sourmash/issues/2914
  • 5. Regarding test/test_sourmash.py: refactor duplicate code into helper functions to avoid redundancy and improve maintainability.
  • https://github.com/sourmash-bio/sourmash/issues/2917
  • 6. I may have overlooked it in your documentation. It would be helpful if you could provide some recommendations for the selection of k-mer sizes and scaled values.
  • 7. What are the consequences of using different scaled values? Do you recommend using the same scaled value to generate the signatures for genomes and databases?
sourmash gather ecoli-genome.sig inter/genbank-k31.lca.json.gz 
WARNING: final scaled was 10000, vs query scaled of 1000
  • https://github.com/sourmash-bio/sourmash/issues/2918
  • 8. Is it possible to add explanations and default values for --threshold in the help message of sourmash lca classify -h?
  • https://github.com/sourmash-bio/sourmash/issues/2919

ctb avatar Jan 13 '24 17:01 ctb

  • The nix flake and envrc need to be updated to make sure the entire environment is available with a simple call to direnv allow.

re https://github.com/sourmash-bio/sourmash/issues/2905: @elais, are you by chance running this on ARM macOS? This works on Linux and I tried it on x86_64 macOS before, but I don't have access to an ARM macOS to try it out...

luizirber avatar Jan 14 '24 01:01 luizirber

Yes I am running it on an M1 Mac. I could try running it in a virtual machine to see if I get better results.

elais avatar Jan 22 '24 17:01 elais

Yes I am running it on an M1 Mac. I could try running it in a virtual machine to see if I get better results.

I think I found the underlying problem on an x86_64 macOS, will work on fixing it. Thanks for raising the issue!

luizirber avatar Jan 23 '24 02:01 luizirber

Hi @LilyAnderssonLee , @elais and @snacktavish,

Thank you so much for your reviews and your enthusiasm for sourmash! We have put a lot of effort into maintenance over the past few years, and it's fantastic to get some external feedback for improvement.

@ctb, @luizirber, and I (the sourmash maintainers) have addressed each of your comments via the issues posted above. As those address each point in-line, I will summarize here and provide text and explanations where needed.

We have cut a new release with these changes (v4.8.6) and plan to release v4.9.0 once the package is accepted and we have a new DOI.

JOSS manuscript fixes

  • ensure all citations have DOI and update citation for recently published preprint (https://github.com/sourmash-bio/sourmash/pull/2964)

README updates

Badges added:

  • repostatus and Python versions supported (pyver)
  • platform support (confirming Windows support)

Comparison to similar packages:

Our goal with sourmash is to maintain a Swiss Army knife of k-mer based tools for lightweight sequence analysis. There are myriad software tools that perform one or a few of sourmash's functions, such as pairwise sequence comparison (e.g., Mash) or taxonomic classification (e.g., Kraken, sylph), albeit with different approaches. To avoid a laundry list of comparisons, we have instead added the following text to the README, which we hope helps distinguish sourmash:

sourmash is a k-mer analysis multitool, and we aim to provide stable, robust programmatic and command-line APIs for a variety of sequence comparisons. Some of our special sauce includes:

  • FracMinHash sketching, which enables accurate comparisons (including ANI) between data sets of different sizes
  • sourmash gather, a combinatorial k-mer approach for more accurate metagenomic profiling

Please see the sourmash publications for details.

Documentation Improvements

k-mer and scaling recommendations

  • In the FAQ, we have adjusted the information on k-mer size and added a section on selecting a scaled value. These can be seen here: https://sourmash.readthedocs.io/en/latest/faq.html#what-k-mer-size-s-should-i-use-with-sourmash
  • The FAQ also links to the relevant section here: https://sourmash.readthedocs.io/en/latest/using-sourmash-a-guide.html#what-k-mer-size-s-should-i-use
  • fix download link in LCA tutorial (https://github.com/sourmash-bio/sourmash/pull/2920/files)

Installation:

  • we checked that the tutorial directs users to install the right version of miniforge for their operating system (the Linux installer is not usable on a Mac).
  • removed the bioconda requirement, as sourmash-minimal installs from conda-forge.
  • nix flake and envrc - update for arm-based macs (https://github.com/sourmash-bio/sourmash/pull/2975)

User Experience Improvements

  • Added --labels-to and --labels-from to allow user customization of plot labels with sourmash plot
  • Added details to --threshold argument in LCA classify
    • help="minimum number of hashes needed for a taxonomic classification (default: 5)")

Code format standardization

  • fixed unused imports in benchmarks/benchmarks.py
  • refactored benchmarks.py to label relevant constants used for testing (e.g. MinHash k-mer size and the number of times to repeat each function for benchmarking)
  • lint all files via ruff and set up a linting pre-commit hook to lint+fix all PRs (https://github.com/sourmash-bio/sourmash/pull/2427)

Duplicated code in test_sourmash.py

While we absolutely agree that helper functions improve maintainability, the duplicated code in our tests is a result of a specific testing philosophy: tests should be simple, "flat" (not call a lot of test-specific functions), have low code complexity (minimize if and for statements), and otherwise not be viewed as primary code to be maintained and refactored regularly.

This has indeed led to a lot of repeated code in the tests (which currently number 2753!), but often these tests must load or modify test data in a unique manner in order to trigger edge cases and check bugfixes we've completed. So we want to avoid editing the tests and potentially reducing branch coverage and their coverage of edge cases.

For the future, we would appreciate any suggestions on large-scale ways to discover and/or fix instances of widely-shared code blocks and to update some of the legacy items (e.g., decorators being used rather than our updated test fixtures).
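For instance, for new tests we may experiment with pytest fixtures in place of the older decorators; here is a hypothetical sketch (the fixture name and test are illustrative, not from our current test suite):

import glob

import pytest

import sourmash
import sourmash_tst_utils as utils  # the test suite's helper module, as in the review snippet above

@pytest.fixture
def genome_sigs():
    """Load the genome-s1* test signatures for any test that requests them."""
    paths = glob.glob(utils.get_test_data('genome-s1*.sig'))
    return [sourmash.load_one_signature(p, ksize=21, select_moltype='dna')
            for p in paths]

def test_self_similarity(genome_sigs):
    # every signature should have similarity 1.0 with itself
    for sig in genome_sigs:
        assert sig.similarity(sig) == 1.0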

bluegenes avatar Feb 12 '24 19:02 bluegenes

@bluegenes @ctb, @luizirber Thank you for your effort in addressing those comments. I am very pleased with your response. Well done!

LilyAnderssonLee avatar Feb 13 '24 07:02 LilyAnderssonLee

hi folks, please let me know if there's anything more we need to do! thanks!

ctb avatar Mar 05 '24 17:03 ctb

Hi all, Sorry for the delay! @bluegenes @ctb, @luizirber great work!!

@elais - were your comments sufficiently addressed? Any further comments to add on the updated version?

Thanks everyone!

snacktavish avatar Mar 08 '24 15:03 snacktavish

hi friends! 👋 thank you everyone for all of the work you've done here. it looks like we could use some input from @elais on whether comments were addressed. @elais can you kindly respond here and let us know so we can keep things moving ✨ many many thanks! 👐

lwasser avatar Mar 29 '24 15:03 lwasser

yay!! I heard back, and you are officially good to go! Congratulations and I apologize for the slow review process. I will post the wrap up checklist now!!

snacktavish avatar Apr 04 '24 22:04 snacktavish

🎉Sourmash has been approved by pyOpenSci! Thank you @bluegenes @luizirber and @ctb for submitting sourmash and many thanks to @elais and @LilyAnderssonLee for reviewing this package! 😸

Author Wrap Up Tasks

There are a few things left to do to wrap up this submission:

  • [x] Activate Zenodo watching the repo if you haven't already done so.
  • [x] Tag and create a release to create a Zenodo version and DOI.
  • [x] Add the badge for pyOpenSci peer-review to the README.md of sourmash. The badge should be [![pyOpenSci](https://tinyurl.com/y22nb8up)](https://github.com/pyOpenSci/software-review/issues/issue-number).
  • [x] Please fill out the post-review survey. All maintainers and reviewers should fill this out.

It looks like you would like to submit this package to JOSS. Here are the next steps:

  • [x] <Editor Task> Once the JOSS issue is opened for the package, we strongly suggest that you subscribe to issue updates. This will allow you to continue to update the issue labels on this review as it goes through the JOSS process.
  • [x] Login to the JOSS website and fill out the JOSS submission form using your Zenodo DOI. When you fill out the form, be sure to mention and link to the approved pyOpenSci review. JOSS will tag your package for expedited review if it is already pyOpenSci approved.
  • [x] Wait for a JOSS editor to approve the presubmission (which includes a scope check).
  • [ ] Once the package is approved by JOSS, you will be given instructions by JOSS about updating the citation information in your README file.
  • [ ] When the JOSS review is complete, add a comment to your review in the pyOpenSci software-review repo here that it has been approved by JOSS. An editor will then add the JOSS-approved label to this issue.

Editor Final Checks

Please complete the final steps to wrap up this review. Editor, please do the following:

  • [ ] Make sure that the maintainers filled out the post-review survey
  • [x] Invite the maintainers to submit a blog post highlighting their package. Feel free to use / adapt language found in this comment to help guide the author.
  • [x] Change the status tag of the issue to 6/pyOS-approved 🚀🚀🚀.
  • [x] Invite the package maintainer(s) and both reviewers to slack if they wish to join.
  • [x] If the author submits to JOSS, please continue to update the labels for JOSS on this issue until the author is accepted (do not remove the 6/pyOS-approved label). Once accepted add the label 9/joss-approved to the issue. Skip this check if the package is not submitted to JOSS.
  • [ ] If the package is JOSS-accepted please add the JOSS doi to the YAML at the top of the issue.

If you have any feedback for us about the review process please feel free to share it here. We are always looking to improve our process and documentation in the peer-review-guide.

snacktavish avatar Apr 04 '24 22:04 snacktavish

thanks all! really happy! :tada:

question: if we want to go for JOSS, should we do a new release now, for pyopensci, and then another, assuming JOSS accepts? or can we wait and just do a release if/when JOSS accepts? we don't have a real reason for a new release just yet (we've been doing them regularly anyway!) so could go either way. Just curious.

(we have citation instructions in the software that we would change for the JOSS release.)

ctb avatar Apr 06 '24 14:04 ctb