BlockingPy submission

Open T-Strojny opened this issue 1 year ago • 26 comments

Submitting Author: (@T-Strojny)
All current maintainers: @T-Strojny
Package Name: BlockingPy
One-Line Description of Package: Blocking records for record linkage and deduplication with Approximate Nearest Neighbor algorithms.
Repository Link: https://github.com/ncn-foreigners/BlockingPy
Version submitted: v0.1.7
EiC: @coatless
Editor: @crhea93, @isabelizimm
Reviewer 1: @akritaag
Reviewer 2: @eliotwrobson
Archive: TBD
JOSS DOI: TBD
Version accepted: TBD
Date accepted (month/day/year): TBD


Code of Conduct & Commitment to Maintain Package

Description

  • Include a brief paragraph describing what your package does: BlockingPy is a package that speeds up record linkage and deduplication tasks by using Approximate Nearest Neighbor (ANN) algorithms to create blocks with candidate record pairs. When linking or deduplicating large datasets, comparing all possible record pairs becomes computationally infeasible. BlockingPy solves this by using ANN algorithms to quickly identify similar records while significantly reducing the number of required comparisons.
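
To make the scale of the problem concrete, the sketch below counts candidate pairs with and without blocking. It is a toy illustration, not BlockingPy's API: the helper names and the block-by-first-letter rule are assumptions chosen for brevity, whereas the package itself builds blocks from ANN searches over feature vectors.

```python
from collections import defaultdict
from itertools import combinations


def all_pairs_count(n: int) -> int:
    """Number of unordered record pairs in a dataset of n records."""
    return n * (n - 1) // 2


def blocked_pairs(records, key):
    """Yield candidate index pairs only for records that share a block key."""
    blocks = defaultdict(list)
    for idx, rec in enumerate(records):
        blocks[key(rec)].append(idx)
    for members in blocks.values():
        yield from combinations(members, 2)


records = ["john smith", "jon smith", "mary jones", "marie jones", "alan turing"]

# Hypothetical blocking rule: records sharing a first letter land in one block.
candidates = list(blocked_pairs(records, key=lambda r: r[0]))

print(all_pairs_count(len(records)), len(candidates))  # prints "10 2"
```

With one million records the exhaustive count is already about 5 × 10^11 pairs, which is why the comparison space has to be reduced before any pairwise matching is attempted.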

Scope

  • Please indicate which category or categories. Check out our package scope page to learn more about our scope. (If you are unsure of which category you fit, we suggest you make a pre-submission inquiry):

    • [ ] Data retrieval
    • [ ] Data extraction
    • [x] Data processing/munging
    • [ ] Data deposition
    • [ ] Data validation and testing
    • [ ] Data visualization[^1]
    • [ ] Workflow automation
    • [ ] Citation management and bibliometrics
    • [ ] Scientific software wrappers
    • [ ] Database interoperability

Domain Specific

  • [ ] Geospatial
  • [ ] Education

Community Partnerships

If your package is associated with an existing community please check below:

[^1]: Please fill out a pre-submission inquiry before submitting a data visualization package.

  • For all submissions, explain how and why the package falls under the categories you indicated above. In your explanation, please address the following points (briefly, 1-2 sentences for each):
    • Data processing/munging : BlockingPy transforms raw data into feature vectors and applies ANN algorithms and graphs to reduce the comparison space which enables scalable record linkage and deduplication.

    • Who is the target audience and what are scientific applications of this package?
      BlockingPy is aimed at data scientists, researchers, and analysts who work with large datasets that require record matching or deduplication and who need a scalable approach.

    • Are there other Python packages that accomplish the same thing? If so, how does yours differ? There are many packages for record linkage; ours, however, specializes in the blocking task and takes a novel approach by using ANN algorithms.

    • If you made a pre-submission enquiry, please paste the link to the corresponding issue, forum post, or other discussion, or @tag the editor you contacted: No inquiry was made

Technical checks

For details about the pyOpenSci packaging requirements, see our packaging guide. Confirm each of the following by checking the box. This package:

  • [x] does not violate the Terms of Service of any service it interacts with.
  • [x] uses an OSI approved license.
  • [x] contains a README with instructions for installing the development version.
  • [x] includes documentation with examples for all functions.
  • [x] contains a tutorial with examples of its essential functions and uses.
  • [x] has a test suite.
  • [x] has continuous integration setup, such as GitHub Actions, CircleCI, and/or others.

Publication Options

JOSS Checks
  • [ ] The package has an obvious research application according to JOSS's definition in their submission requirements. Be aware that completing the pyOpenSci review process does not guarantee acceptance to JOSS. Be sure to read their submission requirements (linked above) if you are interested in submitting to JOSS.
  • [ ] The package is not a "minor utility" as defined by JOSS's submission requirements: "Minor ‘utility’ packages, including ‘thin’ API clients, are not acceptable." pyOpenSci welcomes these packages under "Data Retrieval", but JOSS has slightly different criteria.
  • [ ] The package contains a paper.md matching JOSS's requirements with a high-level description in the package root or in inst/.
  • [ ] The package is deposited in a long-term repository with the DOI:

Note: JOSS accepts our review as theirs. You will NOT need to go through another full review. JOSS will only review your paper.md file. Be sure to link to this pyOpenSci issue when a JOSS issue is opened for your package. Also be sure to tell the JOSS editor that this is a pyOpenSci reviewed package once you reach this step.

Are you OK with Reviewers Submitting Issues and/or pull requests to your Repo Directly?

This option will allow reviewers to open smaller issues that can then be linked to PR's rather than submitting a more dense text based review. It will also allow you to demonstrate addressing the issue via PR links.

  • [x] Yes I am OK with reviewers submitting requested changes as issues to my repo. Reviewers will then link to the issues in their submitted review.

Confirm each of the following by checking the box.

  • [x] I have read the author guide.
  • [x] I expect to maintain this package for at least 2 years and can help find a replacement for the maintainer (team) if needed.

Please fill out our survey

P.S. Have feedback/comments about our review process? Leave a comment here

Editor and Review Templates

The editor template can be found here.

The review template can be found here.

T-Strojny avatar Jan 09 '25 18:01 T-Strojny

Editor in Chief checks

Hi there! Thank you for submitting your package for pyOpenSci review. Below are the basic checks that your package needs to pass to begin our review. If some of these are missing, we will ask you to work on them before the review process begins.

Please check our Python packaging guide for more information on the elements below.

  • [x] Installation The package can be installed from a community repository such as PyPI (preferred), and/or a community channel on conda (e.g. conda-forge, bioconda).
    • [x] The package imports properly into a standard Python environment import package.
  • [x] Fit The package meets criteria for fit and overlap.
  • [x] Documentation The package has sufficient online documentation to allow us to evaluate package function and scope without installing the package. This includes:
    • [x] User-facing documentation that overviews how to install and start using the package.
    • [x] Short tutorials that help a user understand how to use the package and what it can do for them.
    • [x] API documentation (documentation for your code's functions, classes, methods and attributes): this includes clearly written docstrings with variables defined using a standard docstring format.
  • [x] Core GitHub repository Files
    • [x] README The package has a README.md file with clear explanation of what the package does, instructions on how to install it, and a link to development instructions.
    • [x] Contributing File The package has a CONTRIBUTING.md file that details how to install and contribute to the package.
    • [x] Code of Conduct The package has a CODE_OF_CONDUCT.md file.
    • [x] License The package has an OSI approved license. NOTE: We prefer that you have development instructions in your documentation too.
  • [x] Issue Submission Documentation All of the information is filled out in the YAML header of the issue (located at the top of the issue template).
  • [x] Automated tests Package has a testing suite and is tested via a Continuous Integration service.
  • [x] Repository The repository link resolves correctly.
  • [x] Package overlap The package doesn't entirely overlap with the functionality of other packages that have already been submitted to pyOpenSci.
  • [ ] Archive (JOSS only, may be post-review): The repository DOI resolves correctly.
  • [ ] Version (JOSS only, may be post-review): Does the release version given match the GitHub release (v1.0.0)?

  • [x] Initial onboarding survey was filled out We appreciate each maintainer of the package filling out this survey individually. :raised_hands: Thank you authors in advance for setting aside five to ten minutes to do this. It truly helps our organization. :raised_hands:


Editor comments

BlockingPy is in pristine shape for moving forward with a review! Nice work on getting it packaged for Python and implemented. Happy to see both mlpack and the original note on the R blocking package being emphasized.

coatless avatar Jan 21 '25 06:01 coatless

That's great to hear! Thank you for the feedback.

T-Strojny avatar Jan 21 '25 08:01 T-Strojny

@T-Strojny Thanks for your patience. I've secured an editor to further move the review along.

I am happy to announce that @isabelizimm will be the editor for your submission.

coatless avatar Mar 12 '25 21:03 coatless

Hi, just wanted to ask about any update on the review, thanks in advance!

T-Strojny avatar Apr 15 '25 20:04 T-Strojny

Hi there! I am currently reaching out to some reviewers, hoping to get the ball rolling here shortly 🤞

isabelizimm avatar Apr 28 '25 16:04 isabelizimm

Hey there team. What can I do to help move this review forward? It looks like it may be stuck at the finding-reviewers stage. @T-Strojny are you still around and eager to have this moved forward? If so, we can try to kickstart the process!

lwasser avatar Jul 29 '25 22:07 lwasser

Hi, I can confirm that we are still interested!

BERENZ avatar Jul 30 '25 06:07 BERENZ

Ok fantastic. Thank you for the speedy reply!

We do have 2 reviewers lined up (which is often the hardest part). We are looking for a backup editor. Someone from our team will get back to you soon!

lwasser avatar Jul 30 '25 15:07 lwasser

Editor response to review:


Editor comments

:wave: Hi @teald and @eliotwrobson Thank you for volunteering to review for pyOpenSci! I look forward to working with you all on this!

Please fill out our pre-review survey

Before beginning your review, please fill out our pre-review survey. This helps us improve all aspects of our review and better understand our community. No personal data will be shared from this survey - it will only be used in an aggregated format by our Executive Director to improve our processes and programs.

  • [ ] reviewer 1 survey completed.
  • [x] reviewer 2 survey completed.

Please let me know when you have completed this :D

The following resources will help you complete your review:

  1. Here is the reviewers guide. This guide contains all of the steps and information needed to complete your review.
  2. Here is the review template that you will need to fill out and submit here as a comment, once your review is complete.

Please get in touch with any questions or concerns! Your review is due: August 31st, 2025

Reviewers: @teald @eliotwrobson Due date: 08/31/2025

crhea93 avatar Aug 04 '25 13:08 crhea93

@BERENZ @T-Strojny Hi! I'll be the new editor for your submission :) I've tagged our two reviewers who will be starting the review process. Please let me know if you have any additional questions :D

crhea93 avatar Aug 04 '25 13:08 crhea93

Apologies for the delay in reviewing (life has been extremely busy lately), but I managed to get through the items that didn't involve installation and playing around with the examples. I'll be able to get to this tomorrow.

EDIT:

I was able to download and play around with the package and have completed my review. I think this is a really useful package for anyone working with large datasets, so naturally researchers are a huge target audience. With a few changes from the items highlighted in my review, I think the package's polish could be significantly improved, making this easier to use for the target audience. Feel free to ask any questions about the items I raised, and if any are too labor intensive, we can discuss whether any of these have to be blocking.

Package Review

Please check off boxes as applicable, and elaborate in comments below. Your review is not limited to these topics, as described in the reviewer guide

  • [x] As the reviewer, I confirm that there are no conflicts of interest for me to review this work (If you are unsure whether you are in conflict, please speak to your editor before starting your review).

Documentation

The package includes all the following forms of documentation:

  • [x] A statement of need clearly stating problems the software is designed to solve and its target audience in the README file.
  • [x] Installation instructions: for the development version of the package and any non-standard dependencies in README.
  • [x] Short quickstart tutorials demonstrating significant functionality that successfully runs locally.
  • [x] Function Documentation: for all user-facing functions.
  • [x] Examples for all user-facing functions.
  • [x] Community guidelines including contribution guidelines in the README or CONTRIBUTING.
  • [x] Metadata including author(s), author e-mail(s), a URL, and any other relevant metadata, for example, in a pyproject.toml file or elsewhere.

Readme file requirements The package meets the readme requirements below:

  • [x] Package has a README.md file in the root directory.

The README should include, from top to bottom:

  • [x] The package name
  • [x] Badges for:
    • [x] Continuous integration and test coverage,
    • [x] Docs building (if you have a documentation website),
    • [x] Python versions supported,
    • [x] Current package version (on PyPI / Conda).

NOTE: If the README has many more badges, you might want to consider using a table for badges: see this example. Such a table should be wider than high. A badge for pyOpenSci peer review will be provided when the package is accepted.

  • [x] Short description of package goals.
  • [x] Package installation instructions
  • [x] Any additional setup required to use the package (authentication tokens, etc.)
  • [x] Descriptive links to all vignettes. If the package is small, there may only be a need for one vignette which could be placed in the README.md file.
    • [x] Brief demonstration of package usage (as it makes sense - links to vignettes could also suffice here if package description is clear)
  • [x] Link to your documentation website.
  • [x] If applicable, how the package compares to other similar packages and/or how it relates to other packages in the scientific ecosystem.
  • [x] Citation information

Usability

Reviewers are encouraged to submit suggestions (or pull requests) that will improve the usability of the package as a whole. The package structure should follow the general community best practices. In general, please consider whether:

  • [x] Package documentation is clear and easy to find and use.
  • [x] The need for the package is clear
  • [x] All functions have documentation and associated examples for use
  • [x] The package is easy to install

Functionality

  • [x] Installation: Installation succeeds as documented.
  • [x] Functionality: Any functional claims of the software have been confirmed.
  • [x] Performance: Any performance claims of the software have been confirmed.
  • [x] Automated tests:
    • [x] All tests pass on the reviewer's local machine for the package version submitted by the author. Ideally this should be a tagged version making it easy for reviewers to install.
    • [x] Tests cover essential functions of the package and a reasonable range of inputs and conditions.
  • [x] Continuous Integration: Has continuous integration setup (We suggest using Github actions but any CI platform is acceptable for review)
  • [x] Packaging guidelines: The package conforms to the pyOpenSci packaging guidelines. A few notable highlights to look at:
    • [x] Package supports modern versions of Python and not End of life versions.
    • [x] Code format is standard throughout package and follows PEP 8 guidelines (CI tests for linting pass)

For packages also submitting to JOSS

  • [ ] The package has an obvious research application according to JOSS's definition in their submission requirements.

Note: Be sure to check this carefully, as JOSS's submission requirements and scope differ from pyOpenSci's in terms of what types of packages are accepted.

The package contains a paper.md matching JOSS's requirements with:

  • [ ] A short summary describing the high-level functionality of the software
  • [ ] Authors: A list of authors with their affiliations
  • [ ] A statement of need clearly stating problems the software is designed to solve and its target audience.
  • [ ] References: With DOIs for all those that have one (e.g. papers, datasets, software).

Final approval (post-review)

  • [x] The author has responded to my review and made changes to my satisfaction. I recommend approving this package.

Estimated hours spent reviewing:


Review Comments

  • I think including poetry installation instructions in the README just adds clutter; the pip instructions are fine on their own.
  • It looks like you're missing a more comprehensive overview of the functions / classes available in the package in the API documentation. This is especially helpful to understand the different functionality provided by the package, and is part of the items above.
  • If your code is being linted, you should add the linting as part of the automated workflow that runs your tests (or a separate workflow, but still this should be checked there somehow). It also looks like your code has type annotations, so adding mypy to your CI workflow would improve your code quality and maintainability as well.
  • You should add a link directly to the examples from the documentation in the README as part of the basic usage.
  • It was a bit tricky to get tests to run locally since the test workflow on GitHub doesn't use a virtual environment through poetry. I would suggest switching to this or using uv for this purpose.
  • I think the code coverage can be improved by covering the blocking_result.py file more, and excluding the gpu file from coverage (or using mocks to cover the code there).
  • I'm no expert on best practices here, but you use the logger to emit warnings in a few places; I think it would be better to use the warnings module for this.
  • Your package includes hardcoded datasets that are mainly used for just examples, but this adds to the size of the package while including data that many users may not use or care about. What would be better is to have this example data hosted on the repo for the project, but not part of the package itself, then use the Pooch library to automate downloading of the data files to make the examples work.
  • Not a hard requirement, but as someone who isn't a super expert (but still somewhat familiar with this topic) it would be helpful to have some discussion somewhere about the different algorithms and performance metrics. It looks like these are discussed in the associated paper, so it would be awesome if some of that discussion could be added to the documentation site (even if truncated).
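
The suggestion above about the warnings module can be sketched as follows. This is a minimal illustration and not code from BlockingPy: the function `check_k` and its clamping behaviour are hypothetical. The idea is that `warnings.warn` lets callers filter, silence, or escalate the message, while the logger stays available for diagnostic output.

```python
import logging
import warnings

logger = logging.getLogger(__name__)


def check_k(k: int, n_records: int) -> int:
    """Clamp k to the dataset size, warning the caller when it was too large."""
    if k > n_records:
        # stacklevel=2 attributes the warning to the caller's line,
        # which is usually where the fix has to happen.
        warnings.warn(
            f"k={k} exceeds the number of records ({n_records}); using k={n_records}.",
            UserWarning,
            stacklevel=2,
        )
        return n_records
    logger.debug("k=%d accepted", k)
    return k


# Tests can capture warnings and assert on them directly.
with warnings.catch_warnings(record=True) as caught:
    warnings.simplefilter("always")
    k = check_k(10, 3)
```

Users who want strict behaviour can then run with `warnings.simplefilter("error")` to turn such messages into exceptions, which is not possible with logger output.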

eliotwrobson avatar Sep 01 '25 05:09 eliotwrobson

Hi @eliotwrobson, thanks! Please note that a new version of the package arrived a couple of days ago and includes performance improvements as well as a new version that leverages GPUs for blocking (see the blockingpy-gpu version).

BERENZ avatar Sep 01 '25 06:09 BERENZ

Thanks for the great work @eliotwrobson !

crhea93 avatar Sep 01 '25 12:09 crhea93

@teald I'm just checking in to see how the review is coming along. Could you please post your progress here?

crhea93 avatar Sep 01 '25 21:09 crhea93

@BERENZ Due to unforeseen circumstances, Teal will no longer be able to review this package. I'll be looking for and onboarding a new reviewer ASAP. Thank you for your patience.

crhea93 avatar Sep 03 '25 18:09 crhea93

@akritaag Thank you very much for agreeing to proceed as a reviewer!

crhea93 avatar Sep 03 '25 23:09 crhea93

I've filled out the pre-review survey and here is my package review as well -

Package Review

Please check off boxes as applicable, and elaborate in comments below. Your review is not limited to these topics, as described in the reviewer guide

  • [x] As the reviewer, I confirm that there are no conflicts of interest for me to review this work (If you are unsure whether you are in conflict, please speak to your editor before starting your review).

Documentation

The package includes all the following forms of documentation:

  • [x] A statement of need clearly stating problems the software is designed to solve and its target audience in the README file.
  • [x] Installation instructions: for the development version of the package and any non-standard dependencies in README.
  • [x] Short quickstart tutorials demonstrating significant functionality that successfully runs locally.
  • [x] Function Documentation: for all user-facing functions.
  • [x] Examples for all user-facing functions.
  • [x] Community guidelines including contribution guidelines in the README or CONTRIBUTING.
  • [x] Metadata including author(s), author e-mail(s), a URL, and any other relevant metadata, for example, in a pyproject.toml file or elsewhere.

Readme file requirements The package meets the readme requirements below:

  • [x] Package has a README.md file in the root directory.

The README should include, from top to bottom:

  • [x] The package name
  • [x] Badges for:
    • [x] Continuous integration and test coverage,
    • [x] Docs building (if you have a documentation website),
    • [x] Python versions supported,
    • [x] Current package version (on PyPI / Conda).

NOTE: If the README has many more badges, you might want to consider using a table for badges: see this example. Such a table should be wider than high. A badge for pyOpenSci peer review will be provided when the package is accepted.

  • [x] Short description of package goals.
  • [x] Package installation instructions
  • [x] Any additional setup required to use the package (authentication tokens, etc.)
  • [x] Descriptive links to all vignettes. If the package is small, there may only be a need for one vignette which could be placed in the README.md file.
    • [x] Brief demonstration of package usage (as it makes sense - links to vignettes could also suffice here if package description is clear)
  • [x] Link to your documentation website.
  • [x] If applicable, how the package compares to other similar packages and/or how it relates to other packages in the scientific ecosystem.
  • [ ] Citation information

Usability

Reviewers are encouraged to submit suggestions (or pull requests) that will improve the usability of the package as a whole. The package structure should follow the general community best practices. In general, please consider whether:

  • [x] Package documentation is clear and easy to find and use.
  • [x] The need for the package is clear
  • [ ] All functions have documentation and associated examples for use
  • [x] The package is easy to install

Functionality

  • [x] Installation: Installation succeeds as documented.
  • [x] Functionality: Any functional claims of the software have been confirmed.
  • [ ] Performance: Any performance claims of the software have been confirmed.
  • [ ] Automated tests:
    • [ ] All tests pass on the reviewer's local machine for the package version submitted by the author. Ideally this should be a tagged version making it easy for reviewers to install.
    • [x] Tests cover essential functions of the package and a reasonable range of inputs and conditions.
  • [ ] Continuous Integration: Has continuous integration setup (We suggest using Github actions but any CI platform is acceptable for review)
  • [x] Packaging guidelines: The package conforms to the pyOpenSci packaging guidelines. A few notable highlights to look at:
    • [x] Package supports modern versions of Python and not End of life versions.
    • [x] Code format is standard throughout package and follows PEP 8 guidelines (CI tests for linting pass)

For packages also submitting to JOSS

  • [ ] The package has an obvious research application according to JOSS's definition in their submission requirements.

Note: Be sure to check this carefully, as JOSS's submission requirements and scope differ from pyOpenSci's in terms of what types of packages are accepted.

The package contains a paper.md matching JOSS's requirements with:

  • [ ] A short summary describing the high-level functionality of the software
  • [ ] Authors: A list of authors with their affiliations
  • [ ] A statement of need clearly stating problems the software is designed to solve and its target audience.
  • [ ] References: With DOIs for all those that have one (e.g. papers, datasets, software).

Final approval (post-review)

  • [x] The author has responded to my review and made changes to my satisfaction. I recommend approving this package.

Estimated hours spent reviewing:


Review Comments

  1. This document was linked for a GPU version but returns a 404 error: https://blockingpy.readthedocs.io/en/latest/gpu/index.html
  2. Issue: The package defaults to using FAISS (ann="faiss") but FAISS is not included as a required dependency in pyproject.toml. Impact: Users get ModuleNotFoundError: No module named 'faiss' when trying to use the default functionality. I have created an issue and a pull request for this.
  3. Should there be examples to run other algorithms besides faiss? And scenarios where we might need other algorithms?
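
One common way to handle the situation in item 2, where a default backend is an optional dependency, is to guard the import and raise an actionable error. The sketch below is an assumption about how that could look and is not the fix proposed in the linked issue or pull request; `load_backend` is a hypothetical helper.

```python
import importlib


def load_backend(name: str):
    """Import an optional ANN backend, or fail with an actionable message."""
    try:
        return importlib.import_module(name)
    except ModuleNotFoundError as exc:
        # Chain the original error so the traceback still shows the root cause.
        raise ImportError(
            f"The '{name}' backend is not installed. Install it, or pass a "
            "different value for the `ann` argument."
        ) from exc
```

The alternative taken up in the review, declaring the default backend as a required dependency, avoids the error entirely at the cost of a heavier install.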

akritaag avatar Sep 09 '25 01:09 akritaag

Hello, thank you very much for your reviews. I just wanted to let you know that I will be able to address both of them next week.

T-Strojny avatar Sep 09 '25 13:09 T-Strojny

@T-Strojny I have a PR for the 2nd: https://github.com/ncn-foreigners/BlockingPy/issues/9

akritaag avatar Sep 09 '25 15:09 akritaag

Hello, here are the changes we've made:

Regarding @eliotwrobson’s review:

  • Removed Poetry installation instructions from the README.
  • Expanded and fixed the API section in docs.
  • Linting, formatting, type checking workflow was added (ruff, mypy).
  • Added a link to examples in the README.
  • We've added venv using uv to the workflows.
  • Coverage was improved by adding mocks for GPU file and expanding tests for blocking_result.py (60% -> 83%)
  • logger warnings were replaced with warnings module
  • Datasets are now fetched from data release on github with pooch
  • Regarding metrics: we discuss them here, and there is also some information about speed and performance here. If this is not sufficient or you had something else in mind, I’d be happy to expand the documentation further.
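
As an illustration of the mocking approach mentioned above, the sketch below exercises a GPU code path on a machine without a GPU by substituting a fake module. It is not taken from the project's test suite: `fake_gpu_lib`, `gpu_search`, and the `Index` interface are all hypothetical names.

```python
import sys
from unittest import mock


def gpu_search(vectors, k):
    """Hypothetical GPU code path that imports its GPU library lazily."""
    import fake_gpu_lib  # hypothetical module name, not a real package

    index = fake_gpu_lib.Index(len(vectors[0]))
    index.add(vectors)
    return index.search(vectors, k)


def test_gpu_search_without_gpu():
    # Stand-in for the GPU library; search() returns canned results.
    fake = mock.MagicMock()
    fake.Index.return_value.search.return_value = ([[0.0]], [[0]])

    # Patching sys.modules makes the lazy import resolve to the fake.
    with mock.patch.dict(sys.modules, {"fake_gpu_lib": fake}):
        distances, neighbours = gpu_search([[1.0, 2.0]], k=1)

    fake.Index.assert_called_once_with(2)
    fake.Index.return_value.add.assert_called_once_with([[1.0, 2.0]])
    assert neighbours == [[0]]


test_gpu_search_without_gpu()
```

Tests like this check that the wrapper calls the library correctly, but not that the library's results are right, so they complement rather than replace a run on real GPU hardware.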

As for @akritaag's review:

  • the 404 error was fixed.
  • I have added instructions to CONTRIBUTING.md on how to install the packages as editables to ensure everything works correctly. This should make it easier to test, develop, and contribute to the packages in the future. I also fixed some issues in the pyproject.toml files that previously caused problems during editable installs. Everything should work now if done according to the instructions.
  • The documentation already includes examples using other algorithms such as hnsw and voyager. If this was not what you meant, could you clarify? I’ll be happy to address it.

Once again, thank you very much for your reviews.

T-Strojny avatar Sep 29 '25 14:09 T-Strojny

@T-Strojny Thank you very much for the in depth responses!

@akritaag It looks like you have accepted this review -- is that correct? I just want to verify :)

@eliotwrobson Thank you for the in depth review! When you have a free moment, would you kindly review the responses above.

Thank you all!

crhea93 avatar Sep 30 '25 18:09 crhea93

@crhea93 looks great from my end!

eliotwrobson avatar Oct 01 '25 16:10 eliotwrobson

Looks good to me as well @crhea93 💯

akritaag avatar Oct 01 '25 23:10 akritaag


🎉 BlockingPy has been approved by pyOpenSci! Thank you @T-Strojny for submitting BlockingPy and many thanks to @akritaag and @eliotwrobson for reviewing this package! 😸

Author Wrap Up Tasks

There are a few things left to do to wrap up this submission:

  • [ ] Activate Zenodo watching the repo if you haven't already done so.
  • [ ] Tag and create a release to create a Zenodo version and DOI.
  • [ ] Add the badge for pyOpenSci peer-review to the README.md of . The badge should be [![pyOpenSci Peer-Reviewed](https://pyopensci.org/badges/peer-reviewed.svg)](https://github.com/pyOpenSci/software-review/issues/issue-number).
  • [x] Please fill out the post-review survey. All maintainers and reviewers should fill this out.

Editor Final Checks

Please complete the final steps to wrap up this review. Editor, please do the following:

  • [x] Make sure that the maintainers filled out the post-review survey
  • [x] Invite the maintainers to submit a blog post highlighting their package. Feel free to use / adapt language found in this comment to help guide the author.
  • [x] Change the status tag of the issue to 6/pyOS-approved6 🚀🚀🚀.
  • [x] Invite the package maintainer(s) and both reviewers to slack if they wish to join.

If you have any feedback for us about the review process please feel free to share it here. We are always looking to improve our process and documentation in the peer-review-guide.

crhea93 avatar Oct 01 '25 23:10 crhea93

@T-Strojny Thank you for the great work on this submission! When you have a minute, please complete the post review survey. If you wouldn't mind letting me know here when you are done, that would be most appreciated!

If you feel up to it, we invite you to submit a blog post highlighting your great work!

Finally, we have a vibrant slack community that we would like to invite you to. If you are interested in joining, please let me know :D

crhea93 avatar Oct 02 '25 00:10 crhea93

@crhea93 I have completed the survey!

T-Strojny avatar Oct 03 '25 14:10 T-Strojny