BlockingPy submission
Submitting Author: (@T-Strojny)
All current maintainers: @T-Strojny
Package Name: BlockingPy
One-Line Description of Package: Blocking records for record linkage and deduplication with Approximate Nearest Neighbor algorithms.
Repository Link: https://github.com/ncn-foreigners/BlockingPy
Version submitted: v0.1.7
EiC: @coatless
Editor: @crhea93, @isabelizimm
Reviewer 1: @akritaag
Reviewer 2: @eliotwrobson
Archive: TBD
JOSS DOI: TBD
Version accepted: TBD
Date accepted (month/day/year): TBD
Code of Conduct & Commitment to Maintain Package
- [x] I agree to abide by pyOpenSci's Code of Conduct during the review process and in maintaining my package should it be accepted.
- [x] I have read and will commit to package maintenance after the review as per the pyOpenSci Policies Guidelines.
Description
- Include a brief paragraph describing what your package does: BlockingPy is a package that speeds up record linkage and deduplication tasks by using Approximate Nearest Neighbor (ANN) algorithms to create blocks with candidate record pairs. When linking or deduplicating large datasets, comparing all possible record pairs becomes computationally infeasible. BlockingPy solves this by using ANN algorithms to quickly identify similar records while significantly reducing the number of required comparisons.
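For a quick sense of the intended workflow, here is a minimal usage sketch. It assumes the `Blocker.block()` interface described in the BlockingPy documentation and the `ann` backend names mentioned later in this thread; the exact argument names are an assumption, not re-verified here.

```python
# Illustrative sketch only; assumes BlockingPy's documented Blocker.block() interface.
import pandas as pd
from blockingpy import Blocker

# Two small text datasets to link (toy data for illustration).
census = pd.Series(["john smith new york", "anna brown chicago", "jon smyth ny"])
survey = pd.Series(["john smith ny", "ann brown chicago"])

blocker = Blocker()

# Record linkage: candidate pairs between two datasets.
linkage_blocks = blocker.block(x=census, y=survey, ann="faiss")

# Deduplication: candidate duplicate pairs within a single dataset.
dedup_blocks = blocker.block(x=census, ann="hnsw")

print(linkage_blocks)  # summary of the blocks and the reduction of the comparison space
```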
Scope
- Please indicate which category or categories. Check out our package scope page to learn more about our scope. (If you are unsure of which category you fit, we suggest you make a pre-submission inquiry):
- [ ] Data retrieval
- [ ] Data extraction
- [x] Data processing/munging
- [ ] Data deposition
- [ ] Data validation and testing
- [ ] Data visualization[^1]
- [ ] Workflow automation
- [ ] Citation management and bibliometrics
- [ ] Scientific software wrappers
- [ ] Database interoperability
Domain Specific
- [ ] Geospatial
- [ ] Education
Community Partnerships
If your package is associated with an existing community please check below:
- [ ] Astropy: My package adheres to Astropy community standards
- [ ] Pangeo: My package adheres to the Pangeo standards listed in the pyOpenSci peer review guidebook
[^1]: Please fill out a pre-submission inquiry before submitting a data visualization package.
- For all submissions, explain how and why the package falls under the categories you indicated above. In your explanation, please address the following points (briefly, 1-2 sentences for each):
- Data processing/munging: BlockingPy transforms raw data into feature vectors and applies ANN algorithms and graph methods to reduce the comparison space, which enables scalable record linkage and deduplication.
- Who is the target audience and what are scientific applications of this package? BlockingPy is targeted at data scientists, researchers, and analysts working with large datasets that require record matching or deduplication and need a scalable approach.
- Are there other Python packages that accomplish the same thing? If so, how does yours differ? There are many packages for record linkage; however, ours specializes in the blocking task and takes a novel approach: the use of ANN algorithms.
- If you made a pre-submission inquiry, please paste the link to the corresponding issue, forum post, or other discussion, or @tag the editor you contacted: No inquiry was made.
Technical checks
For details about the pyOpenSci packaging requirements, see our packaging guide. Confirm each of the following by checking the box. This package:
- [x] does not violate the Terms of Service of any service it interacts with.
- [x] uses an OSI approved license.
- [x] contains a README with instructions for installing the development version.
- [x] includes documentation with examples for all functions.
- [x] contains a tutorial with examples of its essential functions and uses.
- [x] has a test suite.
- [x] has continuous integration setup, such as GitHub Actions, CircleCI, and/or others.
Publication Options
- [ ] Do you wish to automatically submit to the Journal of Open Source Software? If so:
JOSS Checks
- [ ] The package has an obvious research application according to JOSS's definition in their submission requirements. Be aware that completing the pyOpenSci review process does not guarantee acceptance to JOSS. Be sure to read their submission requirements (linked above) if you are interested in submitting to JOSS.
- [ ] The package is not a "minor utility" as defined by JOSS's submission requirements: "Minor ‘utility’ packages, including ‘thin’ API clients, are not acceptable." pyOpenSci welcomes these packages under "Data Retrieval", but JOSS has slightly different criteria.
- [ ] The package contains a `paper.md` matching JOSS's requirements with a high-level description in the package root or in `inst/`.
- [ ] The package is deposited in a long-term repository with the DOI:
Note: JOSS accepts our review as theirs. You will NOT need to go through another full review. JOSS will only review your paper.md file. Be sure to link to this pyOpenSci issue when a JOSS issue is opened for your package. Also be sure to tell the JOSS editor that this is a pyOpenSci reviewed package once you reach this step.
Are you OK with Reviewers Submitting Issues and/or pull requests to your Repo Directly?
This option will allow reviewers to open smaller issues that can then be linked to PRs rather than submitting a more dense text-based review. It will also allow you to demonstrate addressing the issue via PR links.
- [x] Yes I am OK with reviewers submitting requested changes as issues to my repo. Reviewers will then link to the issues in their submitted review.
Confirm each of the following by checking the box.
- [x] I have read the author guide.
- [x] I expect to maintain this package for at least 2 years and can help find a replacement for the maintainer (team) if needed.
Please fill out our survey
- [x] Last but not least please fill out our pre-review survey. This helps us track submission and improve our peer review process. We will also ask our reviewers and editors to fill this out.
P.S. Have feedback/comments about our review process? Leave a comment here
Editor and Review Templates
Editor in Chief checks
Hi there! Thank you for submitting your package for pyOpenSci review. Below are the basic checks that your package needs to pass to begin our review. If some of these are missing, we will ask you to work on them before the review process begins.
Please check our Python packaging guide for more information on the elements below.
- [x] Installation The package can be installed from a community repository such as PyPI (preferred), and/or a community channel on conda (e.g. conda-forge, bioconda).
- [x] The package imports properly into a standard Python environment (`import package`).
- [x] Fit The package meets criteria for fit and overlap.
- [x] Documentation The package has sufficient online documentation to allow us to evaluate package function and scope without installing the package. This includes:
- [x] User-facing documentation that overviews how to install and start using the package.
- [x] Short tutorials that help a user understand how to use the package and what it can do for them.
- [x] API documentation (documentation for your code's functions, classes, methods and attributes): this includes clearly written docstrings with variables defined using a standard docstring format.
- [x] Core GitHub repository Files
- [x] README The package has a `README.md` file with clear explanation of what the package does, instructions on how to install it, and a link to development instructions.
- [x] Contributing File The package has a `CONTRIBUTING.md` file that details how to install and contribute to the package.
- [x] Code of Conduct The package has a `CODE_OF_CONDUCT.md` file.
- [x] License The package has an OSI approved license. NOTE: We prefer that you have development instructions in your documentation too.
- [x] Issue Submission Documentation All of the information is filled out in the `YAML` header of the issue (located at the top of the issue template).
- [x] Automated tests Package has a testing suite and is tested via a Continuous Integration service.
- [x] Repository The repository link resolves correctly.
- [x] Package overlap The package doesn't entirely overlap with the functionality of other packages that have already been submitted to pyOpenSci.
- [ ] Archive (JOSS only, may be post-review): The repository DOI resolves correctly.
- [ ] Version (JOSS only, may be post-review): Does the release version given match the GitHub release (v1.0.0)?
- [x] Initial onboarding survey was filled out We appreciate each maintainer of the package filling out this survey individually. :raised_hands: Thank you authors in advance for setting aside five to ten minutes to do this. It truly helps our organization. :raised_hands:
Editor comments
BlockingPy is in pristine shape for moving forward with a review! Nice work on getting it packaged for Python and implemented. Happy to see both mlpack and the note on the original R blocking package being emphasized.
That's great to hear! Thank you for the feedback.
@T-Strojny Thanks for your patience. I've secured an editor to further move the review along.
I am happy to announce that @isabelizimm will be the editor for your submission.
Hi, just wanted to ask about any update on the review, thanks in advance!
Hi there! I am currently reaching out to some reviewers, hoping to get the ball rolling here shortly 🤞
Hey there, team. What can I do to help move this review forward? It looks like it may be stuck in the finding-reviewers stage. @T-Strojny are you still around and eager to have this moved forward? If so we can try to kickstart the process!
Hi, I can confirm that we are still interested!
Ok fantastic. Thank you for the speedy reply!
We do have 2 reviewers lined up (which is often the hardest part). We are looking for a backup editor. Someone from our team will get back to you soon!
Editor response to review:
Editor comments
:wave: Hi @teald and @eliotwrobson Thank you for volunteering to review for pyOpenSci! I look forward to working with you all on this!
Please fill out our pre-review survey
Before beginning your review, please fill out our pre-review survey. This helps us improve all aspects of our review and better understand our community. No personal data will be shared from this survey - it will only be used in an aggregated format by our Executive Director to improve our processes and programs.
- [ ] reviewer 1 survey completed.
- [x] reviewer 2 survey completed.
Please let me know when you have completed this :D
The following resources will help you complete your review:
- Here is the reviewers guide. This guide contains all of the steps and information needed to complete your review.
- Here is the review template that you will need to fill out and submit here as a comment, once your review is complete.
Please get in touch with any questions or concerns! Your review is due: August 31st, 2025
Reviewers: @teald @eliotwrobson Due date: 08/31/2025
@BERENZ @T-Strojny Hi! I'll be the new editor for your submission :) I've tagged our two reviewers who will be starting the review process. Please let me know if you have any additional questions :D
Apologies for the delay in reviewing (life has been extremely busy lately), but I managed to get through the items that didn't involve installation and playing around with the examples. I'll be able to get to this tomorrow.
EDIT:
I was able to download and play around with the package and have completed my review. I think this is a really useful package for anyone working with large datasets, so naturally researchers are a huge target audience. With a few changes from the items highlighted in my review, I think the package's polish could be significantly improved, making this easier to use for the target audience. Feel free to ask any questions about the items I raised, and if any are too labor intensive, we can discuss whether any of these have to be blocking.
Package Review
Please check off boxes as applicable, and elaborate in comments below. Your review is not limited to these topics, as described in the reviewer guide
- [x] As the reviewer, I confirm that there are no conflicts of interest for me to review this work (If you are unsure whether you are in conflict, please speak to your editor before starting your review).
Documentation
The package includes all the following forms of documentation:
- [x] A statement of need clearly stating problems the software is designed to solve and its target audience in the README file.
- [x] Installation instructions: for the development version of the package and any non-standard dependencies in README.
- [x] Short quickstart tutorials demonstrating significant functionality that successfully runs locally.
- [x] Function Documentation: for all user-facing functions.
- [x] Examples for all user-facing functions.
- [x] Community guidelines including contribution guidelines in the README or CONTRIBUTING.
- [x] Metadata including author(s), author e-mail(s), a URL, and any other relevant metadata, for example, in a `pyproject.toml` file or elsewhere.
Readme file requirements The package meets the readme requirements below:
- [x] Package has a README.md file in the root directory.
The README should include, from top to bottom:
- [x] The package name
- [x] Badges for:
- [x] Continuous integration and test coverage,
- [x] Docs building (if you have a documentation website),
- [x] Python versions supported,
- [x] Current package version (on PyPI / Conda).
NOTE: If the README has many more badges, you might want to consider using a table for badges: see this example. Such a table should be wider than high. A badge for pyOpenSci peer review will be provided when the package is accepted.
- [x] Short description of package goals.
- [x] Package installation instructions
- [x] Any additional setup required to use the package (authentication tokens, etc.)
- [x] Descriptive links to all vignettes. If the package is small, there may only be a need for one vignette which could be placed in the README.md file.
- [x] Brief demonstration of package usage (as it makes sense - links to vignettes could also suffice here if package description is clear)
- [x] Link to your documentation website.
- [x] If applicable, how the package compares to other similar packages and/or how it relates to other packages in the scientific ecosystem.
- [x] Citation information
Usability
Reviewers are encouraged to submit suggestions (or pull requests) that will improve the usability of the package as a whole. The package structure should follow the general community best practices. In general, please consider whether:
- [x] Package documentation is clear and easy to find and use.
- [x] The need for the package is clear
- [x] All functions have documentation and associated examples for use
- [x] The package is easy to install
Functionality
- [x] Installation: Installation succeeds as documented.
- [x] Functionality: Any functional claims of the software have been confirmed.
- [x] Performance: Any performance claims of the software have been confirmed.
- [x] Automated tests:
- [x] All tests pass on the reviewer's local machine for the package version submitted by the author. Ideally this should be a tagged version making it easy for reviewers to install.
- [x] Tests cover essential functions of the package and a reasonable range of inputs and conditions.
- [x] Continuous Integration: Has continuous integration setup (We suggest using Github actions but any CI platform is acceptable for review)
- [x] Packaging guidelines: The package conforms to the pyOpenSci packaging guidelines.
A few notable highlights to look at:
- [x] Package supports modern versions of Python and not End of life versions.
- [x] Code format is standard throughout package and follows PEP 8 guidelines (CI tests for linting pass)
For packages also submitting to JOSS
- [ ] The package has an obvious research application according to JOSS's definition in their submission requirements.
Note: Be sure to check this carefully, as JOSS's submission requirements and scope differ from pyOpenSci's in terms of what types of packages are accepted.
The package contains a paper.md matching JOSS's requirements with:
- [ ] A short summary describing the high-level functionality of the software
- [ ] Authors: A list of authors with their affiliations
- [ ] A statement of need clearly stating problems the software is designed to solve and its target audience.
- [ ] References: With DOIs for all those that have one (e.g. papers, datasets, software).
Final approval (post-review)
- [x] The author has responded to my review and made changes to my satisfaction. I recommend approving this package.
Estimated hours spent reviewing:
Review Comments
- I think including Poetry installation instructions in the README just adds clutter; the pip instructions are fine on their own.
- It looks like you're missing a more comprehensive overview of the functions / classes available in the package in the API documentation. This is especially helpful to understand the different functionality provided by the package, and is part of the items above.
- If your code is being linted, you should run the linting as part of the automated workflow that runs your tests (or in a separate workflow, but it should be checked in CI somehow). It also looks like your code has type annotations, so adding mypy to your CI workflow would improve your code quality and maintainability as well.
- You should add a link directly to the examples from the documentation in the README as part of the basic usage.
- It was a bit tricky to get tests to run locally since the test workflow on GitHub doesn't use a virtual environment through poetry. I would suggest switching to this or using uv for this purpose.
- I think the code coverage can be improved by covering the `blocking_result.py` file more, and excluding the GPU file from coverage (or using mocks to cover the code there).
- I'm not an expert on best practices here, but you use the logger to emit warnings in a few places; I think it would be better to just use the `warnings` module for this.
- Your package includes hardcoded datasets that are mainly used just for examples, but this adds to the size of the package while bundling data that many users may not need. It would be better to host this example data in the project's repository, not in the package itself, and then use the Pooch library to automate downloading of the data files so the examples still work (a sketch of this and the warnings suggestion follows this list).
- Not a hard requirement, but as someone who isn't a super expert (but still somewhat familiar with this topic) it would be helpful to have some discussion somewhere about the different algorithms and performance metrics. It looks like these are discussed in the associated paper, so it would be awesome if some of that discussion could be added to the documentation site (even if truncated).
Hi @eliotwrobson, thanks! Please note that a new version of the package arrived a couple of days ago and includes performance improvements, as well as a new GPU-enabled variant for blocking (see the blockingpy-gpu version).
Thanks for the great work @eliotwrobson !
@teald I'm just checking in to see how the review is coming along. If you could please post your progress here.
@BERENZ Due to unforeseen circumstances, Teal will no longer be able to review this package. I'll be looking for and onboarding a new reviewer ASAP. Thank you for your patience.
@akritaag Thank you very much for agreeing to proceed as a reviewer!
I've filled out the pre-review survey and here is my package review as well -
Package Review
Please check off boxes as applicable, and elaborate in comments below. Your review is not limited to these topics, as described in the reviewer guide
- [x] As the reviewer, I confirm that there are no conflicts of interest for me to review this work (If you are unsure whether you are in conflict, please speak to your editor before starting your review).
Documentation
The package includes all the following forms of documentation:
- [x] A statement of need clearly stating problems the software is designed to solve and its target audience in the README file.
- [x] Installation instructions: for the development version of the package and any non-standard dependencies in README.
- [x] Short quickstart tutorials demonstrating significant functionality that successfully runs locally.
- [x] Function Documentation: for all user-facing functions.
- [x] Examples for all user-facing functions.
- [x] Community guidelines including contribution guidelines in the README or CONTRIBUTING.
- [x] Metadata including author(s), author e-mail(s), a URL, and any other relevant metadata, for example, in a `pyproject.toml` file or elsewhere.
Readme file requirements The package meets the readme requirements below:
- [x] Package has a README.md file in the root directory.
The README should include, from top to bottom:
- [x] The package name
- [x] Badges for:
- [x] Continuous integration and test coverage,
- [x] Docs building (if you have a documentation website),
- [x] Python versions supported,
- [x] Current package version (on PyPI / Conda).
NOTE: If the README has many more badges, you might want to consider using a table for badges: see this example. Such a table should be wider than high. A badge for pyOpenSci peer review will be provided when the package is accepted.
- [x] Short description of package goals.
- [x] Package installation instructions
- [x] Any additional setup required to use the package (authentication tokens, etc.)
- [x] Descriptive links to all vignettes. If the package is small, there may only be a need for one vignette which could be placed in the README.md file.
- [x] Brief demonstration of package usage (as it makes sense - links to vignettes could also suffice here if package description is clear)
- [x] Link to your documentation website.
- [x] If applicable, how the package compares to other similar packages and/or how it relates to other packages in the scientific ecosystem.
- [ ] Citation information
Usability
Reviewers are encouraged to submit suggestions (or pull requests) that will improve the usability of the package as a whole. The package structure should follow the general community best practices. In general, please consider whether:
- [x] Package documentation is clear and easy to find and use.
- [x] The need for the package is clear
- [ ] All functions have documentation and associated examples for use
- [x] The package is easy to install
Functionality
- [x] Installation: Installation succeeds as documented.
- [x] Functionality: Any functional claims of the software have been confirmed.
- [ ] Performance: Any performance claims of the software have been confirmed.
- [ ] Automated tests:
- [ ] All tests pass on the reviewer's local machine for the package version submitted by the author. Ideally this should be a tagged version making it easy for reviewers to install.
- [x] Tests cover essential functions of the package and a reasonable range of inputs and conditions.
- [ ] Continuous Integration: Has continuous integration setup (We suggest using Github actions but any CI platform is acceptable for review)
- [x] Packaging guidelines: The package conforms to the pyOpenSci packaging guidelines.
A few notable highlights to look at:
- [x] Package supports modern versions of Python and not End of life versions.
- [x] Code format is standard throughout package and follows PEP 8 guidelines (CI tests for linting pass)
For packages also submitting to JOSS
- [ ] The package has an obvious research application according to JOSS's definition in their submission requirements.
Note: Be sure to check this carefully, as JOSS's submission requirements and scope differ from pyOpenSci's in terms of what types of packages are accepted.
The package contains a paper.md matching JOSS's requirements with:
- [ ] A short summary describing the high-level functionality of the software
- [ ] Authors: A list of authors with their affiliations
- [ ] A statement of need clearly stating problems the software is designed to solve and its target audience.
- [ ] References: With DOIs for all those that have one (e.g. papers, datasets, software).
Final approval (post-review)
- [x] The author has responded to my review and made changes to my satisfaction. I recommend approving this package.
Estimated hours spent reviewing:
Review Comments
- This document was linked for a GPU version but returns a 404 error: https://blockingpy.readthedocs.io/en/latest/gpu/index.html
- Issue: The package defaults to using FAISS (`ann="faiss"`), but FAISS is not included as a required dependency in `pyproject.toml`. Impact: users get `ModuleNotFoundError: No module named 'faiss'` when trying to use the default functionality. I have created an issue and a pull request for this (a possible import-guard pattern is sketched after this list).
- Should there be examples to run other algorithms besides faiss? And scenarios where we might need other algorithms?
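For illustration, one common way to guard an optional default backend such as FAISS is an import-time check with an actionable error message; this is a generic pattern, not taken from BlockingPy's source.

```python
# Generic optional-dependency guard; names are illustrative.
def _require_faiss():
    try:
        import faiss  # provided by the 'faiss-cpu' (or 'faiss-gpu') distribution
    except ImportError as err:
        raise ImportError(
            "The default ann='faiss' backend requires FAISS. "
            "Install it with `pip install faiss-cpu`, or choose another backend, "
            "e.g. ann='hnsw'."
        ) from err
    return faiss
```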
Hello, thank you very much for your reviews. I just wanted to let you know that I will be able to address both of them next week.
@T-Strojny I have a PR for the 2nd: https://github.com/ncn-foreigners/BlockingPy/issues/9
Hello, here are the changes we've made:
Regarding @eliotwrobson’s review:
- Removed Poetry installation instructions from the README.
- Expanded and fixed the API section in docs.
- A linting, formatting, and type-checking workflow was added (ruff, mypy).
- Added a link to examples in the README.
- We've added a virtual environment (via uv) to the workflows.
- Coverage was improved by adding mocks for the GPU file and expanding tests for `blocking_result.py` (60% -> 83%).
- Logger warnings were replaced with the `warnings` module.
- Datasets are now fetched with `pooch` from a data release on GitHub.
- Regarding metrics: we discuss them here, and there is also some information about speed and performance here. If this is not sufficient or you had something else in mind, I'd be happy to expand the documentation further.
As for @akritaag's review:
- the 404 error was fixed.
- I have added instructions to `CONTRIBUTING.md` on how to install the packages as editables to ensure everything works correctly. This should make it easier to test, develop, and contribute to the packages in the future. I also fixed some issues in the `pyproject.toml` files that previously caused problems during editable installs. Everything should work now if done according to the instructions.
- The documentation already includes examples using other algorithms such as `hnsw` and `voyager`. If this was not what you meant, could you clarify? I'll be happy to address it.
Once again, thank you very much for your reviews.
@T-Strojny Thank you very much for the in-depth responses!
@akritaag It looks like you have accepted this review -- is that correct? I just want to verify :)
@eliotwrobson Thank you for the in-depth review! When you have a free moment, would you kindly review the responses above?
Thank you all!
@crhea93 looks great from my end!
Looks good to me as well @crhea93 💯
🎉 BlockingPy has been approved by pyOpenSci! Thank you @T-Strojny for submitting BlockingPy and many thanks to @akritaag and @eliotwrobson for reviewing this package! 😸
Author Wrap Up Tasks
There are a few things left to do to wrap up this submission:
- [ ] Activate Zenodo watching the repo if you haven't already done so.
- [ ] Tag and create a release to create a Zenodo version and DOI.
- [ ] Add the badge for pyOpenSci peer-review to the `README.md` of BlockingPy. The badge should be `[](https://github.com/pyOpenSci/software-review/issues/issue-number)`.
- [x] Please fill out the post-review survey. All maintainers and reviewers should fill this out.
Editor Final Checks
Please complete the final steps to wrap up this review. Editor, please do the following:
- [x] Make sure that the maintainers filled out the post-review survey
- [x] Invite the maintainers to submit a blog post highlighting their package. Feel free to use / adapt language found in this comment to help guide the author.
- [x] Change the status tag of the issue to `6/pyOS-approved` 🚀🚀🚀.
- [x] Invite the package maintainer(s) and both reviewers to Slack if they wish to join.
If you have any feedback for us about the review process please feel free to share it here. We are always looking to improve our process and documentation in the peer-review-guide.
@T-Strojny Thank you for the great work on this submission! When you have a minute, please complete the post review survey. If you wouldn't mind letting me know here when you are done, that would be most appreciated!
If you feel up to it, we invite you to submit a blog post highlighting your great work!
Finally, we have a vibrant slack community that we would like to invite you to. If you are interested in joining, please let me know :D
@crhea93 I have completed the survey!