joss-reviews
joss-reviews copied to clipboard
[REVIEW]: SHED: Streaming Heterogeneous Event Data Tracking with Provenance
Submitting author: @CJ-Wright (Christopher Wright) Repository: https://github.com/xpdAcq/SHED Branch with paper.md (empty if default branch): joss-paper Version: 0.7.5 Editor: @arfon Reviewers: @remram44, @Midnighter Archive: Pending
:warning: JOSS reduced service mode :warning:
Due to the challenges of the COVID-19 pandemic, JOSS is currently operating in a "reduced service mode". You can read more about what that means in our blog post.
Status
Status badge code:
HTML: <a href="https://joss.theoj.org/papers/89f9042e51bad3e80b14e88d0fdadc73"><img src="https://joss.theoj.org/papers/89f9042e51bad3e80b14e88d0fdadc73/status.svg"></a>
Markdown: [](https://joss.theoj.org/papers/89f9042e51bad3e80b14e88d0fdadc73)
Reviewers and authors:
Please avoid lengthy details of difficulties in the review thread. Instead, please create a new issue in the target repository and link to those issues (especially acceptance-blockers) by leaving comments in the review thread below. (For completists: if the target issue tracker is also on GitHub, linking the review thread in the issue or vice versa will create corresponding breadcrumb trails in the link target.)
Reviewer instructions & questions
@remram44, please carry out your review in this issue by updating the checklist below. If you cannot edit the checklist please:
- Make sure you're logged in to your GitHub account
- Be sure to accept the invite at this URL: https://github.com/openjournals/joss-reviews/invitations
The reviewer guidelines are available here: https://joss.readthedocs.io/en/latest/reviewer_guidelines.html. Any questions/concerns please let @arfon know.
β¨ Please start on your review when you are able, and be sure to complete your review in the next six weeks, at the very latest β¨
Review checklist for @remram44
Conflict of interest
- [x] I confirm that I have read the JOSS conflict of interest (COI) policy and that: I have no COIs with reviewing this work or that any perceived COIs have been waived by JOSS for the purpose of this review.
Code of Conduct
- [x] I confirm that I read and will adhere to the JOSS code of conduct.
General checks
- [x] Repository: Is the source code for this software available at the repository url?
- [x] License: Does the repository contain a plain-text LICENSE file with the contents of an OSI approved software license?
- [x] Contribution and authorship: Has the submitting author (@CJ-Wright) made major contributions to the software? Does the full list of paper authors seem appropriate and complete?
- [x] Substantial scholarly effort: Does this submission meet the scope eligibility described in the JOSS guidelines
Functionality
- [x] Installation: Does installation proceed as outlined in the documentation?
- [ ] Functionality: Have the functional claims of the software been confirmed?
- [ ] Performance: If there are any performance claims of the software, have they been confirmed? (If there are no claims, please check off this item.)
Documentation
- [ ] A statement of need: Do the authors clearly state what problems the software is designed to solve and who the target audience is?
- [ ] Installation instructions: Is there a clearly-stated list of dependencies? Ideally these should be handled with an automated package management solution.
- [x] Example usage: Do the authors include examples of how to use the software (ideally to solve real-world analysis problems).
- [ ] Functionality documentation: Is the core functionality of the software documented to a satisfactory level (e.g., API method documentation)?
- [x] Automated tests: Are there automated tests or manual steps described so that the functionality of the software can be verified?
- [ ] Community guidelines: Are there clear guidelines for third parties wishing to 1) Contribute to the software 2) Report issues or problems with the software 3) Seek support
Software paper
- [x] Summary: Has a clear description of the high-level functionality and purpose of the software for a diverse, non-specialist audience been provided?
- [x] A statement of need: Does the paper have a section titled 'Statement of Need' that clearly states what problems the software is designed to solve and who the target audience is?
- [ ] State of the field: Do the authors describe how this software compares to other commonly-used packages?
- [x] Quality of writing: Is the paper well written (i.e., it does not require editing for structure, language, or writing quality)?
- [x] References: Is the list of references complete, and is everything cited appropriately that should be cited (e.g., papers, datasets, software)? Do references in the text use the proper citation syntax?
Hello human, I'm @whedon, a robot that can help you with some common editorial tasks. @remram44 it looks like you're currently assigned to review this paper :tada:.
:warning: JOSS reduced service mode :warning:
Due to the challenges of the COVID-19 pandemic, JOSS is currently operating in a "reduced service mode". You can read more about what that means in our blog post.
:star: Important :star:
If you haven't already, you should seriously consider unsubscribing from GitHub notifications for this (https://github.com/openjournals/joss-reviews) repository. As a reviewer, you're probably currently watching this repository which means for GitHub's default behaviour you will receive notifications (emails) for all reviews πΏ
To fix this do the following two things:
- Set yourself as 'Not watching' https://github.com/openjournals/joss-reviews:

- You may also like to change your default settings for this watching repositories in your GitHub profile here: https://github.com/settings/notifications

For a list of things I can do to help you, just type:
@whedon commands
For example, to regenerate the paper pdf after making changes in the paper's md or bib files, type:
@whedon generate pdf
Software report (experimental):
github.com/AlDanial/cloc v 1.88 T=0.18 s (219.0 files/s, 57135.0 lines/s)
-------------------------------------------------------------------------------
Language files blank comment code
-------------------------------------------------------------------------------
Python 25 771 827 3276
reStructuredText 4 212 28 330
DOS Batch 1 34 2 239
make 1 32 5 204
YAML 4 11 7 154
Jupyter Notebook 2 0 3928 79
Markdown 1 4 0 30
INI 1 0 0 2
-------------------------------------------------------------------------------
SUM: 39 1064 4797 4314
-------------------------------------------------------------------------------
Statistical information for the repository 'b7d42d849242488e7a4a546b' was
gathered on 2021/03/17.
The following historical commit information, by author, was found:
Author Commits Insertions Deletions % of changes
Christopher J. Wrigh 100 2633 1171 6.18
Julien L 6 63 12 0.12
Julien Lhermitte 13 94 35 0.21
XPD Operator 2 27 6 0.05
christopher 388 30281 26985 93.06
st3107 11 107 122 0.37
Below are the number of rows from each author that have survived and are still
intact in the current revision:
Author Rows Stability Age % in comments
Julien Lhermitte 4 4.3 35.8 25.00
christopher 4775 15.8 21.9 8.67
st3107 95 88.8 1.0 0.00
PDF failed to compile for issue #3119 with the following error:
Can't find any papers to compile :-(
@remram44 β This is the review thread for the paper. All of our communications will happen here from now on.
Please read the "Reviewer instructions & questions" in the first comment above.
Both reviewers have checklists at the top of this thread (in that first comment) with the JOSS requirements. As you go over the submission, please check any items that you feel have been satisfied. There are also links to the JOSS reviewer guidelines.
The JOSS review is different from most other journals. Our goal is to work with the authors to help them meet our criteria instead of merely passing judgment on the submission. As such, the reviewers are encouraged to submit issues and pull requests on the software repository. When doing so, please mention https://github.com/openjournals/joss-reviews/issues/3119 so that a link is created to this thread (and I can keep an eye on what is happening). Please also feel free to comment and ask questions on this thread. In my experience, it is better to post comments/questions/suggestions as you come across them instead of waiting until you've reviewed the entire package.
We aim for the review process to be completed within about 4-6 weeks but please make a start well ahead of this as JOSS reviews are by their nature iterative and any early feedback you may be able to provide to the author will be very helpful in meeting this schedule.
@whedon generate pdf from branch joss-paper
Attempting PDF compilation from custom branch joss-paper. Reticulating splines etc...
:point_right::page_facing_up: Download article proof :page_facing_up: View article proof on GitHub :page_facing_up: :point_left:
(@arfon: I see the pre-review issue has been closed, but we are missing another reviewer; is that protocol?)
I have taken a first look at the repositories. Some notes:
- General:
shedis taken on PyPI, by https://github.com/Zac-HD/shed. It might be beneficial to use a unique name, such asshed-data,shed-streaming, etc to avoid conflicting in people's environments. Should the othersheddecide to publish on conda-forge, things will get weird (I was surprised this was not discussed there https://github.com/conda-forge/staged-recipes/pull/4804). - Installation instructions: Installing from the repo is not documented, only using the package from conda-forge. I think this should be added if people are to contribute or fork the project.
- I was able to install from the
requirements/files, but noted thatpip.txtis empty, which is probably not intended?
- I was able to install from the
- Automated tests: the tests don't run for me, possibly because
pytest-tornadois the wrong version. I note that you use an additional Conda channel on Travis,nsls2forge, which seems to be your own. - Automated tests: you do use Travis for CI, you might want to include a link to the Travis build page (for example their badge) https://travis-ci.org/github/xpdAcq/SHED
- Functionality documentation: The documentation page https://xpdacq.github.io/SHED/shed.html shows up as empty except for headers. It looks like documentation exists and there is just something wrong with autodoc.
- Documentation: The rest of the documentation is pretty sparse. There is little explanation of the purpose of the software; maybe some text can be copied from the paper.
- Tutorials: They build on NSLS-II data standards and the
blueskyframework, which might not be widely known/used outside of the Brookhaven National Laboratory. It also usesrapidz, which is a fork ofstreamzfrom the same group. The paper mentionsstreamzbutrapidzis actually what's used throughout the package, except for one example file (average.py). - Paper: The paper looks good, I sent some nit-picks at https://github.com/xpdAcq/SHED/pull/202
- References / State of the field: Maybe more references could be added to modern streaming dataflow system/models such as Timely dataflows which are popular recently? In addition, I think more details on what SHED adds compared to
daskandstreamz(that it builds upon) orraymight be interesting?
I will look into the code next.
I also note that the main contributors of the code are not watching the GitHub repository, which will probably cause problems in the case of external contributions or reports.
:wave: @remram44, please update us on how your review is going (this is an automated reminder).
I did :thinking:
Thanks for your hard work on this @remram44 . You are right, that somehow I wasn't watching the repo and I didn't get a notification so I missed all your hard work. I am now watching. Sorry, this is the first time I did this process at JOSS. We will work on the suggestions. @CJ-Wright Looping you in. I will do what I can. There are some things we may want to chat about.
:wave: @alesarrett βΒ you be willing to review this submission for JOSS? We carry out our checklist-driven reviews here in GitHub issues and follow these guidelines: https://joss.readthedocs.io/en/latest/reviewer_guidelines.html
We already have one reviewer (@remram44) and are looking to find a second.
The package under review is SHED: Streaming Heterogeneous Event Data Tracking with Provenance, a Python tool for streaming data from event models, while tracking data provenance and workflow execution.
Dear Arfon,
I would be willing to help with reviewing, but this is my paper, so it is not so appropriate for me to review it!
S
On Sun, Jun 27, 2021 at 4:59 AM Arfon Smith @.***> wrote:
π @alesarrett https://github.com/alesarrett β you be willing to review this submission for JOSS? We carry out our checklist-driven reviews here in GitHub issues and follow these guidelines: https://joss.readthedocs.io/en/latest/reviewer_guidelines.html
We already have one reviewer @.*** https://github.com/remram44) and are looking to find a second.
The package under review is SHED: Streaming Heterogeneous Event Data Tracking with Provenance https://github.com/xpdAcq/SHED, a Python tool for streaming data from event models, while tracking data provenance and workflow execution.
β You are receiving this because you commented. Reply to this email directly, view it on GitHub https://github.com/openjournals/joss-reviews/issues/3119#issuecomment-869127342, or unsubscribe https://github.com/notifications/unsubscribe-auth/ABAOWUPAYVWEAZZVSCHQVR3TU3SALANCNFSM4ZKDLD5A .
-- Simon Billinge Professor, Columbia University Physicist, Brookhaven National Laboratory
I would be willing to help with reviewing, but this is my paper, so it is not so appropriate for me to review it!
Of course! Unless I'm mistaken somehow, I was actually inviting @alesarrett to review here.
I will look into the code next.
@remram44 βΒ just checking in to see how you're getting on completing your review?
Thanks @arfon, I must say I haven't been prioritizing this since we are missing the second reviewer. I was also hoping some of my points in https://github.com/openjournals/joss-reviews/issues/3119#issuecomment-809781527 would be addressed, but that doesn't block me from running the software so I will do that ASAP.
π @arfon - Can you provide an update on this submission and review?
@CJ-Wright @sbillinge β I'm struggling to find a second reviewer for this submission. Could you take a look at this spreadsheet and identify some possible candidates?
Also, if there are people in your field who you think would be suitable, please mention them here too.
wave @alesarrett β you be willing to review this submission for JOSS? We carry out our checklist-driven reviews here in GitHub issues and follow these guidelines: https://joss.readthedocs.io/en/latest/reviewer_guidelines.html
We already have one reviewer (@remram44) and are looking to find a second.
The package under review is SHED: Streaming Heterogeneous Event Data Tracking with Provenance, a Python tool for streaming data from event models, while tracking data provenance and workflow execution.
Sorry to be so late in replying. Anyway, I'm not able to hep with this review, sorry.
Documenting some additional activity on my part here: I've just emailed @CJ-Wright & @sbillinge to ask them:
- To respond to @remram44's initial review feedback above
- To assist with finding a second reviewer.
@CJ-Wright & @sbillinge β it's close to 8 months since we've heard back from you about this submission. At this point I'm assuming you're no longer interested in pursuing this submission in JOSS.
If we don't hear back from you in the next week, I'll reject this submission due to lack of activity.
Sorry about this Arfon. My student CJ left the group with things in this state. I will check with him how we wants to proceed but he will have to commit to spending some time on it for it to work out.
S
On Fri, Feb 18, 2022 at 3:34 PM Arfon Smith @.***> wrote:
@CJ-Wright https://github.com/CJ-Wright & @sbillinge https://github.com/sbillinge β it's close to 8 months since we've heard back from you about this submission. At this point I'm assuming you're no longer interested in pursuing this submission in JOSS.
If we don't hear back from you in the next week, I'll reject this submission due to lack of activity.
β Reply to this email directly, view it on GitHub https://github.com/openjournals/joss-reviews/issues/3119#issuecomment-1045144122, or unsubscribe https://github.com/notifications/unsubscribe-auth/ABAOWUKACEBL4DYIA7PHFP3U32UNDANCNFSM4ZKDLD5A . Triage notifications on the go with GitHub Mobile for iOS https://apps.apple.com/app/apple-store/id1477376905?ct=notification-email&mt=8&pt=524675 or Android https://play.google.com/store/apps/details?id=com.github.android&referrer=utm_campaign%3Dnotification-email%26utm_medium%3Demail%26utm_source%3Dgithub.
You are receiving this because you were mentioned.Message ID: @.***>
-- Simon Billinge Professor, Columbia University Physicist, Brookhaven National Laboratory
Dear Arfon,
OK, so my current student Songsheng @st3107 who is also a contributor and maintainer of the code, and a coauthor on the paper, has agreed to take over from CJ to shepherd this home. We took a look at where everything is and from our perspective the things requested in @remram44's review so far are excellent suggestions and all quite do-able.
For a second reviewer, if it is not a confilct of interest, I think a person from the NSLS-II facility at Brookhaven National Lab where this code is deployed and in use at the XPD and PDF beamlines would be best. The best person there would be Dan Olds ([email protected]) as he is in charge of the software at PDF and very familiar with the code (though of course we work with him which may be a conflict issue). Other possibilities would be people who are familiar with the bluesky/event-model (https://github.com/bluesky/event-model) that SHED is built on, or bluesky in general (https://github.com/bluesky/) but not affiliated with BNL.
Dear @arfon and @remram44 . Thanks for your review and your patience. We have now responded to all the issues that were raised in @remram44 's initial review. I will post our responses below.
The second issue @arfon raised was to help finding a second referee. Do you want suggestions of names or do you want me to reach out directly to people? Do you have input on the conflict of issue aspects I raised above? The best balance would presumably be someone who is familiar with the event document-model but is not at Brookhaven National Lab. I am now collecting good names from the BNL developers and I can paste them below.
Here are responses to the @remram44 review points from above:
- [x] General: shed is taken on PyPI, by https://github.com/Zac-HD/shed. It might be beneficial to use a unique name, such as shed-data, shed-streaming, etc to avoid conflicting in people's environments. Should the other shed decide to publish on conda-forge, things will get weird (I was surprised this was not discussed there
We changed the project name to shed-streaming and the module name to shed_streaming. The new package is released in the conda-forge channel.
- [x] Installation instructions: Installing from the repo is not documented, only using the package from conda-forge. I think this should be added if people are to contribute or fork the project. I was able to install from the requirements/ files, but noted that pip.txt is empty, which is probably not intended?
We deleted the pip.txt since it is not used in the installation. All dependencies are installed using conda.
We add a bash script for users to install the package using the source repo and a section about how to use it in the installaion instructions.
- [x] Automated tests: the tests don't run for me, possibly because pytest-tornado is the wrong version. I note that you use an additional Conda channel on Travis, nsls2forge, which seems to be your own.
The issue is probably related to the dask version. We updated the tests so that the tests used asyncio together with the latest dask. Now, tests could be run without errors. Clone the repo and run the following commands to run the tests locally.
bash install.sh
conda activate <name of the environment>
python run_tests.py
The nsls2forge is a channel that used to be maintained by the NSLS-II DSSI group. Now, the shed (renamed as shed-streaming) with all its dependencies are released on the conda-forge channel.
- [x] Automated tests: you do use Travis for CI, you might want to include a link to the Travis build page (for example their badge) https://travis-ci.org/github/xpdAcq/SHED
We included the link to the travis build. Our group runs out of the credits and thus the CI cannot be run now. We have applied for more credits
- [x] Functionality documentation: The documentation page https://xpdacq.github.io/SHED/shed.html shows up as empty except for headers. It looks like documentation exists and there is just something wrong with autodoc.
We fixed the sphinx autodoc and published the new webpage.
- [x] Documentation: The rest of the documentation is pretty sparse. There is little explanation of the purpose of the software; maybe some text can be copied from the paper..
We copied the content in the paper into the introduction section of the documentation and it showed on the webpage.
- [x] Tutorials: They build on NSLS-II data standards and the bluesky framework, which might not be widely known/used outside of the Brookhaven National Laboratory. It also uses rapidz, which is a fork of streamz from the same group. The paper mentions streamz but rapidz is actually what's used throughout the package, except for one example file (average.py).
We changed the streamz to rapidz in the examples.
- [x] Paper: The paper looks good, I sent some nit-picks at JOSS paper: Minor grammar fixes xpdAcq/SHED#202
Grammar fixes have been merged.
- [x] References / State of the field: Maybe more references could be added to modern streaming dataflow system/models such as Timely dataflows which are popular recently? In addition, I think more details on what SHED adds compared to dask and streamz (that it builds upon) or ray might be interesting?
We added a reference to Naiad: A Timely Dataflow System.
What functionalities that SHED adds to the rapidz (forked from streamz) and dask are:
- Digest data stream in the event model schemas (https://nsls-ii.github.io/architecture-overview.html#event-based-architecture) and distribute the different data to corresponding consumer streams.
- Translate the data in a stream into the event model schemas and make it friendly to the database using the schemas.
- Serialize the data pipeline and rebuild the pipeline based on the serialized data.
- Replay the process that the data flow runs through the data pipeline after the data and the pipeline are loaded from the database._
Here are names of potential second reviewers:
- Dylan McReynolds (@dylanmcreynolds) at Advanced Light Source, Lawrence Berkeley National Laboratory (LBNL)
- Blaise Thompson (@untzag) University of Wisconson
- Hari Krishnan (@HarinarayanKrishnan) at LBNL
:wave: @dylanmcreynolds @untzag @HarinarayanKrishnan β you be willing to review this submission for JOSS? We carry out our checklist-driven reviews here in GitHub issues and follow these guidelines: https://joss.readthedocs.io/en/latest/reviewer_guidelines.html
We already have one reviewer (@remram44) and are looking to find a second.
The package under review is SHED: Streaming Heterogeneous Event Data Tracking with Provenance, a Python tool for streaming data from event models, while tracking data provenance and workflow execution.
Randomly saw the call on Twitter. I'm not familiar with Bluesky but I'm working on a similar project myself so I'm happy to review this. My project is for internal use only, no intention to publish or sell so I don't think there's a conflict of interest. I will be very busy in the next two weeks but after that I can start.