digest
digest copied to clipboard
digest refactoring for performance
While working #210, I noticed opportunities for opportunites for consolidating / refactoring code in digest.c.
In some sense, digest.c is providing a Facade around interacting with the various quirks of the imported hashing libraries. It does some setup / reinterpretation of the input from R, uses a series of cases to handle each of the different algorithms, and then returns a consistent output.
It seems over time those cases have been internally implemented in slightly different ways (and I imagine, the hasher interfaces have changed over time after initial implementation). There is now a fair bit of repetitious code and/or inconsistent naming/etc that could be consolidated. This would make the assorted "Consider X" issues smaller lifts to accomplish, present opportunities for a bit of unit testing on the C side, benchmarking, etc - all the normal benefits of DRY.
I propose the following refactoring objectives:
- [x] refactor macros to functions (per discussion in #210)
- [ ] eliminate the duplicate code for streaming vs not approaches (e.g. by figuring out how to share the setup / return code, while switching between stream vs blob inputs)
- [ ] more clearly distinguish "jobs" of various functions (e.g. I imagine
digestwould be as a more of traffic manager, with the individual hasher interface logic being modularized out; then adding a new hasher means adding a new consistently-surfaced-function + another case to digest and presto) - [ ] additional linting (e.g. spaces around binary operators)
Thoughts? Additional objectives? I think some of these (and possibly additional targets added via discussion) are entangled, but there's some potential to isolate at least some of these as their own PRs - preferences one way or another?
I am in favour of 1).
I am a little scared / see more downside than upside for 2) and 3) --- things work as they are, we manage to add new hasher so why rock the boat
I have so far refrained from larger-scale "style" only overhauls. I like when git blame refers to original contributions. [ I left your two-spaces indented lines alone too. ] I guess on the margin in favour but let's not get overboard.
Too little to keep you excited?
Can do a PR for just converting macros => functions, sure.
Then maybe draft something for one of the others, and see how it looks w/o (initially) super polishing it? If it seems promising, great, proceed to polish; if not, can always toss it.
Porting discussion from closed PRs #215 / #218 here:
- any overhead reductions in digest need maintain code readability
- better approach may be to introduce no-frills functionality
What could potentially work for that second item:
- offering a direct (R) interface for each of the algorithms, with minimal parsing / error checking
- refactor digest to do the error handling etc, then dispatch to the individual mode algorithm functions (this would invert the current relationship with e.g.
sha1R exports, so non-trivial work to do) - longer term that would suggest wanting to have more individualized wrappers in the C layer, but that wouldn't need to happen immediately
- if pursing that longer term approach: also potentially makes more direct-to-C options available for dependent packages, if more done there to export the C layer
aside: the longer term approach would be advantageous for enabling people to implement their class-specific hashing extensions ala #23 and for potentially having a digestext or similar package, that offers custom class-specific extensions without bloating the core library.
Draft approach for a refactor that would allow users to skip some of the overhead optionally: https://github.com/eddelbuettel/digest/pull/220 - would appreciate some feedback there about those outlines before pursuing further.
In the broken up version of #220 - next step is #222. Pending feedback, this is essentially encapsulating all the algorithms. The next would be to register all of them.