digest icon indicating copy to clipboard operation
digest copied to clipboard

digest refactoring for performance

Open pearsonca opened this issue 1 year ago • 5 comments
trafficstars

While working #210, I noticed opportunities for opportunites for consolidating / refactoring code in digest.c.

In some sense, digest.c is providing a Facade around interacting with the various quirks of the imported hashing libraries. It does some setup / reinterpretation of the input from R, uses a series of cases to handle each of the different algorithms, and then returns a consistent output.

It seems over time those cases have been internally implemented in slightly different ways (and I imagine, the hasher interfaces have changed over time after initial implementation). There is now a fair bit of repetitious code and/or inconsistent naming/etc that could be consolidated. This would make the assorted "Consider X" issues smaller lifts to accomplish, present opportunities for a bit of unit testing on the C side, benchmarking, etc - all the normal benefits of DRY.

I propose the following refactoring objectives:

  • [x] refactor macros to functions (per discussion in #210)
  • [ ] eliminate the duplicate code for streaming vs not approaches (e.g. by figuring out how to share the setup / return code, while switching between stream vs blob inputs)
  • [ ] more clearly distinguish "jobs" of various functions (e.g. I imagine digest would be as a more of traffic manager, with the individual hasher interface logic being modularized out; then adding a new hasher means adding a new consistently-surfaced-function + another case to digest and presto)
  • [ ] additional linting (e.g. spaces around binary operators)

Thoughts? Additional objectives? I think some of these (and possibly additional targets added via discussion) are entangled, but there's some potential to isolate at least some of these as their own PRs - preferences one way or another?

pearsonca avatar Aug 21 '24 09:08 pearsonca

I am in favour of 1).

I am a little scared / see more downside than upside for 2) and 3) --- things work as they are, we manage to add new hasher so why rock the boat

I have so far refrained from larger-scale "style" only overhauls. I like when git blame refers to original contributions. [ I left your two-spaces indented lines alone too. ] I guess on the margin in favour but let's not get overboard.

Too little to keep you excited?

eddelbuettel avatar Aug 21 '24 20:08 eddelbuettel

Can do a PR for just converting macros => functions, sure.

Then maybe draft something for one of the others, and see how it looks w/o (initially) super polishing it? If it seems promising, great, proceed to polish; if not, can always toss it.

pearsonca avatar Aug 21 '24 20:08 pearsonca

Porting discussion from closed PRs #215 / #218 here:

  • any overhead reductions in digest need maintain code readability
  • better approach may be to introduce no-frills functionality

What could potentially work for that second item:

  • offering a direct (R) interface for each of the algorithms, with minimal parsing / error checking
  • refactor digest to do the error handling etc, then dispatch to the individual mode algorithm functions (this would invert the current relationship with e.g. sha1 R exports, so non-trivial work to do)
  • longer term that would suggest wanting to have more individualized wrappers in the C layer, but that wouldn't need to happen immediately
  • if pursing that longer term approach: also potentially makes more direct-to-C options available for dependent packages, if more done there to export the C layer

aside: the longer term approach would be advantageous for enabling people to implement their class-specific hashing extensions ala #23 and for potentially having a digestext or similar package, that offers custom class-specific extensions without bloating the core library.

pearsonca avatar Aug 28 '24 13:08 pearsonca

Draft approach for a refactor that would allow users to skip some of the overhead optionally: https://github.com/eddelbuettel/digest/pull/220 - would appreciate some feedback there about those outlines before pursuing further.

pearsonca avatar Sep 18 '24 18:09 pearsonca

In the broken up version of #220 - next step is #222. Pending feedback, this is essentially encapsulating all the algorithms. The next would be to register all of them.

pearsonca avatar Sep 19 '24 19:09 pearsonca