digest icon indicating copy to clipboard operation
digest copied to clipboard

encapsulate and expose core hashing algorithms

Open pearsonca opened this issue 1 year ago • 23 comments
trafficstars

This draft shows the notional extraction of the core hashing capability in prep for exposing the algorithms directly.

Whatever the precise implementation approach, to me it seems inevitable that the right approach is to pop each algorithm out of digest, and turn digest into a traffic manager (and in the next layer of this notional series of PRs: provide a direct approach to linking with the algorithms more directly).

This implements that pop-out via a macro to regularize the interface. In essense, that macro is forming a base class, with each macro invocation on the collected bits for each algorithm effectively declaring the subclasses. We could approach this style of solution via some of the C approaches to introducing a class system, but I think the macro approach is the cleanest. If we absolutely wanted a non-preprocessing version, then switching to C++ seems like the preferred approach.

If we're roughly comfortable with the macro approach, the remaining steps for this PR:

  • [ ] write core_XYZ for all of the algorithms
  • [ ] substitute those into digest, prune any vestigal code
  • [ ] write changelog

Optionally: introduce a stream_ALGO complementary to core_ALGO (and maybe rename core_ALGO to one_ALGO?). This would further consolidate digest / DRY out algorithm implementation

pearsonca avatar Sep 19 '24 19:09 pearsonca

Sound of a broken record: I am still not a fan of macros.

How about if you pick one algo you really like and want to use from source level, and implement it as a one off. We make that work. We see what time gain it gives us.

And then we stare long and hard at the code and see how we can with proper C functions, maybe inling and whatnot, but not macros. And if we really think we cannot do it with macros, and really think using them is better than a little code replication then we consider them?

Could you live with that approach?

eddelbuettel avatar Sep 19 '24 19:09 eddelbuettel

Sound of a broken record: I am still not a fan of macros.

How about if you pick one algo you really like and want to use from source level, and implement it as a one off. We make that work. We see what time gain it gives us.

And then we stare long and hard at the code and see how we can with proper C functions, maybe inling and whatnot, but not macros. And if we really think we cannot do it with macros, and really think using them is better than a little code replication then we consider them?

Could you live with that approach?

I think that could work as an approach.

But: floating another idea that doesn't use macros, while still a generic approach - thoughts on what's in here now? To play my own broken record: I'm very against writing code more than once when the same action is happening.

pearsonca avatar Sep 19 '24 21:09 pearsonca

I like the last commit a lot better.

BTW, and I just want to float this here for discussion, I have of course a preference for C++ over plain old C. I once floated a few macros from R in a package (with examples in the vignette. I like that digest is zero depends, but we could just copy a few of the declarations over if it helps, and we could rethink some mechanics as simple C++ tricks as we can mix in C++ into the package. New file src/direct_digest.cpp say. Interface functions exported by the R mechanism have of course still be C only using SEXPs etc but we can maybe use some helpers. Not urgent but in case you like the idea and want to ponder it.

Long story short: I like simple and focussed extra accessors to the available digest algos. We're not forced to replicate all the functionality of the digest::digest() R function. In fact, we can pick and choose, and md5 and xxH32 are a fine start. (Edit: added missing "all the")

eddelbuettel avatar Sep 19 '24 21:09 eddelbuettel

I like the last commit a lot better.

BTW, and I just want to float this here for discussion, I have of course a preference for C++ over plain old C. I once floated a few macros from R in a package (with examples in the vignette. I like that digest is zero depends, but we could just copy a few of the declarations over if it helps, and we could rethink some mechanics as simple C++ tricks as we can mix in C++ into the package. New file src/direct_digest.cpp say. Interface functions exported by the R mechanism have of course still be C only using SEXPs etc but we can maybe use some helpers. Not urgent but in case you like the idea and want to ponder it.

Long story short: I like simple and focussed extra accessors to the available digest algos. We're not forced to replicate functionality of the digest::digest() R function. In fact, we can pick and choose, and md5 and xxH32 are a fine start.

I'm going to do a bit of benchmarking in here to understand if this is better. I'm also going to try to put together some test - if you got a minute to think on what those should be, would appreciate any ideas, however rough. I'm thinking roughly that there should be some cases where digest == direct approaches, e.g. raw inputs. Seems like there should be some other items that are equivalent - e.g. a character vector vs the same raw bits. Any others?

I'm definitely open to mixed C and C++, will have a separately longer think about that.

pearsonca avatar Sep 19 '24 21:09 pearsonca

Screenshot from 2024-09-20 13-46-08

quick look w/ microbenchmark - the direct version for equivalent asks is substantially faster.

I'm not exactly sure how to capture these methods in testing, so have roughed in some ideas.

pearsonca avatar Sep 20 '24 17:09 pearsonca

(Maybe better than a screenshot is a simple autoplot(res) (or whatever the handy microbenchmark result plotter is called).)

eddelbuettel avatar Sep 20 '24 17:09 eddelbuettel

(Maybe better than a screenshot is a simple autoplot(res) (or whatever the handy microbenchmark result plotter is called).)

agree - was lazily trying to just leverage how i'm trying to use it in tinytest, and it was pretty striking obvious from the numbers in this case.

pearsonca avatar Sep 20 '24 18:09 pearsonca

Alright, seems like we agree this is basically in the right direction.

What needs doing on this PR?

I'd like to add tests confirming proper handling of the types that we currently think are being managed.

Do we want to deal with C-side serialization now or next? (edit: and the planned resolution here is copy over RApiSerialize? or introduce dependency?)

I think expose-all-the-algorithms can be after this (I think it definitely comes after serialization).

pearsonca avatar Sep 20 '24 18:09 pearsonca

I had a quick look at what I do in RcppRedis, and it is basically just one header add to its src/Redis.cpp (and LinkingTo in DESCRIPTION) and then basically whevever it is needed

    // redis set -- serializes to R internal format
    std::string set(std::string key, SEXP s) {

        // if raw, use as is else serialize to raw
        Rcpp::RawVector x = (TYPEOF(s) == RAWSXP) ? s : serializeToRaw(s);

and that should be it. It's a copy of R's own serialization (which R Core decides not to expose via public API, sadly). I think I also play tricks for cuteness and make the compilation conditional (or maybe that was in another package?) to keep digest at zero hard dependencies. We could also avoid the issue if we called serialize() from the R code but you likely want to save those function calls too.

I can look into adding this to your branch if you want me to.

eddelbuettel avatar Sep 20 '24 19:09 eddelbuettel

I had a quick look at what I do in RcppRedis, and it is basically just one header add to its src/Redis.cpp (and LinkingTo in DESCRIPTION) and then basically whevever it is needed

    // redis set -- serializes to R internal format
    std::string set(std::string key, SEXP s) {

        // if raw, use as is else serialize to raw
        Rcpp::RawVector x = (TYPEOF(s) == RAWSXP) ? s : serializeToRaw(s);

and that should be it. It's a copy of R's own serialization (which R Core decides not to expose via public API, sadly). I think I also play tricks for cuteness and make the compilation conditional (or maybe that was in another package?) to keep digest at zero hard dependencies. We could also avoid the issue if we called serialize() from the R code but you likely want to save those function calls too.

I can look into adding this to your branch if you want me to.

Should just work for you do it.

One other item: what's the standard for documenting out registered C methods? (related: is there some autodocumentation / NAMESPACE management I'm missing here, or is it manual?)

pearsonca avatar Sep 20 '24 19:09 pearsonca

or is it manual?

Might be in this package as I am conservative esp in old and working packages. In some other newer packages I do rely on roxygen2 and its @export tag.

eddelbuettel avatar Sep 20 '24 19:09 eddelbuettel

Turns out I may have been sloppy and or lazy in RApiSerialize and it can only be included from C++ code :grinning: so I added src/serializing.cpp and we now have another C++ file. It is tiny.



#include <RApiSerializeAPI.h>   	// provides C API with serialization for R

extern "C" SEXP serialize_to_raw(SEXP inp) {
    return serializeToRaw(inp, Rf_ScalarInteger(2), Rf_ScalarLogical(1));
}

extern "C" SEXP unserialize_from_raw(SEXP inp) {
    return unserializeFromRaw(inp);
}

It needs

modified   DESCRIPTION
@@ -54,7 +54,8 @@ Description: Implementation of a function 'digest()' for the creation of hash
 URL: https://github.com/eddelbuettel/digest, https://dirk.eddelbuettel.com/code/digest.html
 BugReports: https://github.com/eddelbuettel/digest/issues
 Depends: R (>= 3.3.0)
-Imports: utils
+LinkingTo: RApiSerialize
+Imports: utils, RApiSerialize
 License: GPL (>= 2)
 Suggests: tinytest, simplermarkdown, microbenchmark
 VignetteBuilder: simplermarkdown

~but it also needs library(RApiSerialize). I think that can be circumvented by triggering a symbol import in our init.c.~ It also needs import(RApiSerialize) in NAMESPACE

Then we can do

edd@rob:~/git/digest-personca(core_macro)$ Rscript -e 'library(digest); jenny <- as.integer(c(8,6,7,5,3,0,9)); .Call("serialize_to_raw", jenny, package="digest")'
 [1] 58 0a 00 00 00 02 00 04 04 01 00 02 03 00 00 00 00 0d 00 00 00 07 00 00 00 08 00 00 00 06 00 00 00 07 00 00 00 05 00 00 00 03 00 00 00 00 00 00 00 09
edd@rob:~/git/digest-personca(core_macro)$ Rscript -e 'library(digest); jenny <- as.integer(c(8,6,7,5,3,0,9)); .Call("unserialize_from_raw", .Call("serialize_to_raw", jenny, package="digest"), package="digest")'
[1] 8 6 7 5 3 0 9
edd@rob:~/git/digest-personca(core_macro)$ 

As an aside, you still seem to have one signedness issue in your branch:

digest.c: In function ‘_core_XXH32’:
digest.c:227:16: warning: pointer targets in passing argument 1 of ‘rev_memcpy’ differ in signedness [-Wpointer-sign]
  227 |     rev_memcpy(output, &val, sizeof(XXH32_hash_t));
      |                ^~~~~~
      |                |
      |                unsigned char *
digest.c:139:23: note: expected ‘char *’ but argument is of type ‘unsigned char *’
  139 | void rev_memcpy(char *dst, const void *src, int len) {
      |                 ~~~~~~^~~

All this reminds me how unpleasant it is to do work in C. Maybe we should think about this differently and

  • just export the C functions from digest
  • set up a new tiny package that takes them
  • and quickly digest at native level

And we likely would want to do that in C++.

(I can send a PR to your branch if you want.)

eddelbuettel avatar Sep 21 '24 15:09 eddelbuettel

Made a minor cleanup. Now

$ Rscript -e 'library(digest); jenny <- as.integer(c(8,6,7,5,3,0,9)); .Call("serialize_to_raw", jenny)'
 [1] 58 0a 00 00 00 02 00 04 04 01 00 02 03 00 00 00 00 0d 00 00 00 07 00 00 00 08 00 00 00 06 00 00 00 07 00 00 00 05 00 00 00 03 00 00 00 00 00 00 00 09
$ Rscript -e 'library(digest); jenny <- as.integer(c(8,6,7,5,3,0,9)); .Call("unserialize_from_raw", .Call("serialize_to_raw", jenny))'
[1] 8 6 7 5 3 0 9
$ 

eddelbuettel avatar Sep 23 '24 16:09 eddelbuettel

Made a minor cleanup. Now

$ Rscript -e 'library(digest); jenny <- as.integer(c(8,6,7,5,3,0,9)); .Call("serialize_to_raw", jenny)'
 [1] 58 0a 00 00 00 02 00 04 04 01 00 02 03 00 00 00 00 0d 00 00 00 07 00 00 00 08 00 00 00 06 00 00 00 07 00 00 00 05 00 00 00 03 00 00 00 00 00 00 00 09
$ Rscript -e 'library(digest); jenny <- as.integer(c(8,6,7,5,3,0,9)); .Call("unserialize_from_raw", .Call("serialize_to_raw", jenny))'
[1] 8 6 7 5 3 0 9
$ 

looks good; I think the settings are such that you should just be able to commit to this branch - please try that?

pearsonca avatar Sep 23 '24 16:09 pearsonca

I think so too as it is a branch with a PR into my repo so AFAIUI github lets me. [ Commits, pushes. ] And it does.

I am starting to think that maybe a cleaner, lighter, simpler, ... approach would be to

  • export the hashers we want from digest
  • setup a small and simple 'direct_digest' package up that pulls them, and RApiSerialize, in

eddelbuettel avatar Sep 23 '24 17:09 eddelbuettel

I think so too as it is a branch with a PR into my repo so AFAIUI github lets me. [ Commits, pushes. ] And it does.

I am starting to think that maybe a cleaner, lighter, simpler, ... approach would be to

  • export the hashers we want from digest
  • setup a small and simple 'direct_digest' package up that pulls them, and RApiSerialize, in

Along those lines, I was thinking something like digest_core (no or minimal R interface, reused core bits) => digest_XYZ (particular hashers, can potentially be independently implemented / maintained) => digest (unified interface matching historical standard).

That's an overhaul that provides flexibility for the future, while maintiaining backwards compatibility.

pearsonca avatar Sep 23 '24 17:09 pearsonca

I think I would prefer to leave the digest R and C functions in the digest R package alone. They. Work.

But we can and should re-use the infrastructure they already provide, and that is widely used. It would take very little to export, say, your _core_XXH32() and _core_MD5(). I would simply stick a SEXP in as first argument, otherwise forgo the unified interface (no seed used in md5, I think).

By exporting these, and using RApiSerialize, we have what started this: the basis to deliver trusted hashing a little faster and we 'remain below the surface of the water' and talk C(++) to C(++) directly reusing the object code from the digest package.

eddelbuettel avatar Sep 23 '24 18:09 eddelbuettel

We are having side-effects. I filed a (in hindsight: non-) bug report as digest::digest misbehaved: https://github.com/HenrikBengtsson/future/issues/740

Call that another vote for leaving digest alone and trying the 'direct' mods in a standalone package.

eddelbuettel avatar Sep 24 '24 14:09 eddelbuettel

We are having side-effects. I filed a (in hindsight: non-) bug report as digest::digest misbehaved: HenrikBengtsson/future#740

Call that another vote for leaving digest alone and trying the 'direct' mods in a standalone package.

Heard - I'll add this to the TODO list.

I think that makes this PR a no-go - anything in here we do want to preserve?

pearsonca avatar Sep 27 '24 13:09 pearsonca

I think that makes this PR a no-go

I was a bit surprised to see breakage. Which went away in the main branch. I haven't time or energy to chase where it came from though :disappointed:

In other news I did make a clean-up in RApiSerialize, realizing that what mean to be a C interface is only a C++ interface (optional args issue). So I made a more official (optional) C++ interface, this should be of no effect to the two client packages. And I liked that experience. See the change file and a bit of discussion in an issue.

eddelbuettel avatar Sep 27 '24 13:09 eddelbuettel

I think that makes this PR a no-go

I was a bit surprised to see breakage. Which went away in the main branch. I haven't time or energy to chase where it came from though 😞

In other news I did make a clean-up in RApiSerialize, realizing that what mean to be a C interface is only a C++ interface (optional args issue). So I made a more official (optional) C++ interface, this should be of no effect to the two client packages. And I liked that experience. See the change file and a bit of discussion in an issue.

Interesting, but a bit confused: I didn't immediately see anything in serialize.cpp that was actual cpp? If there isn't any, seems like that could just be c, and then work fine with either c or cpp?

Also, this block seems to be missing a branch for no long vector support?

    #ifdef LONG_VECTOR_SUPPORT
        if(needed < 10000000) /* ca 10MB */
            needed = (1+2*needed/INCR) * INCR;
        else 
            needed = (R_xlen_t)((1+1.2*(double)needed/INCR) * INCR);
    #else
        if(needed < 10000000) /* ca 10MB */
            needed = (1+2*needed/INCR) * INCR;
        else if(needed < 1700000000) /* close to 2GB/1.2 */
            needed = (R_xlen_t)((1+1.2*(double)needed/INCR) * INCR);
        else if(needed < INT_MAX - INCR)
            needed = (1+needed/INCR) * INCR;
    #endif

pearsonca avatar Sep 27 '24 14:09 pearsonca

That is a standard usage pattern of C++ code calling into C code. Given that R itself is only C (and Fortran) it is what we do all the time. RApiSerialize does not claim it did C++ internally, I just made it a little easier / cleaner to call it from C++. And in fact RcppRedis (and qs by Travers) had called from C++ all those years. Into the C language FFI.

And yep, once you drill down, it is in fact possible to find bad pothole. RApiSerialize is a static copy of code from base R (as they do not want to expose the interface) so I may have taken it at a time when long vectors were/are limited. There is alway scope for an ALTREP extension but that is then an issue in package RApiSerialize. I am also overdue a check against what R itself currently has. It has not been an issue in qs or RcppRedis as best as I know. And last we bantered here, you said your use case was many small vectors ... That said, extending to long vectors is of course nice too.

eddelbuettel avatar Sep 27 '24 14:09 eddelbuettel

PS Looks like it is still pretty much the same code in R itself - lines 2811 to 2835 here (and GH is wonky and won't let me create a link now): https://github.com/r-devel/r-svn/blob/main/src/main/serialize.c

eddelbuettel avatar Sep 27 '24 14:09 eddelbuettel

Similar to #220 I am going to close this. Happy to revisit as stated there.

eddelbuettel avatar Oct 04 '25 01:10 eddelbuettel