
Feature request: streaming API

Open data-man opened this issue 3 years ago • 3 comments

Thank you for the project!

It would be nice to have something like this (similar to xxhash):

komi_state_t* komi_createState(void);
komi_errorcode komi_freeState(komi_state_t* statePtr);
komi_errorcode komi_reset(komi_state_t* statePtr, uint64_t seed);
komi_errorcode komi_update(komi_state_t* statePtr, const void* input, size_t length);

data-man avatar Oct 06 '22 03:10 data-man

Hello! What are the use-cases for the streamed version? At this point I'm not very keen on producing a streamed version. I'm also thinking that concatenating the strings in a temporary buffer and then hashing the buffer may actually not be that much slower, because a streamed implementation may need to accumulate a part of the input anyway (the same memcpy). The input-data stitching logic may consume a lot of cycles if "update" is called several times over a series of small inputs.

avaneev avatar Oct 06 '22 04:10 avaneev

What are the use-cases for the streamed version?

Read data from a file chunk by chunk and calculate a hash of every chunk.

data-man avatar Oct 06 '22 04:10 data-man

Okay, I'll think about a better way to implement that. At the moment you may take a look at PRVHASH64S: https://github.com/avaneev/prvhash It offers about 8.4 GB/s streamed hashing throughput and can produce a hash value of any required size. Streamed komihash may be twice as fast, but considering storage-system throughput it's not that important.

avaneev avatar Oct 06 '22 05:10 avaneev

Hi! I've updated the project's page with info on "Sequential-Incremental Hashing". It can also be used to hash files, but it requires some fixed read block size (using different read block sizes will produce different hash values).

avaneev avatar Nov 02 '22 06:11 avaneev

Hello!

I've updated project's page with info on "Sequential-Incremental Hashing".

I think it's incorrect:

#include "komihash.h"
#include <stdio.h>

int main()
{
    const char* str1 = "123456";
    const char* str2 = "123";
    const char* str3 = "456";

    uint64_t h1 = komihash(str1, sizeof(str1), 0 );
    uint64_t h2 = komihash(str2, sizeof(str2), 0 );
    uint64_t h3 = komihash(str3, sizeof(str3), h2 );


    printf("h1: %x\n", h1);
    printf("h2: %x\n", h2);
    printf("h3: %x\n", h3);
}

h1: c84c16d8
h2: fbe4c41c
h3: 71481aea

data-man avatar Nov 21 '22 22:11 data-man

What do you mean "incorrect"?

By the way, you should write this way:

    uint64_t h1 = komihash(str1, sizeof(str1), 0 );
    uint64_t h2 = komihash(str2, sizeof(str2), h1 );
    uint64_t h3 = komihash(str3, sizeof(str3), h2 );

You missed passing h1 into the str2 call. Also, the hashes are 64-bit, but you are printing them with %x; you should use %llx.

avaneev avatar Nov 22 '22 02:11 avaneev

I meant that h3 should be equal to h1.

data-man avatar Nov 22 '22 02:11 data-man

You've misunderstood the concept. It's not streamed hashing. Sequential hashing assumes that the length of each item is also a part of the message.

avaneev avatar Nov 22 '22 03:11 avaneev

It's not a streamed hashing.

So I wonder why you closed this issue.

data-man avatar Nov 22 '22 03:11 data-man

You can hash all your files using blocks of e.g. 1024 bytes (except the last block, which may be shorter). It will be the same as streamed hashing for that given block size. Buffered streamed hashing is needed for "standardized" hashes; here, with komihash, I do not see such a need. Streamed hashing is also slower.

Also, from a database point of view, when several independent values are concatenated, this incremental approach is preferable to streamed hashing, since each item's length is encoded implicitly.

avaneev avatar Nov 22 '22 03:11 avaneev

I've implemented the streamed hashing after all, please check it out. It turned out to be a bit faster than the base komihash() function.

avaneev avatar Dec 06 '22 17:12 avaneev

Awesome, thank you!

data-man avatar Dec 06 '22 18:12 data-man