go-guerrilla icon indicating copy to clipboard operation
go-guerrilla copied to clipboard

Use non cryptographic hash for unique ID instead of MD5

Open tegk opened this issue 6 years ago • 10 comments

I would like to discuss to replace md5 with a non cryptographic hash function like xxhash or murmur3 as they are much faster.

For example we use md5 to generate an ID for the queued envelop.

func queuedID(clientID uint64) string {
	return fmt.Sprintf("%x", md5.Sum([]byte(string(time.Now().Unix())+string(clientID))))
}

func NewEnvelope(remoteAddr string, clientID uint64) *Envelope {
	return &Envelope{
		RemoteIP: remoteAddr,
		Values:   make(map[string]interface{}),
		QueuedId: queuedID(clientID),
	}
}

If we agree on the usage of xxhash I can send a pull request.

tegk avatar Aug 08 '19 15:08 tegk

Sounds good.

Could this be configurable and controlled using the config?

An idea would be:

in the mail package, we define something like this:

var QueuedIDGenerator QuidGenerator

type QuidGenerator func(clientID uint64) string

func defaultQueuedID(clientID uint64) string {
	return fmt.Sprintf("%x", md5.Sum([]byte(string(time.Now().Unix())+string(clientID))))
}

// (add whatever other generator function here, eg murmur3)

// change queuedID to delegate to our desired function
func queuedID(clientID uint64) string {
	return QueuedIDGenerator(clientID)
}

and to the init() at the top of envelope.go, we add QueuedIDGenerator = defaultQueuedID

Then we can also modify api.go to add a convenience func that allows us set to whatever func we pass (but don't let it change it once the server is running)

Then use the API call itself in server.go to set it to whatever value is in the config.

How does that sound?

flashmob avatar Aug 15 '19 09:08 flashmob

oh, on second look - the value from queuedID() gets ignored in the Redis, SQL and GuerrillaDBRedis example processors. Forgot about, they overwrite this value. So maybe the above suggested way is not needed. Perhaps we can just queuedID() to use whatever is the fatster golang hashing function and make a comment that this would most likely be set by the processor anyway.

Btw, would prefer to use a hasing function that doesn't introduce new external dependency. Which one to pick? https://golang.org/pkg/hash/#pkg-subdirectories

flashmob avatar Aug 15 '19 09:08 flashmob

we add QueuedIDGenerator = defaultQueuedID

I think, its wrong way. Need to hardcode faster hashing function and all. Nobody need to configure it, all members who use this package - need fast and stable smtp server.

lord-alfred avatar Aug 15 '19 09:08 lord-alfred

we add QueuedIDGenerator = defaultQueuedID

I think, its wrong way. Need to hardcode faster hashing function and all. Nobody need to configure it, all members who use this package - need fast and stable smtp server.

I agree, this not something anybody really wants to configure. We should just replace MD5 with xxHash and leave the current logic in place.

Not introducing an external dependency is a fair reason not to do it though as there is not a non crypto hash function in the standard lib.

tegk avatar Aug 15 '19 10:08 tegk

Btw, would prefer to use a hasing function that doesn't introduce new external dependency

And what is the problem of adding a new dependency? Anyway need to compile the executable file.

lord-alfred avatar Aug 15 '19 10:08 lord-alfred

@lord-alfred new dependencies can add technical debt. Also, since this package is made to be used as a library, it's better if the user of the library can decide on their own dependencies. (And maintain them) while keeping our dependencies as small as possible.

@tegk please ignore my first comment. It appears that the queuedID function is actually somewhat redundant. The hashing has been delegated to some the backends, and you can see that the SQL processor sets its own queuedID value. So let's keep this issue open to fix this inconsistency.

As for any performance issues, it would be great if we could start using pprof tool to generate some graphs/reports where we can identify any bottlenecks. Right now, indentifying without profiling is kind of like shooting in the dark.

flashmob avatar Aug 15 '19 14:08 flashmob

Here is a benchmark of hashing strings 3 to 254 characters long with md5 and xxhash. 3 is the min and 254 max length of an email address.

The change to xxHash would make sense in terms of latency (-84.8% less!) for sure.

goos: linux
goarch: amd64
pkg: md5vsxxhash
BenchmarkMD5minEmailAddress-12    	 5000000	       279 ns/op	     136 B/op	       4 allocs/op
BenchmarkXXminEmailAddress-12     	20000000	        66.6 ns/op	      32 B/op	       1 allocs/op
BenchmarkMD5maxEmailAddress-12    	 2000000	       665 ns/op	     384 B/op	       4 allocs/op
BenchmarkXXmaxEmailAddress-12     	20000000	       101 ns/op	      32 B/op	       1 allocs/op
PASS
ok  	md5vsxxhash	7.262s


package main

import (
	"crypto/md5"
	"io"
	"math/rand"
	"strconv"
	"testing"

	"github.com/cespare/xxhash"
)

var letters = []rune("abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ")

var minEmailAddress string = randSeq(3)
var maxEmailAddress string = randSeq(254)

func randSeq(n int) string {
	b := make([]rune, n)
	for i := range b {
		b[i] = letters[rand.Intn(len(letters))]
	}
	return string(b)
}

func genMD5(s string) string {
	h := md5.New()
	io.WriteString(h, s)
	return string(h.Sum(nil))
}

func genXXhash(s string) string {
	h := xxhash.Sum64String(s)
	return strconv.FormatUint(h, 10)
}

func BenchmarkMD5minEmailAddress(b *testing.B) {
	b.ResetTimer()
	for n := 0; n < b.N; n++ {
		genMD5(minEmailAddress)
	}
}

func BenchmarkXXminEmailAddress(b *testing.B) {
	b.ResetTimer()
	for n := 0; n < b.N; n++ {
		genXXhash(minEmailAddress)
	}
}

func BenchmarkMD5maxEmailAddress(b *testing.B) {
	b.ResetTimer()
	for n := 0; n < b.N; n++ {
		genMD5(maxEmailAddress)
	}
}

func BenchmarkXXmaxEmailAddress(b *testing.B) {
	b.ResetTimer()
	for n := 0; n < b.N; n++ {
		genXXhash(maxEmailAddress)
	}
}

The results also are confirmed when bench-marking the hash backed:

goos: linux
goarch: amd64
pkg: md5vsxxhash
BenchmarkMD5hash-12    	    2000	    800720 ns/op	  304740 B/op	    3009 allocs/op
BenchmarkXXhash-12     	   10000	    150928 ns/op	  256604 B/op	    1006 allocs/op
PASS
ok  	md5vsxxhash	3.214s
package main

import (
	"crypto/md5"
	"fmt"
	"io"
	"math/rand"
	"strconv"
	"strings"
	"testing"
	"time"

	"github.com/cespare/xxhash"
)

var letters = []rune("abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ")

var maxEmailAddress string = randSeq(254)
var maxSubject string = randSeq(78)

func randSeq(n int) string {
	b := make([]rune, n)
	for i := range b {
		b[i] = letters[rand.Intn(len(letters))]
	}
	return string(b)
}

func genMD5(s string) string {
	h := md5.New()
	io.WriteString(h, s)
	return string(h.Sum(nil))
}

func genXXhash(s string) string {
	h := xxhash.Sum64String(s)
	return strconv.FormatUint(h, 10)
}

func genMD5forEmail() {
	h := md5.New()
	ts := fmt.Sprintf("%d", time.Now().UnixNano())
	_, _ = io.Copy(h, strings.NewReader(maxEmailAddress))
	_, _ = io.Copy(h, strings.NewReader(maxSubject))
	_, _ = io.Copy(h, strings.NewReader(ts))
	// using the base hash, calculate a unique hash for each recipient
	for i := 0; i < 1000; i++ {
		h2 := h
		_, _ = io.Copy(h2, strings.NewReader(maxEmailAddress))
		h2.Sum([]byte{})
	}
}

func genXXforEmail() {
	h := xxhash.New()
	ts := fmt.Sprintf("%d", time.Now().UnixNano())
	h.Write([]byte(maxEmailAddress))
	h.Write([]byte(maxSubject))
	h.Write([]byte(ts))
	// using the base hash, calculate a unique hash for each recipient
	for i := 0; i < 1000; i++ {
		h2 := h
		h2.Write([]byte(maxEmailAddress))
		h2.Sum64()
	}
}

func BenchmarkMD5hash(b *testing.B) {
	for n := 0; n < b.N; n++ {
		genMD5forEmail()
	}
}

func BenchmarkXXhash(b *testing.B) {
	for n := 0; n < b.N; n++ {
		genXXforEmail()
	}
}


tegk avatar Aug 15 '19 15:08 tegk

Not disagreeing with you there. Definitelly md5 is much slower.

Although, please note that in practice, it's only performed, in the extreme case, say 1000 times a second. And that's only on a reasonably small string. So it's probably not that much worry about.

On Fri., 16 Aug. 2019, 01:32 Till Knuesting, [email protected] wrote:

``

— You are receiving this because you commented. Reply to this email directly, view it on GitHub https://github.com/flashmob/go-guerrilla/issues/169?email_source=notifications&email_token=AAE6MP4AGKJHCDZSB5CPUS3QEWAI5A5CNFSM4IKMCWGKYY3PNVWWK3TUL52HS4DFVREXG43VMVBW63LNMVXHJKTDN5WW2ZLOORPWSZGOD4MJ5WA#issuecomment-521707224, or mute the thread https://github.com/notifications/unsubscribe-auth/AAE6MP5ZGEZ5NR2V2GBNCG3QEWAI5ANCNFSM4IKMCWGA .

flashmob avatar Aug 15 '19 17:08 flashmob

You can use xor too and go REALLY fast.. doesn't mean you should though. At least maybe warn users?

h1z1 avatar Feb 20 '20 16:02 h1z1

This value is actually just a filler because ideally a Message ID should be created by the backed.. Indeed, it could be changed to something simple and use things like the system date, client id and so on, no need for md5 there.

As for performance wise, It would be really good to use pprof to analyze this (and other parts) in the future.

On Fri, 21 Feb 2020, 01:01 h1z1, [email protected] wrote:

You can use xor too and go REALLY fast.. doesn't mean you should though. At least maybe warn users?

— You are receiving this because you commented. Reply to this email directly, view it on GitHub https://github.com/flashmob/go-guerrilla/issues/169?email_source=notifications&email_token=AAE6MP3QBXLTEH2IF2CBXP3RD2SL5A5CNFSM4IKMCWGKYY3PNVWWK3TUL52HS4DFVREXG43VMVBW63LNMVXHJKTDN5WW2ZLOORPWSZGOEMO4OTY#issuecomment-589154127, or unsubscribe https://github.com/notifications/unsubscribe-auth/AAE6MP2M3FXKKPJGSZNSHTDRD2SL5ANCNFSM4IKMCWGA .

flashmob avatar Feb 20 '20 22:02 flashmob