Use non cryptographic hash for unique ID instead of MD5
I would like to discuss to replace md5 with a non cryptographic hash function like xxhash or murmur3 as they are much faster.
For example we use md5 to generate an ID for the queued envelop.
func queuedID(clientID uint64) string {
return fmt.Sprintf("%x", md5.Sum([]byte(string(time.Now().Unix())+string(clientID))))
}
func NewEnvelope(remoteAddr string, clientID uint64) *Envelope {
return &Envelope{
RemoteIP: remoteAddr,
Values: make(map[string]interface{}),
QueuedId: queuedID(clientID),
}
}
If we agree on the usage of xxhash I can send a pull request.
Sounds good.
Could this be configurable and controlled using the config?
An idea would be:
in the mail package, we define something like this:
var QueuedIDGenerator QuidGenerator
type QuidGenerator func(clientID uint64) string
func defaultQueuedID(clientID uint64) string {
return fmt.Sprintf("%x", md5.Sum([]byte(string(time.Now().Unix())+string(clientID))))
}
// (add whatever other generator function here, eg murmur3)
// change queuedID to delegate to our desired function
func queuedID(clientID uint64) string {
return QueuedIDGenerator(clientID)
}
and to the init() at the top of envelope.go, we add QueuedIDGenerator = defaultQueuedID
Then we can also modify api.go to add a convenience func that allows us set to whatever func we pass (but don't let it change it once the server is running)
Then use the API call itself in server.go to set it to whatever value is in the config.
How does that sound?
oh, on second look - the value from queuedID() gets ignored in the Redis, SQL and GuerrillaDBRedis example processors. Forgot about, they overwrite this value. So maybe the above suggested way is not needed. Perhaps we can just queuedID() to use whatever is the fatster golang hashing function
and make a comment that this would most likely be set by the processor anyway.
Btw, would prefer to use a hasing function that doesn't introduce new external dependency. Which one to pick? https://golang.org/pkg/hash/#pkg-subdirectories
we add QueuedIDGenerator = defaultQueuedID
I think, its wrong way. Need to hardcode faster hashing function and all. Nobody need to configure it, all members who use this package - need fast and stable smtp server.
we add QueuedIDGenerator = defaultQueuedID
I think, its wrong way. Need to hardcode faster hashing function and all. Nobody need to configure it, all members who use this package - need fast and stable smtp server.
I agree, this not something anybody really wants to configure. We should just replace MD5 with xxHash and leave the current logic in place.
Not introducing an external dependency is a fair reason not to do it though as there is not a non crypto hash function in the standard lib.
Btw, would prefer to use a hasing function that doesn't introduce new external dependency
And what is the problem of adding a new dependency? Anyway need to compile the executable file.
@lord-alfred new dependencies can add technical debt. Also, since this package is made to be used as a library, it's better if the user of the library can decide on their own dependencies. (And maintain them) while keeping our dependencies as small as possible.
@tegk please ignore my first comment. It appears that the queuedID function is actually somewhat redundant. The hashing has been delegated to some the backends, and you can see that the SQL processor sets its own queuedID value. So let's keep this issue open to fix this inconsistency.
As for any performance issues, it would be great if we could start using pprof tool to generate some graphs/reports where we can identify any bottlenecks. Right now, indentifying without profiling is kind of like shooting in the dark.
Here is a benchmark of hashing strings 3 to 254 characters long with md5 and xxhash. 3 is the min and 254 max length of an email address.
The change to xxHash would make sense in terms of latency (-84.8% less!) for sure.
goos: linux
goarch: amd64
pkg: md5vsxxhash
BenchmarkMD5minEmailAddress-12 5000000 279 ns/op 136 B/op 4 allocs/op
BenchmarkXXminEmailAddress-12 20000000 66.6 ns/op 32 B/op 1 allocs/op
BenchmarkMD5maxEmailAddress-12 2000000 665 ns/op 384 B/op 4 allocs/op
BenchmarkXXmaxEmailAddress-12 20000000 101 ns/op 32 B/op 1 allocs/op
PASS
ok md5vsxxhash 7.262s
package main
import (
"crypto/md5"
"io"
"math/rand"
"strconv"
"testing"
"github.com/cespare/xxhash"
)
var letters = []rune("abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ")
var minEmailAddress string = randSeq(3)
var maxEmailAddress string = randSeq(254)
func randSeq(n int) string {
b := make([]rune, n)
for i := range b {
b[i] = letters[rand.Intn(len(letters))]
}
return string(b)
}
func genMD5(s string) string {
h := md5.New()
io.WriteString(h, s)
return string(h.Sum(nil))
}
func genXXhash(s string) string {
h := xxhash.Sum64String(s)
return strconv.FormatUint(h, 10)
}
func BenchmarkMD5minEmailAddress(b *testing.B) {
b.ResetTimer()
for n := 0; n < b.N; n++ {
genMD5(minEmailAddress)
}
}
func BenchmarkXXminEmailAddress(b *testing.B) {
b.ResetTimer()
for n := 0; n < b.N; n++ {
genXXhash(minEmailAddress)
}
}
func BenchmarkMD5maxEmailAddress(b *testing.B) {
b.ResetTimer()
for n := 0; n < b.N; n++ {
genMD5(maxEmailAddress)
}
}
func BenchmarkXXmaxEmailAddress(b *testing.B) {
b.ResetTimer()
for n := 0; n < b.N; n++ {
genXXhash(maxEmailAddress)
}
}
The results also are confirmed when bench-marking the hash backed:
goos: linux
goarch: amd64
pkg: md5vsxxhash
BenchmarkMD5hash-12 2000 800720 ns/op 304740 B/op 3009 allocs/op
BenchmarkXXhash-12 10000 150928 ns/op 256604 B/op 1006 allocs/op
PASS
ok md5vsxxhash 3.214s
package main
import (
"crypto/md5"
"fmt"
"io"
"math/rand"
"strconv"
"strings"
"testing"
"time"
"github.com/cespare/xxhash"
)
var letters = []rune("abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ")
var maxEmailAddress string = randSeq(254)
var maxSubject string = randSeq(78)
func randSeq(n int) string {
b := make([]rune, n)
for i := range b {
b[i] = letters[rand.Intn(len(letters))]
}
return string(b)
}
func genMD5(s string) string {
h := md5.New()
io.WriteString(h, s)
return string(h.Sum(nil))
}
func genXXhash(s string) string {
h := xxhash.Sum64String(s)
return strconv.FormatUint(h, 10)
}
func genMD5forEmail() {
h := md5.New()
ts := fmt.Sprintf("%d", time.Now().UnixNano())
_, _ = io.Copy(h, strings.NewReader(maxEmailAddress))
_, _ = io.Copy(h, strings.NewReader(maxSubject))
_, _ = io.Copy(h, strings.NewReader(ts))
// using the base hash, calculate a unique hash for each recipient
for i := 0; i < 1000; i++ {
h2 := h
_, _ = io.Copy(h2, strings.NewReader(maxEmailAddress))
h2.Sum([]byte{})
}
}
func genXXforEmail() {
h := xxhash.New()
ts := fmt.Sprintf("%d", time.Now().UnixNano())
h.Write([]byte(maxEmailAddress))
h.Write([]byte(maxSubject))
h.Write([]byte(ts))
// using the base hash, calculate a unique hash for each recipient
for i := 0; i < 1000; i++ {
h2 := h
h2.Write([]byte(maxEmailAddress))
h2.Sum64()
}
}
func BenchmarkMD5hash(b *testing.B) {
for n := 0; n < b.N; n++ {
genMD5forEmail()
}
}
func BenchmarkXXhash(b *testing.B) {
for n := 0; n < b.N; n++ {
genXXforEmail()
}
}
Not disagreeing with you there. Definitelly md5 is much slower.
Although, please note that in practice, it's only performed, in the extreme case, say 1000 times a second. And that's only on a reasonably small string. So it's probably not that much worry about.
On Fri., 16 Aug. 2019, 01:32 Till Knuesting, [email protected] wrote:
``
— You are receiving this because you commented. Reply to this email directly, view it on GitHub https://github.com/flashmob/go-guerrilla/issues/169?email_source=notifications&email_token=AAE6MP4AGKJHCDZSB5CPUS3QEWAI5A5CNFSM4IKMCWGKYY3PNVWWK3TUL52HS4DFVREXG43VMVBW63LNMVXHJKTDN5WW2ZLOORPWSZGOD4MJ5WA#issuecomment-521707224, or mute the thread https://github.com/notifications/unsubscribe-auth/AAE6MP5ZGEZ5NR2V2GBNCG3QEWAI5ANCNFSM4IKMCWGA .
You can use xor too and go REALLY fast.. doesn't mean you should though. At least maybe warn users?
This value is actually just a filler because ideally a Message ID should be created by the backed.. Indeed, it could be changed to something simple and use things like the system date, client id and so on, no need for md5 there.
As for performance wise, It would be really good to use pprof to analyze this (and other parts) in the future.
On Fri, 21 Feb 2020, 01:01 h1z1, [email protected] wrote:
You can use xor too and go REALLY fast.. doesn't mean you should though. At least maybe warn users?
— You are receiving this because you commented. Reply to this email directly, view it on GitHub https://github.com/flashmob/go-guerrilla/issues/169?email_source=notifications&email_token=AAE6MP3QBXLTEH2IF2CBXP3RD2SL5A5CNFSM4IKMCWGKYY3PNVWWK3TUL52HS4DFVREXG43VMVBW63LNMVXHJKTDN5WW2ZLOORPWSZGOEMO4OTY#issuecomment-589154127, or unsubscribe https://github.com/notifications/unsubscribe-auth/AAE6MP2M3FXKKPJGSZNSHTDRD2SL5ANCNFSM4IKMCWGA .