go_serialization_benchmarks icon indicating copy to clipboard operation
go_serialization_benchmarks copied to clipboard

Mus Strings

Open pascaldekloe opened this issue 1 year ago • 10 comments

Mus does not allocate the string. Instead, it slices from the input buffer (with unsafe.Slice). Changes to the buffer will modify string content. ⚠️

This benchmark creates a copy/instance for implementations which delay string creation, e.g., Cap'n Proto. Mus needs the same treatment for fair comparison.

pascaldekloe avatar Oct 30 '23 12:10 pascaldekloe

PRs are welcome.

alecthomas avatar Oct 30 '23 19:10 alecthomas

First, it should be said that we are talking about the unsafe package. Secondly, the behavior (unsafe behavior) you mentioned ⚠️ will be true not only for a string type, but also for any other data type to which an unsafe type conversion is applied. MUS is not the only one that uses this serialization method, it is also used by Bebop (however, the Benchmark_Bebop_200sc result is not marked as Unsafe), Gencode.

Maybe I don't understand something about the string type?

ymz-ncnk avatar Nov 21 '23 07:11 ymz-ncnk

Strings are interally backed by pointers, whereas integers are not. And as Go always passes by value that means using an unsafe method for integer conversion doesn't result in the output sharing memory with the input. (But this can be easily tested if you want to confirm; the unsafe method for integers is almost the same as a little endian deciding from the stdlib and may actually be slower depending on target architecture)

200sc avatar Nov 21 '23 13:11 200sc

Nothing wrong with unsafe. The problem is that the strings from Mus depend on the buffer, and the benchmark reuses the buffer @ymz-ncnk. You must either create a new buffer or create new strings for such zero copy setups.

pascaldekloe avatar Nov 21 '23 13:11 pascaldekloe

With this approach, we can get zero allocation deserialization. But it turns out that it is not very useful and we do not want to measure such cases?

You must either create a new buffer or create new strings for such zero copy setups.

Each time a new buffer is created in MarshalMUSUnsafe.

ymz-ncnk avatar Nov 21 '23 14:11 ymz-ncnk

Not about usefullness. Zero copy can be great if you only need partial data, thus allocate what you need. Cap'n Proto does the same with lazy getters. We simply need to include those (hidden) costs in the benchmark for good comparison.

pascaldekloe avatar Nov 21 '23 14:11 pascaldekloe

If some results were obtained using an unique feature, we can simply label them with the appropriate name. Wouldn't that solve the problem?

Perhaps over time, other competitors will also introduce this feature.

ymz-ncnk avatar Nov 21 '23 14:11 ymz-ncnk

You're not following the conversation @ymz-ncnk.

pascaldekloe avatar Nov 21 '23 14:11 pascaldekloe

Well, I'm trying to propose a solution. And there is already a similar issue - https://github.com/alecthomas/go_serialization_benchmarks/issues/120.

ymz-ncnk avatar Nov 21 '23 15:11 ymz-ncnk

It turns out that it's not enough to call a package and the results it produces unsafe. I'll try to improve the MUS documentation and also add a comment to the README of the benchmarks. Thank you.

ymz-ncnk avatar Nov 22 '23 14:11 ymz-ncnk

Mus does not allocate the string. Instead, it slices from the input buffer (with unsafe.Slice). Changes to the buffer will modify string content. ⚠️

This benchmark creates a copy/instance for implementations which delay string creation, e.g., Cap'n Proto. Mus needs the same treatment for fair comparison.

It's a correct issue. I mentioned this point with @ymz-ncnk however I cant prove my reason to him and I had to fix that bug for my personal project. By the way Mus-go is one of the bests in this field.

nazarifard avatar Jun 09 '24 16:06 nazarifard

Just to be clear. Let's look at two cases. First one:

for {
	bs := ReadData() // Each time ReadData() returns a new bs! We can choose 
	// this behavior if reading is slow or if processing the data is much more 
	// expensive than creating a new slice.
	go processData(bs) // Here we can unmarshal strings using unsafe package. If
	// we will NOT MODIFY bs THEMSELVES we will receive a good performance boost
	// and evrything will work as expected.
}

And another:

var bs []byte
for {
	bs = ReadData() // Each time ReadData() returns the same bs.
	processData(bs) // In this case, we must process the data sequentially and 
	// by using unsafe package we can archive mentioned behaviour - next read 
	// will modify already unmarshaled strings. But if we're done with the data 
	// until the next read (for example, saved it to disk), it may not be a 
	// problem.
}

You can use unsafe package or not, it all depends on your task. @nazarifard

ymz-ncnk avatar Jun 09 '24 18:06 ymz-ncnk

However @ymz-ncnk doesn't accept my point yet, But fortunately he fixed the bug and current version hasn't that bug.

old versions were using the following these three lines to move data from bs to v, that was wrong obviously! actually return input itself instead of cloning new similar output.

  c := bs[n : n+length] // old line wrong. 
  return unsafe_mod.String(&c[0], len(c)), n + length, nil

newer version fixed the problem and returns data by a correct way:

return unsafe_mod.String(&bs[n], length), l, nil

nazarifard avatar Jun 09 '24 22:06 nazarifard

Just to be clear. Let's look at two cases. First one:

for {
	bs := ReadData() // Each time ReadData() returns a new bs! We can choose 
	// this behavior if reading is slow or if processing the data is much more 
	// expensive than creating a new slice.
	go processData(bs) // Here we can unmarshal strings using unsafe package. If
	// we will NOT MODIFY bs THEMSELVES we will receive a good performance boost
	// and evrything will work as expected.
}

And another:

var bs []byte
for {
	bs = ReadData() // Each time ReadData() returns the same bs.
	processData(bs) // In this case, we must process the data sequentially and 
	// by using unsafe package we can archive mentioned behaviour - next read 
	// will modify already unmarshaled strings. But if we're done with the data 
	// until the next read (for example, saved it to disk), it may not be a 
	// problem.
}

You can use unsafe package or not, it all depends on your task. @nazarifard

As mentioned fortunately the issue is fixed by newer versions of Mus-go. But this logic is incorrect really. Based on this logic the following code implements a high performance deep copy function in go lang. ;)

func deepfake(a any) b any {
 return a
}

nazarifard avatar Jun 09 '24 22:06 nazarifard

However @ymz-ncnk doesn't accept my point yet, But fortunately he fixed the bug and current version hasn't that bug.

old versions were using the following these three lines to move data from bs to v, that was wrong obviously! actually return input itself instead of cloning new similar output.

c := bs[n : length+1]
 v = string(bs[n : length+1])
 return v, n + length, nil

newer version fixed the problem and returns data by a correct way:

return unsafe_mod.String(&bs[n], length), l, nil

The first one copies the byte slice and creates a string pointing to that copy. The second, creates a string that points to the given byte slice.

ymz-ncnk avatar Jun 09 '24 22:06 ymz-ncnk

Just to be clear. Let's look at two cases. First one:

for {
	bs := ReadData() // Each time ReadData() returns a new bs! We can choose 
	// this behavior if reading is slow or if processing the data is much more 
	// expensive than creating a new slice.
	go processData(bs) // Here we can unmarshal strings using unsafe package. If
	// we will NOT MODIFY bs THEMSELVES we will receive a good performance boost
	// and evrything will work as expected.
}

And another:

var bs []byte
for {
	bs = ReadData() // Each time ReadData() returns the same bs.
	processData(bs) // In this case, we must process the data sequentially and 
	// by using unsafe package we can archive mentioned behaviour - next read 
	// will modify already unmarshaled strings. But if we're done with the data 
	// until the next read (for example, saved it to disk), it may not be a 
	// problem.
}

You can use unsafe package or not, it all depends on your task. @nazarifard

As mentioned fortunately the issue is fixed by newer versions of Mus-go. But this logic is incorrect really. Based on this logic the following code implements a high performance deep copy function in go lang. ;)

func fakeCopy(a any) b any {
 return a
}

I literally do not understand you. If you find a bug, please open an issue with the code on how to reproduce it.

ymz-ncnk avatar Jun 09 '24 22:06 ymz-ncnk

@ymz-ncnk I am speaking exactly about the following line 63 of unsafe string.go Here this line of code was obviously wrong and fortunately its fixed in newer version > v0.1.3

  c := bs[n : n+length] // old line wrong. 
  return unsafe_mod.String(&c[0], len(c)), n + length, nil
return unsafe_mod.String(&bs[n], length), l, nil // current fixed correct code

near one year ago on "19 May 2023" I tried to show this problem to you. But finally I cant explain why this code should be refine, therefor I had to fix the code locally for my personal project.

History of our arguments is availiable from here UnMarshal interference As mentioned [by above link] UnMarshal interference the following code reproduced the bug:

var a, b String
a:="aaa"; a.Marshal(buf); a.UnMarshal(buf); //a="aaa"
b:="bbb"; b.marshal(buf); b.UnMarsal(buf); //a="bbb" !

nazarifard avatar Jun 10 '24 01:06 nazarifard

I don't want to disappoint you, but this is still true even now (and this is not how the unsafe package should be used):

func TestUnsafe(t *testing.T) {
	bs := make([]byte, 10)

	a := "aaa"
	unsafe.MarshalString(a, bs)
	a, _, _ = unsafe.UnmarshalString(bs)
	assert.Equal(a, "aaa", t)

	b := "bbb"
	unsafe.MarshalString(b, bs)
	b, _, _ = unsafe.UnmarshalString(bs)
	assert.Equal(b, "bbb", t)
	assert.Equal(a, "bbb", t) // !!!
}

If for some reason you don't like unsafe code or can't use it, there is an ord package that can marshal/unmarshal strings in a safe way:

func TestSafe(t *testing.T) {
	bs := make([]byte, 10)

	a := "aaa"
	ord.MarshalString(a, bs)
	a, _, _ = ord.UnmarshalString(bs)
	assert.Equal(a, "aaa", t)

	b := "bbb"
	ord.MarshalString(b, bs)
	b, _, _ = ord.UnmarshalString(bs)
	assert.Equal(b, "bbb", t)
	assert.Equal(a, "aaa", t) // !!!
}

@nazarifard

ymz-ncnk avatar Jun 10 '24 08:06 ymz-ncnk

A reasonable solution is to have two benchmarks and name them appropriately so that users can make distinguish between them. How about we stop discussing it and someone sends a PR?

alecthomas avatar Jun 10 '24 09:06 alecthomas

I don't want to disappoint you, but this is still true even now (and this is not how the unsafe package should be used):

func TestUnsafe(t *testing.T) {
	bs := make([]byte, 10)

	a := "aaa"
	unsafe.MarshalString(a, bs)
	a, _, _ = unsafe.UnmarshalString(bs)
	assert.Equal(a, "aaa", t)

	b := "bbb"
	unsafe.MarshalString(b, bs)
	b, _, _ = unsafe.UnmarshalString(bs)
	assert.Equal(b, "bbb", t)
	assert.Equal(a, "bbb", t) // !!!
}

If for some reason you don't like unsafe code or can't use it, there is an ord package that can marshal/unmarshal strings in a safe way:

func TestSafe(t *testing.T) {
	bs := make([]byte, 10)

	a := "aaa"
	ord.MarshalString(a, bs)
	a, _, _ = ord.UnmarshalString(bs)
	assert.Equal(a, "aaa", t)

	b := "bbb"
	ord.MarshalString(b, bs)
	b, _, _ = ord.UnmarshalString(bs)
	assert.Equal(b, "bbb", t)
	assert.Equal(a, "aaa", t) // !!!
}

@nazarifard

With all respect I think your understanding of unsafe is not sound here. you can compare the results of newer versions and old versions of mus-go. Both method must not change "a" when we aren't engaging with "a" at all. However fortunately current version of mus-go works fine and returns correct value in both safe and unsafe cases. on the opposite side old versions changed input bs argument unwantedly. It is obviously wrong function.

func deepfakeCopy(a any) b any {return a}

Do you think deepfakeCopy implemented unsafe copy really? Its a irrefusable incorrect implemention.

nazarifard avatar Jun 10 '24 09:06 nazarifard

A reasonable solution is to have two benchmarks and name them appropriately so that users can make distinguish between them. How about we stop discussing it and someone sends a PR?

We already have two distinct benchmarks - Benchmark_MUSUnsafe and Benchmark_MUS.

ymz-ncnk avatar Jun 10 '24 09:06 ymz-ncnk

Excellent, then I'm going to close and lock this. Feel free to continue the discussion elsewhere.

alecthomas avatar Jun 10 '24 09:06 alecthomas

There is nothing to discuss for now. The issue is fixed and last version works finely. We are only sharing our technical insight about a specific topic. By the way I love Mus-go really. It can be one of bests from my point of view. @ymz-ncnk @alecthomas

nazarifard avatar Jun 10 '24 09:06 nazarifard

@ymz-ncnk You consistently refuse to see any point made. Please stop messaging here or I will ban you.

pascaldekloe avatar Jun 10 '24 09:06 pascaldekloe