
storing a string Go value into a bytea PostgreSQL column works, but reading it back fails

Open mitar opened this issue 1 year ago • 14 comments

Is your feature request related to a problem? Please describe.

With recent Go improvements that allow converting strings to and from []byte without copying, I would like storing strings into a bytea column to work transparently, with as little copying as possible.

I am writing a key-value store on top of PostgreSQL and I would like to simplify storing built-in strings so that users of my library do not have to convert them to []byte.

Describe the solution you'd like

It looks like storing a string Go value already works, but reading it back fails with:

cannot scan bytea (OID 17) in binary format into *string

I do not see much reason why scanning bytea into a string would not just work for standard types with pgx. Go allows conversion between string and []byte with a simple cast, so I would hope that reading a value from the database back into a string would work the same.

Describe alternatives you've considered

My understanding is that I could register my own type codec, but I could not find much information about how to extend an existing codec: I would like the default bytea handling to stay the same, only with strings additionally scanned, without copying.

mitar avatar Mar 31 '24 12:03 mitar

I suspect there are unresolvable ambiguities regarding string <> bytea.

Writing the string to a PostgreSQL bytea at least appears to work because pgx sends any string to PostgreSQL in the text format without any conversion. pgx does the same thing when reading. Any text formatted value can be scanned directly into a string without any conversion.

But what happens when you read the value? In the text format, PostgreSQL will return the value hex encoded, but in binary you will get back the raw bytes.

For example: if you write the 4 bytes jack to a bytea you will get back the ten bytes \x6a61636b in the text format but the original 4 bytes in binary format.
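The encoding can be verified in plain Go with the standard library's encoding/hex (illustrative only; the helper name textFormat is made up here):

```go
package main

import (
	"encoding/hex"
	"fmt"
)

// textFormat mimics how PostgreSQL renders a bytea value in the text
// format: hex encoded with a leading \x.
func textFormat(raw []byte) string {
	return "\\x" + hex.EncodeToString(raw)
}

func main() {
	raw := []byte("jack") // 4 bytes in the binary format
	text := textFormat(raw)
	fmt.Println(text, len(text)) // \x6a61636b 10
}
```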

Even working directly in psql it can be confusing dealing with strings in bytea. I recommend that you instead use text or varchar if possible.

My understanding is that I could register my own type codec, but I could not find much information about how to extend an existing codec: I would like the default bytea handling to stay the same, only with strings additionally scanned, without copying.

If you really need/want to do this, you probably don't have to create a new Codec. I think you would only need to create your own type that implements pgtype.BytesScanner and pgtype.BytesValuer.
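A rough sketch of such a type could look like the following. The method set is written to match pgtype.BytesScanner (ScanBytes(v []byte) error) and pgtype.BytesValuer (BytesValue() ([]byte, error)) from pgx v5; the pgtype import is left out so the sketch stays self-contained, and the type name stringBytes is made up for illustration:

```go
package main

import "fmt"

// stringBytes lets a string be scanned from, and sent as, raw bytea bytes.
// A *stringBytes can be used as a Scan destination and a stringBytes as a
// query argument, assuming it satisfies pgtype.BytesScanner/BytesValuer.
type stringBytes string

// ScanBytes copies the driver's buffer into a fresh string. The copy
// matters: pgx reuses its read buffer between rows.
func (s *stringBytes) ScanBytes(v []byte) error {
	*s = stringBytes(v)
	return nil
}

// BytesValue returns the raw bytes to send for a bytea parameter.
func (s stringBytes) BytesValue() ([]byte, error) {
	return []byte(s), nil
}

func main() {
	var s stringBytes
	_ = s.ScanBytes([]byte("jack"))
	b, _ := s.BytesValue()
	fmt.Println(string(s), string(b))
}
```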

jackc avatar Apr 14 '24 02:04 jackc

Thanks for the response.

If you really need/want to do this, you probably don't have to create a new Codec. I think you would only need to create your own type that implements pgtype.BytesScanner and pgtype.BytesValuer.

But that would require one to scan into this custom type, not string itself.

For string itself, it seems like I would have to unregister ByteaCodec and register a custom one? Am I mistaken? Or is there an easier way? Maybe just using a TryWrapScanPlanFuncs?

mitar avatar Apr 14 '24 09:04 mitar

I am just thinking that, since Go 1.20, there is a cheap way to convert between string and []byte:

  • https://pkg.go.dev/unsafe@master#StringData
  • https://pkg.go.dev/unsafe@master#SliceData & https://pkg.go.dev/unsafe@master#String

func String2ByteSlice(str string) []byte {
	return unsafe.Slice(unsafe.StringData(str), len(str))
}

func ByteSlice2String(bs []byte) string {
	if len(bs) == 0 {
		return ""
	}
	return unsafe.String(unsafe.SliceData(bs), len(bs))
}

So maybe pgx could simply convert between them as necessary.

mitar avatar Apr 14 '24 16:04 mitar

For string itself, it seems like I would have to unregister ByteaCodec and register a custom one? Am I mistaken? Or is there an easier way? Maybe just using a TryWrapScanPlanFuncs?

Yeah, either of these is a possibility, though you will still have the ambiguity of text vs. binary protocol.

I am just thinking that since Go 1.20, there is cheap way to convert between string and []byte:

I don't really see where these could be safely used. The read buffer is reused for every row. The unsafely constructed string would be corrupted by future rows.
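A minimal, self-contained sketch of that hazard, with the buffer reuse simulated by a plain copy over the same slice:

```go
package main

import (
	"fmt"
	"unsafe"
)

// demoAliasing shows why a zero-copy string built over a reused read
// buffer is unsafe: the string silently changes when the buffer is
// overwritten by the next row, while an ordinary copy stays intact.
func demoAliasing() (unsafeView, safeCopy string) {
	buf := []byte("jack") // simulated reusable read buffer

	// Zero-copy "conversion": unsafeView aliases buf's backing array.
	unsafeView = unsafe.String(unsafe.SliceData(buf), len(buf))

	// Ordinary conversion: safeCopy gets its own bytes.
	safeCopy = string(buf)

	// The next "row" overwrites the buffer...
	copy(buf, "jill")

	// ...so unsafeView now reads "jill" while safeCopy still reads "jack".
	return unsafeView, safeCopy
}

func main() {
	u, c := demoAliasing()
	fmt.Println(u, c) // jill jack
}
```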

jackc avatar Apr 15 '24 12:04 jackc

I don't really see where these could be safely used. The read buffer is reused for every row. The unsafely constructed string would be corrupted by future rows.

Oh, you are right, for scanning it does not work.

mitar avatar Apr 15 '24 12:04 mitar

A little late to the party, but this works for me @mitar:

var pw []byte
err := DBPool.QueryRow(context.Background(), fmt.Sprintf(query, user_table_quoted), email).Scan(&pw)
return string(pw), err

vphatfla avatar Sep 10 '24 23:09 vphatfla

@vphatfla Yes, I would like that it works for Go string values, not Go []byte values (which already works).

mitar avatar Sep 11 '24 07:09 mitar

@mitar, oh I see, I'm a noob here; might I ask how expensive it is to repeatedly convert from []byte to string and vice versa?

vphatfla avatar Sep 12 '24 13:09 vphatfla

With recent changes in Go it is very cheap. But the issue here is about ergonomics: if pgx does it internally, then it can do it a) efficiently, b) without the developer having to do it manually every time.

mitar avatar Sep 12 '24 13:09 mitar

But doesn't it make more sense to force developers to convert the string to []byte before saving to Postgres instead? My thinking is that the data type in Postgres is BYTEA, so logically it should expect []byte to be saved and retrieved, not string. Doing it this way might help new developers keep a clearer view of the system and thus make fewer bugs in the future.

vphatfla avatar Sep 12 '24 13:09 vphatfla

Sure. This is a design question for the library: is storing a string into bytea an error or a feature? This is why the issue exists, so that the community and maintainers can discuss the design here.

My view is that I would prefer that this is done by the library automatically. Generally, developers should have a good understanding of what the column type is and what data type they are storing, and the library should try to accommodate that and make it easy (especially now that it is possible to do so efficiently), to keep boilerplate and repeated code down.

mitar avatar Sep 12 '24 13:09 mitar

My view is that I would prefer that this is done by the library automatically. Generally, developers should have a good understanding of what the column type is and what data type they are storing, and the library should try to accommodate that and make it easy (especially now that it is possible to do so efficiently), to keep boilerplate and repeated code down.

I agree, if it can be done safely. But as I mentioned above (https://github.com/jackc/pgx/issues/1977#issuecomment-2053856959), I don't know how it can be done without the possibility of silently scanning the data incorrectly.

jackc avatar Sep 13 '24 01:09 jackc