cayley
cayley copied to clipboard
Enable native byte array fields on structs
Auto-converts []byte arrays to strings behind the scenes so the bytes are not stored as individual integers on the graph.
Examples:
// Create some struct with native []byte array field inside
type MyStruct struct {
ID string `json:"id" quad:"@id"`
Name string `json:"name"` // Sample text field
Data []byte `json:"data"` // Sample binary field
}
// Register schema
schema.RegisterType(quad.IRI("mystruct"), MyStruct{})
// Write struct to graph
m := &MyStruct{
ID: "1",
Name: "Sample struct with embedded byte array",
Data: []byte{'H', 'e', 'l', 'l', 'o', ' ', 'W', 'o', 'r', 'l', 'd', '!'},
}
schema.WriteAsQuads(qw, m)
// Write raw bytes to graph
bytes := quad.ByteArr(m.Data)
schema.WriteAsQuads(qw, bytes)
quad/value.go, line 453 at r1 (raw file):
Previously, dennwc (Denys Smirnov) wrote…
To be compatible with RDF we will probably need to encode it differently.
This may act as a reference, although I'm not sure it's the right one: https://www.w3.org/TR/Content-in-RDF10/#bytesProperty
So maybe the "canonical RDF encoding" is the base64 with the type mentioned in that section.
I thought about that. I implemented base64 encoding/decoding. there is teh https://www.w3.org/TR/xmlschema-2/#base64Binary definition we can use. How should I define it in the code?
schema/writer.go, line 95 at r1 (raw file):
Previously, dennwc (Denys Smirnov) wrote…
It's better to save the result of
reflect.TypeOf([]byte(nil))
to a global and use it here
Done.
voc/schema/schema.go, line 29 at r1 (raw file):
Previously, dennwc (Denys Smirnov) wrote…
This file corresponds to the types defined by https://schema.org, and there is no such type as "ByteArr".
I just made it up so I could test and run the code. I don't think there is a definition for just plain byte array data. Will it break things if we just take it out?
For the tests, I might need your help with the protobuf marshalling/unmarshalling.
With Bolt, I see the marshalling of quad.Bytes to ValueRaw going through, but the return ToNative() returns an empty array so the test fails.
Here's what I have so far, but you may be faster at figuring out where/what's failing... : )
func MakeValue(qv quad.Value) *Value {
...
case quad.Bytes:
return &Value{&Value_Raw{v.Native().([]byte)}}
...
}
// ToNative converts protobuf Value to quad.Value.
func (m *Value) ToNative() (qv quad.Value) {
...
case *Value_Raw:
return quad.Bytes(v.Raw)
...
}
Inching along...Defined quad.Bytes
as proper []byte
wrapper and I see the bytes getting stored as typed strings: "Hello World!"^^<schema:Bytes>
on both Badger and Bolt.
The LoadTypedQuad
Test is still not passing even though it's stored. I suspect resolving the bucket key is part of the issue, but I don't quite understand how the keying and indexing work yet...any comments/pointers/help very much appreciated...
Thnxs!
internal/mapset/map.go, line 36 at r5 (raw file):
Previously, dennwc (Denys Smirnov) wrote…
Do we really need a B-Tree for this simple use case? Why Go maps cannot be used?
Go maps cough up when using quad.Bytes as a map key since it is defined by a slice of bytes. Btree is just a choice. It can be replaced with any proper map-like struct. Perhaps we can do some background trickery to convert strings <-> []bytes in the background...
internal/mapset/set.go, line 7 at r5 (raw file):
Previously, dennwc (Denys Smirnov) wrote…
Same here, do we really need a generic
Set
interface? Why not define a specific implementation?
Not used. will take it out.
quad/value.go, line 450 at r6 (raw file):
Previously, dennwc (Denys Smirnov) wrote…
It should store raw bytes in the value, not base64. It should convert to base64 only in
TypedValue
.
Done.
quad/pquads/quads.go, line 23 at r5 (raw file):
Previously, dennwc (Denys Smirnov) wrote…
It may be better to remove this case and add a generic one before
default
. This case may check for theTypedString()
method on the value (define an interface somewhere) and convert it.So if it recognizes a specific type - it converts directly. If it doesn't, but the values can be converted to
TypeString
- it will be encoded this way as a fallback.
Fixed to store as raw bytes.
quad/pquads/quads.go, line 97 at r5 (raw file):
Previously, dennwc (Denys Smirnov) wrote…
Instead of asserting for a specific type, there should be references to
TypedString.ParseValue()
call - it will check for the known value types and will parse the string with a specified function. Seequad.RegisterStringConversion
.
Fixed. Removed.
voc/schema/schema.go, line 29 at r1 (raw file):
Previously, dennwc (Denys Smirnov) wrote…
Yeah, it should be removed from here, but can be added as an unexported constant to
quad
package.
If we removed this, what does the TypedString type become?
internal/mapset/comparator.go, line 52 at r5 (raw file):
Previously, dennwc (Denys Smirnov) wrote…
This case can be used for every other type except
Time
and[]byte
.
Removed.
internal/mapset/map.go, line 36 at r5 (raw file):
Previously, dennwc (Denys Smirnov) wrote…
They have a special optimization for bytes if you use it like this:
m := make(map[string]interface{}) // cast to string should be inline for optimization to work m[string(b)] = x // this won't be optimized k := string(b) m[s] = x
Removed.
internal/mapset/set.go, line 7 at r5 (raw file):
Previously, fungl164 wrote…
Not used. will take it out.
Removed.
quad/value.go, line 453 at r1 (raw file):
Previously, dennwc (Denys Smirnov) wrote…
There should be some canonical URI/IRI for this type. I'm not sure
https://www.w3.org/TR/xmlschema-2/#
is the right namespace, though. It may be somewhere in XML namespaces - RDF inherits some of them.
I'm thinking converting raw bytes to b64 and wrapping the result as CDATA would actually be RDF-compliant. This may actually be a workable solution since CDATA makes no guarantees and leaves it up to the reader to interpret the contents inside the tag...Just need to ensure the docs reflect what the interpretation should be...or we could just skip the CDATA encapsulation and just rely on the docs...what do you think?
quad/value.go, line 453 at r1 (raw file):
Previously, dennwc (Denys Smirnov) wrote…
CDATA only works for XML form of RDF, which is rarely used. So we will need a namespace anyway for it to work with NQuads/Turtle/JSON-LD formats.
Will need you to take the lead on this.
voc/schema/schema.go, line 29 at r1 (raw file):
Previously, dennwc (Denys Smirnov) wrote…
We just need to find the right type, and it can be removed from here (since it's not related to schema.org), and added as a constant directly in
quad
package.
Ok. I'll remove it when we find the right type.
graph/graphtest/graphtest.go, line 710 at r7 (raw file):
Previously, dennwc (Denys Smirnov) wrote…
Sorry, I meant
quad.IRI("C")
->quad.IRI("bytes")
. Andquad.Raw("<bytes>")
->quad.Bytes("<bytes>")
. This should create the collision that I was talking about.
I think I see what you meant. I faked-forced a different hash for Bytes vs IRI by inserting an extra byte. Dunno if its the best way or if there will be other clashes, but hopefully unlikely if the hash is strong enough... : )
graph/graphtest/graphtest.go, line 710 at r7 (raw file):
Previously, fungl164 wrote…
I think I see what you meant. I faked-forced a different hash for Bytes vs IRI by inserting an extra byte. Dunno if its the best way or if there will be other clashes, but hopefully unlikely if the hash is strong enough... : )
If the extra byte is not enough of a differentiator, adding a stronger suffix and/or prefix should make it work... : )
quad/value.go, line 66 at r5 (raw file):
Previously, dennwc (Denys Smirnov) wrote…
Let's make it a
\x00
prefix instead of a\x01
suffix. If later we decide to change the hash for other value, we will use this extra byte as a value type ID.
Done.
What is the status of this PR?
Need to rebase and check it again.