cayley icon indicating copy to clipboard operation
cayley copied to clipboard

Enable native byte array fields on structs

Open fungl164 opened this issue 6 years ago • 22 comments

Auto-converts []byte arrays to strings behind the scenes so the bytes are not stored as individual integers on the graph.

Examples:

// Create some struct with native []byte array field inside
type MyStruct struct {
	ID   string `json:"id" quad:"@id"`
	Name string `json:"name"`  // Sample text field
	Data []byte `json:"data"`  // Sample binary field
}

// Register schema
schema.RegisterType(quad.IRI("mystruct"), MyStruct{})

// Write struct to graph
m := &MyStruct{
	ID:   "1",
	Name: "Sample struct with embedded byte array",
	Data: []byte{'H', 'e', 'l', 'l', 'o', ' ', 'W', 'o', 'r', 'l', 'd', '!'},
}
schema.WriteAsQuads(qw, m)


// Write raw bytes to graph
bytes := quad.ByteArr(m.Data)
schema.WriteAsQuads(qw, bytes)

This change is Reviewable

fungl164 avatar Feb 14 '19 21:02 fungl164


quad/value.go, line 453 at r1 (raw file):

Previously, dennwc (Denys Smirnov) wrote…

To be compatible with RDF we will probably need to encode it differently.

This may act as a reference, although I'm not sure it's the right one: https://www.w3.org/TR/Content-in-RDF10/#bytesProperty

So maybe the "canonical RDF encoding" is the base64 with the type mentioned in that section.

I thought about that. I implemented base64 encoding/decoding. there is teh https://www.w3.org/TR/xmlschema-2/#base64Binary definition we can use. How should I define it in the code?

fungl164 avatar Feb 14 '19 23:02 fungl164


schema/writer.go, line 95 at r1 (raw file):

Previously, dennwc (Denys Smirnov) wrote…

It's better to save the result of reflect.TypeOf([]byte(nil)) to a global and use it here

Done.

fungl164 avatar Feb 14 '19 23:02 fungl164


voc/schema/schema.go, line 29 at r1 (raw file):

Previously, dennwc (Denys Smirnov) wrote…

This file corresponds to the types defined by https://schema.org, and there is no such type as "ByteArr".

I just made it up so I could test and run the code. I don't think there is a definition for just plain byte array data. Will it break things if we just take it out?

fungl164 avatar Feb 15 '19 00:02 fungl164

For the tests, I might need your help with the protobuf marshalling/unmarshalling.

With Bolt, I see the marshalling of quad.Bytes to ValueRaw going through, but the return ToNative() returns an empty array so the test fails.

Here's what I have so far, but you may be faster at figuring out where/what's failing... : )

func MakeValue(qv quad.Value) *Value {
	...
	case quad.Bytes:
		return &Value{&Value_Raw{v.Native().([]byte)}}
	...
}


// ToNative converts protobuf Value to quad.Value.
func (m *Value) ToNative() (qv quad.Value) {
	...
	case *Value_Raw:
		return quad.Bytes(v.Raw)
	...
}

fungl164 avatar Feb 15 '19 01:02 fungl164

Inching along...Defined quad.Bytes as proper []byte wrapper and I see the bytes getting stored as typed strings: "Hello World!"^^<schema:Bytes> on both Badger and Bolt.

The LoadTypedQuad Test is still not passing even though it's stored. I suspect resolving the bucket key is part of the issue, but I don't quite understand how the keying and indexing work yet...any comments/pointers/help very much appreciated...

Thnxs!

fungl164 avatar Feb 16 '19 05:02 fungl164


internal/mapset/map.go, line 36 at r5 (raw file):

Previously, dennwc (Denys Smirnov) wrote…

Do we really need a B-Tree for this simple use case? Why Go maps cannot be used?

Go maps cough up when using quad.Bytes as a map key since it is defined by a slice of bytes. Btree is just a choice. It can be replaced with any proper map-like struct. Perhaps we can do some background trickery to convert strings <-> []bytes in the background...

fungl164 avatar Feb 19 '19 06:02 fungl164


internal/mapset/set.go, line 7 at r5 (raw file):

Previously, dennwc (Denys Smirnov) wrote…

Same here, do we really need a generic Set interface? Why not define a specific implementation?

Not used. will take it out.

fungl164 avatar Feb 19 '19 06:02 fungl164


quad/value.go, line 450 at r6 (raw file):

Previously, dennwc (Denys Smirnov) wrote…

It should store raw bytes in the value, not base64. It should convert to base64 only in TypedValue.

Done.

fungl164 avatar Feb 21 '19 16:02 fungl164


quad/pquads/quads.go, line 23 at r5 (raw file):

Previously, dennwc (Denys Smirnov) wrote…

It may be better to remove this case and add a generic one before default. This case may check for the TypedString() method on the value (define an interface somewhere) and convert it.

So if it recognizes a specific type - it converts directly. If it doesn't, but the values can be converted to TypeString - it will be encoded this way as a fallback.

Fixed to store as raw bytes.

fungl164 avatar Feb 21 '19 16:02 fungl164


quad/pquads/quads.go, line 97 at r5 (raw file):

Previously, dennwc (Denys Smirnov) wrote…

Instead of asserting for a specific type, there should be references to TypedString.ParseValue()call - it will check for the known value types and will parse the string with a specified function. See quad.RegisterStringConversion.

Fixed. Removed.

fungl164 avatar Feb 21 '19 16:02 fungl164


voc/schema/schema.go, line 29 at r1 (raw file):

Previously, dennwc (Denys Smirnov) wrote…

Yeah, it should be removed from here, but can be added as an unexported constant to quad package.

If we removed this, what does the TypedString type become?

fungl164 avatar Feb 21 '19 16:02 fungl164


internal/mapset/comparator.go, line 52 at r5 (raw file):

Previously, dennwc (Denys Smirnov) wrote…

This case can be used for every other type except Time and []byte.

Removed.

fungl164 avatar Feb 21 '19 16:02 fungl164


internal/mapset/map.go, line 36 at r5 (raw file):

Previously, dennwc (Denys Smirnov) wrote…

They have a special optimization for bytes if you use it like this:

m := make(map[string]interface{})
// cast to string should be inline for optimization to work
m[string(b)] = x
// this won't be optimized
k := string(b)
m[s] = x

Removed.

fungl164 avatar Feb 21 '19 16:02 fungl164


internal/mapset/set.go, line 7 at r5 (raw file):

Previously, fungl164 wrote…

Not used. will take it out.

Removed.

fungl164 avatar Feb 21 '19 16:02 fungl164


quad/value.go, line 453 at r1 (raw file):

Previously, dennwc (Denys Smirnov) wrote…

There should be some canonical URI/IRI for this type. I'm not sure https://www.w3.org/TR/xmlschema-2/# is the right namespace, though. It may be somewhere in XML namespaces - RDF inherits some of them.

I'm thinking converting raw bytes to b64 and wrapping the result as CDATA would actually be RDF-compliant. This may actually be a workable solution since CDATA makes no guarantees and leaves it up to the reader to interpret the contents inside the tag...Just need to ensure the docs reflect what the interpretation should be...or we could just skip the CDATA encapsulation and just rely on the docs...what do you think?

fungl164 avatar Feb 21 '19 17:02 fungl164


quad/value.go, line 453 at r1 (raw file):

Previously, dennwc (Denys Smirnov) wrote…

CDATA only works for XML form of RDF, which is rarely used. So we will need a namespace anyway for it to work with NQuads/Turtle/JSON-LD formats.

Will need you to take the lead on this.

fungl164 avatar Feb 21 '19 18:02 fungl164


voc/schema/schema.go, line 29 at r1 (raw file):

Previously, dennwc (Denys Smirnov) wrote…

We just need to find the right type, and it can be removed from here (since it's not related to schema.org), and added as a constant directly in quad package.

Ok. I'll remove it when we find the right type.

fungl164 avatar Feb 21 '19 18:02 fungl164


graph/graphtest/graphtest.go, line 710 at r7 (raw file):

Previously, dennwc (Denys Smirnov) wrote…

Sorry, I meant quad.IRI("C") -> quad.IRI("bytes"). And quad.Raw("<bytes>") -> quad.Bytes("<bytes>"). This should create the collision that I was talking about.

I think I see what you meant. I faked-forced a different hash for Bytes vs IRI by inserting an extra byte. Dunno if its the best way or if there will be other clashes, but hopefully unlikely if the hash is strong enough... : )

fungl164 avatar Feb 21 '19 19:02 fungl164


graph/graphtest/graphtest.go, line 710 at r7 (raw file):

Previously, fungl164 wrote…

I think I see what you meant. I faked-forced a different hash for Bytes vs IRI by inserting an extra byte. Dunno if its the best way or if there will be other clashes, but hopefully unlikely if the hash is strong enough... : )

If the extra byte is not enough of a differentiator, adding a stronger suffix and/or prefix should make it work... : )

fungl164 avatar Feb 21 '19 19:02 fungl164


quad/value.go, line 66 at r5 (raw file):

Previously, dennwc (Denys Smirnov) wrote…

Let's make it a \x00 prefix instead of a \x01 suffix. If later we decide to change the hash for other value, we will use this extra byte as a value type ID.

Done.

fungl164 avatar Feb 21 '19 20:02 fungl164

What is the status of this PR?

iddan avatar Sep 19 '19 10:09 iddan

Need to rebase and check it again.

dennwc avatar Sep 23 '19 03:09 dennwc