goavro icon indicating copy to clipboard operation
goavro copied to clipboard

Getting a single binary-encoded object [OCFReader.ReadBinary() function?]

Open floren opened this issue 6 years ago • 8 comments

I'd like to be able to take an OCF as input, then pull out individual binary-encoded objects one at a time. I will store the schema separately and later apply it to these individual objects using Codec.NativeFromBinary.

I can currently use OCFReader.Scan/Read and Codec.BinaryFromNative to get a single binary-encoded object, but the call to OCFReader.Read invokes Codec.NativeFromBinary. Because performance is an important consideration in this, I'd like to be able to skip those redundant decoding/encoding calls.

Ideally, something like OCFReader.ReadBinary() would return a []byte containing exactly one Avro object. I'm trying to implement it, but being unfamiliar with goavro I'm having a hard time figuring out if I can generically pull out the slice of bytes representing a single object without basically reimplementing nativeFromBinary functions for each Avro type. Any suggestions?

floren avatar Apr 20 '18 17:04 floren

Here's a hacky, apparently-functional implementation of ReadBinary, but it still calls NativeFromBinary because I haven't yet figured out how to figure out the length of an object without essentially reimplementing portions of the nativeFromBinary functions of each individual object:

func (ocfr *OCFReader) ReadBinary() ([]byte, error) {
	// NOTE: Test previous error before testing readReady to prevent overwriting
	// previous error.
	if ocfr.rerr != nil {
		return nil, ocfr.rerr
	}
	if !ocfr.readReady {
		ocfr.rerr = errors.New("Read called without successful Scan")
		return nil, ocfr.rerr
	}
	ocfr.readReady = false

	olen := len(ocfr.block)

	// We do this to consume the next entry...
	var nblock []byte
	_, nblock, ocfr.rerr = ocfr.header.codec.NativeFromBinary(ocfr.block)
	if ocfr.rerr != nil {
		return nil, ocfr.rerr
	}
	ret := make([]byte, olen-len(ocfr.block))
	copy(ret, ocfr.block)

	ocfr.block = nblock

	ocfr.remainingBlockItems--

	return ret, nil
}

It does provide approximately a 50% improvement for my use case, but it's still got redundancies.

floren avatar Apr 20 '18 17:04 floren

This is a very sensible feature.

karrick avatar Apr 27 '18 12:04 karrick

Any more consideration for adding support for this method?

I have a use case requiring an efficient way to scan/skip records. I will be maintaining an external index file from a field value to the record offset. So ideally, I would like a way to seek to specific record offsets in order to pull out and materialize the specific records.

This method will reduce the overhead of performing a full table scan since records can be skipped without the overhead of unmarshalling them. A sequential scan using this method along with a way to split up an Avro file to be processed in parallel (if the file is particularly large), would likely be the ideal approach for this use case (as supposed to random access across blocks).

Any thoughts or guidance on this use case for would be appreciated!

bruth avatar Nov 13 '18 02:11 bruth

Peaking through the codec code, it seems to get the best performance, the codec type needs to be extended to support reading the raw bytes without any decoding (which the @floren's code example still does). This means all of the functions that create a codec need to be updated to support this.

bruth avatar Nov 14 '18 00:11 bruth

Each piece of data stored in Avro does not have a header that specifies how many bytes it consumes. That would be cool if it did, and it would allow us to provide a means of addressing this issue, but also several others surrounding schema evolutions.

The closest thing that we get is the OCF format does have a block header that specifies how many bytes are used in each block, so a OCF reader can scan and skip B blocks before it starts decoding block B+1 to find item N.

karrick avatar Jun 13 '19 16:06 karrick

naive question: putting performance aside for a second could you just add ocfr.block to the list of return items?

kmulvey avatar Jul 12 '19 22:07 kmulvey

@kmulvey , I have added the 108 branch as a strawman change so we can toss back and forth ideas about a proper API.

https://github.com/linkedin/goavro/tree/108

karrick avatar Jul 15 '19 17:07 karrick

@karrick I think the changes in the 108 branch would resolve an issue for me. What is the process for this getting merged into master?

tkonya avatar Feb 24 '21 21:02 tkonya