Add support for binary feed format

Open jobergum opened this issue 3 years ago • 3 comments

Feeding documents with large tensor fields (e.g. tensor(p{},dt{},x[128])) using JSON or XML (deprecated) serialization is cumbersome: the string representation of float/double values costs a lot of network bandwidth, storage, and processing (serialization and deserialization).
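To make the overhead concrete, here is a rough sketch comparing the size of one 128-value dense vector as JSON text and as packed 32-bit floats. The helper name `encoded_sizes` and the random test vector are illustrative only, and the packed layout is plain little-endian float32, not Vespa's internal binary format.

```python
import json
import random
import struct


def encoded_sizes(values):
    """Return (json_bytes, binary_bytes) for a list of floats.

    json_bytes: size of the UTF-8 JSON array text.
    binary_bytes: size as little-endian float32 (4 bytes per value).
    """
    json_bytes = len(json.dumps(values).encode("utf-8"))
    binary_bytes = len(struct.pack(f"<{len(values)}f", *values))
    return json_bytes, binary_bytes


random.seed(0)
# Hypothetical 128-dimensional vector, matching x[128] in the tensor type.
vector = [random.uniform(-1.0, 1.0) for _ in range(128)]
json_size, binary_size = encoded_sizes(vector)
print(f"JSON: {json_size} bytes, float32: {binary_size} bytes")
```

The packed form is always 4 bytes per value, while the JSON text grows with the number of digits printed for each float plus separators.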

jobergum avatar Jan 06 '21 14:01 jobergum

Should we have a sample docproc (document processor) that transforms a binary field into a tensor field?
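The core of such a transform could look roughly like the sketch below. It is written in Python only to show the idea; it is not a Vespa document processor, and it assumes (hypothetically) that the binary field carries base64-encoded little-endian float32 values. The function names are made up for illustration.

```python
import base64
import struct


def floats_to_binary_field(values):
    """Pack floats as little-endian float32 and base64-encode them (assumed layout)."""
    raw = struct.pack(f"<{len(values)}f", *values)
    return base64.b64encode(raw).decode("ascii")


def binary_field_to_tensor_values(encoded):
    """Decode the assumed base64/float32 layout back into a list of cell values."""
    raw = base64.b64decode(encoded)
    if len(raw) % 4 != 0:
        raise ValueError("payload length is not a multiple of 4 bytes")
    return list(struct.unpack(f"<{len(raw) // 4}f", raw))


# Round trip with values that float32 represents exactly.
encoded = floats_to_binary_field([0.5, -1.25, 3.0])
decoded = binary_field_to_tensor_values(encoded)
print(decoded)
```

Note that values pass through float32, so doubles that are not exactly representable in 32 bits lose precision on the way.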

kkraune avatar Jan 06 '21 14:01 kkraune

We do have an undocumented tool, vespa-feed-perf, for simple file-based usage. It can take a .json or .xml file and generate serialized binary documents in our undocumented binary format. You can then compress that file, transfer it, and use the same vespa-feed-perf tool to feed it to Vespa. Some of the performance tests do this to reduce the amount of data. If you are using the HTTP client, I guess it can use gzip compression to reduce network cost.
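As a rough illustration of the gzip suggestion, the sketch below compresses a JSON feed operation with Python's standard library and checks that it round-trips. The document id, the field name `embedding`, and the helper `gzip_json` are made up for the example; this does not invoke vespa-feed-perf or any Vespa client.

```python
import gzip
import json
import random


def gzip_json(doc):
    """Serialize doc to UTF-8 JSON and return (raw_bytes, gzipped_bytes)."""
    raw = json.dumps(doc).encode("utf-8")
    return raw, gzip.compress(raw)


random.seed(0)
# Hypothetical put operation with a 128-value dense tensor field.
doc = {
    "put": "id:example:doc::1",
    "fields": {"embedding": {"values": [random.random() for _ in range(128)]}},
}
raw, packed = gzip_json(doc)

# Decompressing and parsing must give back the original document.
restored = json.loads(gzip.decompress(packed))
print(f"raw: {len(raw)} bytes, gzip: {len(packed)} bytes")
```

Compression cuts the bytes on the wire, but the floats are still printed as text on one side and parsed on the other, so the serialization cost itself remains.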

baldersheim avatar Jan 06 '21 14:01 baldersheim

I think the main pain points are storage and the cost of serialization and deserialization, including compression. To feed from the grid, I need to convert to JSON and transfer it over the wire through the Vespa HTTP client; it is then deserialized and converted to the Vespa binary protocol.

jobergum avatar Jan 06 '21 14:01 jobergum