Add support for binary feed format
Feeding documents with large tensor fields (e.g. tensor(p{},dt{},x[128])) using JSON or XML (deprecated) serialization is cumbersome: the string representation of float/double values costs a lot of network bandwidth, storage, and processing (serialization and deserialization).
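To give a rough sense of the overhead, here is a minimal sketch comparing the size of 128 floats as JSON text with the same values packed as raw little-endian 32-bit floats. The packed layout is only an illustration and is not Vespa's binary document format; the helper name `encoded_sizes` and the `{"values": [...]}` wrapper are assumptions made for the example.

```python
import json
import random
import struct


def encoded_sizes(values):
    """Return (json_bytes, float32_bytes) for a list of floats."""
    # Text form: every float is written out as a decimal string.
    as_json = json.dumps({"values": values}).encode("utf-8")
    # Raw form: 4 bytes per value, little-endian float32 (illustrative only).
    as_binary = struct.pack(f"<{len(values)}f", *values)
    return len(as_json), len(as_binary)


# Fixed seed so the comparison is repeatable.
rng = random.Random(0)
vector = [rng.random() for _ in range(128)]

json_size, binary_size = encoded_sizes(vector)
print(f"JSON: {json_size} bytes, packed float32: {binary_size} bytes")
```

The packed form is always 4 bytes per value, while the JSON size depends on how many digits each value prints with. Packing to float32 also drops precision relative to the double printed in the JSON.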
Should we have a sample docproc that transforms a binary field into a tensor field?
We do have an undocumented tool, 'vespa-feed-perf', for simple file-based usage. It can take a .json or .xml file and generate serialized binary documents in our undocumented binary format. You can then compress this file, transfer it, and use the same vespa-feed-perf tool to feed it to Vespa. This is what some of the performance tests do to reduce the amount of data. If you are using the httpclient, I guess it can use gzip compression to reduce network cost.
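As a rough illustration of the gzip suggestion, the sketch below compresses a JSON document with Python's standard library and compares sizes. It does not use any Vespa client API; the document shape and the field name `embedding` are hypothetical, and the helper `gzip_json` exists only for this example.

```python
import gzip
import json
import random


def gzip_json(document):
    """Serialize a document to JSON and return (raw_bytes, gzipped_bytes)."""
    raw = json.dumps(document).encode("utf-8")
    return raw, gzip.compress(raw)


# Hypothetical document with one 128-value tensor field.
rng = random.Random(0)
doc = {"fields": {"embedding": {"values": [rng.random() for _ in range(128)]}}}

raw, compressed = gzip_json(doc)
print(f"raw JSON: {len(raw)} bytes, gzipped: {len(compressed)} bytes")
```

Compression only reduces the bytes on the wire. The JSON text still has to be generated on the client and parsed on the receiving side.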
I think the main pain point is storage and the cost of serialization and deserialization, including compression. To feed from the grid I need to convert to JSON and transfer it over the wire through the Vespa HTTP client; it is then deserialized and converted to the Vespa binary protocol.