clipper
Implement a high-performance RPC Query frontend
The current query frontend accepts prediction requests through a REST-like interface using HTTP+JSON. This has the benefit of being easy to use and widely supported, but comes at a performance cost.
The biggest current question is what RPC system to use. Whichever system we use must be able to sustain hundreds of queries per second.
Do we want to write our own RPC client and distribute a single client package that lets users send either REST or RPC requests?
Any update on this issue?
We've implemented a GRPC prototype in an experimental branch, and actually found that it didn't have great performance.
@withsmilo is your interest in this for performance reasons? And if so, would you mind sharing a bit more about what your expected workloads and performance demands are? In particular, are you looking for lower latency (which GRPC should help with), higher throughput (which GRPC will be less helpful for), or both?
@dcrankshaw : Thank you for your reply. I'm looking for some way (a GRPC implementation or something else) to handle high workloads on the Query Frontend service. Can I easily scale the Query Frontend service in a Kubernetes environment? I found a replicas value in query-frontend-deployment.yaml, but I think I would need to modify many parts of the Clipper source.
Got it. Unfortunately the query frontend doesn't scale well with Kubernetes quite yet. Here's a brief explanation about why from the Clipper docs. The good news is that there will be some significant performance improvements to the query frontend coming down the pipe soon, and we are actively investigating how to best scale out the query frontend in Kubernetes (#310).
Could you share some benchmark results from the GRPC implementation? The description says "Whichever system we use must be able to sustain hundreds of queries per second" — is that the performance goal?
@dcrankshaw : Could you please share some benchmarks for the gRPC-based query frontend? We are looking for a gRPC-based communication channel in addition to REST/HTTP. We do not have a very high throughput requirement, but we need a gRPC frontend to integrate with our internal tools. Would it make sense to have a gRPC-based query frontend as an optional communication channel that can be specified during deployment?
Unfortunately the benchmarks we have are not directly comparable because there were other modifications to the system as well. The more I think about this though the more I think it makes sense to support a GRPC channel as well, both for high(er) performance and because GRPC is a fairly widespread library used in distributed microservice architectures.
@dcrankshaw I see you have a branch on this... if it's okay with you, I would like to take this up. Please let me know.
@santi81 that would be great! We structured the Clipper codebase with the intention of supporting multiple query frontends, so hopefully it shouldn't be overly complicated. As you noticed, I have a branch with a prototype grpc implementation.
The way the code is structured, all of the core system code is in src/libclipper, and the query frontend code is in src/frontends. Basically, what you'll want to do is define a GRPC interface that supports the same types of inputs as the REST frontend (batch and single input at a time for strings, floats, doubles, ints, and bytes). The only other important logic that goes in the query frontend is the code that listens for events like new applications or models being registered in this section.
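As a rough sketch of what such a GRPC interface could look like (the service, message, and field names below are hypothetical illustrations, not taken from the Clipper codebase), assuming proto3:

```proto
syntax = "proto3";

package clipper;

// One typed input, mirroring the input types the REST frontend accepts.
message Input {
  oneof data {
    string string_data = 1;
    bytes byte_data = 2;
    FloatVector float_data = 3;
    DoubleVector double_data = 4;
    IntVector int_data = 5;
  }
}

message FloatVector { repeated float value = 1; }
message DoubleVector { repeated double value = 1; }
message IntVector { repeated int32 value = 1; }

message PredictRequest {
  string application = 1;     // which Clipper application to query
  repeated Input inputs = 2;  // one entry for a single input, many for a batch
}

message PredictResponse {
  repeated string outputs = 1;  // one output per input, in order
}

service QueryFrontend {
  rpc Predict (PredictRequest) returns (PredictResponse);
}
```

Using a `repeated Input` field covers both the single-input and batch cases with one RPC, while the `oneof` keeps each input strongly typed.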
Go ahead and assign yourself the issue and create a PR once you have a working prototype, and let us know if you have any questions or run into any issues (either by commenting on this issue or pinging us on Slack).
Are there plans to make the interface symmetric? Currently it's easy to send images but very inefficient to return them. With the rise of GANs, NNs are returning much more than simple predictions. We are investigating using Clipper, but our network needs to be able to return images and animations.
@pbontrager This is great feedback. Can you expand a bit more on your performance requirements? For example, if we had a zero-copy frontend RPC interface that allowed you to use byte arrays both to send inputs and receive outputs, would that be sufficient? How big are your inputs/outputs, and what types of throughputs/latencies do you expect?
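As a rough illustration (message and field names here are hypothetical, not part of Clipper), a symmetric byte-oriented interface could use the same `bytes` payload type on both sides, so images or video can be returned as efficiently as they are sent:

```proto
// Hypothetical symmetric request/response pair: bytes in, bytes out.
message BytesPredictRequest {
  string application = 1;
  bytes input = 2;   // e.g. an encoded image
}

message BytesPredictResponse {
  bytes output = 1;  // e.g. an encoded image or animation
}
```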
This would be great for our requirements. We would be generating short video clips rather than streaming in real time. Also, at that point, the bottleneck is probably running the inferences on CPU.