clipper
First query fails when model reconnects
Steps to reproduce:
- Start Clipper locally (via `bin/start_clipper.sh`)
- Connect the no-op container by executing the container script in a Python shell
- Register a model associated with the no-op container, create an application, and link the model to the application
- Query the application, observe that queries are routed normally
- Restart the no-op container by killing its associated process and re-launching the container script.
- Send a new query to the application. Observe that it times out.
- Send the new query again. Observe that it is routed to the model and completes on time.
Cause:
- After processing the first query, the initial instance of the no-op container (replica zero) initiates a callback within the C++ system that waits on the model request queue (https://github.com/ucbrise/clipper/blob/745a6f7387ceaff6e5ffd4378eb3856cbdb6084b/src/libclipper/include/clipper/task_executor.hpp#L449).
- In the meantime, the container dies. When the container reconnects, it is treated as a new replica (replica one).
- When a new query is sent, the callback's wait operation completes and the system attempts to send the query data to the container with the replica ID that was specified when the callback was initiated (replica zero).
- This query data passes through Clipper's RPC system and is never processed by a recipient, so the query times out.
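
The sequence described under "Cause" can be illustrated with a small toy model. This is only a sketch in Python and not Clipper's C++ code: `ToyRpc`, `ToyExecutor`, `on_replica_ready`, `submit`, and `reproduce` are hypothetical names invented for illustration. It assumes that each waiting callback captures one replica ID, that the oldest waiting callback is woken first, and that the RPC layer silently drops messages addressed to a replica that is no longer connected.

```python
from collections import deque


class ToyRpc:
    """Hypothetical stand-in for an RPC layer that only delivers to live replicas."""

    def __init__(self):
        self.inboxes = {}   # replica_id -> messages delivered to that replica
        self.dropped = []   # (replica_id, message) pairs that nobody received

    def connect(self, replica_id):
        self.inboxes[replica_id] = []

    def disconnect(self, replica_id):
        del self.inboxes[replica_id]

    def send(self, replica_id, message):
        if replica_id not in self.inboxes:
            # No recipient: the message is lost and the caller would time out.
            self.dropped.append((replica_id, message))
            return False
        self.inboxes[replica_id].append(message)
        return True


class ToyExecutor:
    """Hypothetical executor: each waiting callback remembers one replica ID."""

    def __init__(self, rpc):
        self.rpc = rpc
        self.waiters = deque()

    def on_replica_ready(self, replica_id):
        # The callback captures replica_id now and uses it when a query arrives.
        self.waiters.append(replica_id)

    def submit(self, message):
        if not self.waiters:
            return False
        replica_id = self.waiters.popleft()  # the oldest waiting callback wakes up
        delivered = self.rpc.send(replica_id, message)
        if delivered:
            # A live replica finishes the query and waits for the next one.
            self.on_replica_ready(replica_id)
        return delivered


def reproduce():
    rpc = ToyRpc()
    executor = ToyExecutor(rpc)
    rpc.connect(0)
    executor.on_replica_ready(0)
    results = [executor.submit("query 1")]      # delivered to replica zero
    rpc.disconnect(0)                           # the container is killed
    rpc.connect(1)                              # the relaunched container is replica one
    executor.on_replica_ready(1)
    results.append(executor.submit("query 2"))  # stale callback targets replica zero
    results.append(executor.submit("query 3"))  # replica one's callback delivers it
    return results, rpc.dropped
```

In this toy model, `reproduce()` returns `([True, False, True], [(0, "query 2")])`: the first query after the restart is addressed to the dead replica zero and lost, and the next one reaches replica one, which matches the behavior in the steps to reproduce.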
@withsmilo @simon-mo Not sure if this problem still persists? I will add it to the task first.
@rkooo567 Let me check it, but I think the problem still happens.
@withsmilo Thanks a lot! 👍