clipper First query fails when model reconnects

Open Corey-Zumar opened this issue 7 years ago • 3 comments

Steps to reproduce:

1. Start Clipper locally (via bin/start_clipper.sh).
2. Connect the no-op container by executing the container script in a Python shell.
3. Register a model associated with the no-op container, create an application, and link the model to the application.
4. Query the application. Observe that queries are routed normally.
5. Restart the no-op container by killing its associated process and re-launching the container script.
6. Send a new query to the application. Observe that it times out.
7. Send the same query again. Observe that it is routed to the model and completes on time.

Cause:

After processing the first query, the initial instance of the no-op container (replica zero) initiates a callback within the C++ system that waits on the model request queue (https://github.com/ucbrise/clipper/blob/745a6f7387ceaff6e5ffd4378eb3856cbdb6084b/src/libclipper/include/clipper/task_executor.hpp#L449).
In the meantime, the container dies. When the container reconnects, it is treated as a new replica (replica one).
When a new query is sent, the callback's wait operation completes and the system attempts to send query data to the container with the replica id that was specified when the callback was initiated (replica zero).
Because replica zero no longer exists, this query data passes through Clipper's RPC system and is never processed by a recipient, so the query times out.
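
To make the sequence above easier to follow, here is a minimal toy model of it in Python. It is not Clipper code: `ToyRpc`, `ToyExecutor`, `wait_for_query`, and `simulate` are hypothetical names invented for illustration, and the real logic lives in the C++ task executor linked above. The sketch assumes each replica registers a waiting callback bound to its own replica id, and that the RPC layer silently drops data addressed to an id that is no longer connected.

```python
from collections import deque


class ToyRpc:
    """Hypothetical stand-in for the RPC layer: delivers only to connected replicas."""

    def __init__(self):
        self.connected = {}  # replica id -> list of delivered queries
        self.dropped = []    # (replica id, query) pairs that nobody received
        self._next_id = 0

    def connect(self):
        # Every (re)connection is treated as a brand-new replica.
        replica_id = self._next_id
        self._next_id += 1
        self.connected[replica_id] = []
        return replica_id

    def disconnect(self, replica_id):
        del self.connected[replica_id]

    def send(self, replica_id, query):
        inbox = self.connected.get(replica_id)
        if inbox is None:
            # No recipient: the data is lost and the caller only sees a timeout.
            self.dropped.append((replica_id, query))
            return False
        inbox.append(query)
        return True


class ToyExecutor:
    """Hypothetical stand-in for the task executor's per-model request queue."""

    def __init__(self, rpc):
        self.rpc = rpc
        self.waiting = deque()  # replica ids captured by callbacks waiting for work

    def wait_for_query(self, replica_id):
        # The callback remembers the replica id at registration time.
        self.waiting.append(replica_id)

    def submit(self, query):
        if not self.waiting:
            return False
        replica_id = self.waiting.popleft()  # oldest waiting callback wakes up
        return self.rpc.send(replica_id, query)


def simulate():
    rpc = ToyRpc()
    executor = ToyExecutor(rpc)
    outcomes = []

    r0 = rpc.connect()
    executor.wait_for_query(r0)
    outcomes.append(executor.submit("q1"))  # routed to replica zero
    executor.wait_for_query(r0)             # replica zero waits for the next query

    rpc.disconnect(r0)                      # container is killed
    r1 = rpc.connect()                      # relaunched container becomes replica one
    executor.wait_for_query(r1)

    outcomes.append(executor.submit("q2"))  # stale callback targets replica zero
    outcomes.append(executor.submit("q2"))  # retry reaches replica one
    return outcomes, rpc.dropped


if __name__ == "__main__":
    print(simulate())
```

In this toy model the stale callback for replica zero is still first in line after the restart, so it consumes the first query sent after reconnection. That matches the observed behavior: one timeout, then normal routing on the retry.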

Jan 19 '18 00:01 Corey-Zumar

@withsmilo @simon-mo Not sure if this problem still persists? I will add it to the task first.

Jun 06 '19 04:06 rkooo567

@rkooo567 Let me check it, but I think the problem still happens.

Jun 06 '19 08:06 withsmilo

@withsmilo Thanks a lot! 👍

Jun 06 '19 20:06 rkooo567