First query fails when model reconnects

Open Corey-Zumar opened this issue 7 years ago • 3 comments

Steps to reproduce:

  1. Start Clipper locally (via bin/start_clipper.sh)
  2. Connect the no-op container by executing the container script in a Python shell
  3. Register a model associated with the no-op container, create an application, and link the model to the application
  4. Query the application, observe that queries are routed normally
  5. Restart the no-op container by killing its associated process and re-launching the container script.
  6. Send a new query to the application. Observe that it times out.
  7. Send the new query again. Observe that it is routed to the model and completes on time.

Cause:

  • After processing the first query, the initial instance of the no-op container (replica zero) initiates a callback within the C++ system that waits on the model request queue (https://github.com/ucbrise/clipper/blob/745a6f7387ceaff6e5ffd4378eb3856cbdb6084b/src/libclipper/include/clipper/task_executor.hpp#L449).

  • In the meantime, the container dies. When the container reconnects, it is treated as a new replica (replica one).

  • When a new query is sent, the callback's wait operation completes and the system attempts to send query data to the container with the replica id that was specified when the callback was initiated (replica zero).

  • This query data passes through Clipper's RPC system but is never processed by a recipient, because replica zero no longer exists. The query therefore times out.
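
The sequence above can be illustrated with a small toy model. This is only a sketch under stated assumptions, not Clipper's actual code: `ToyRpc`, `make_waiter`, and `simulate` are hypothetical names, and the model assumes that the reconnected container (replica one) registers its own waiting callback behind the stale one left by replica zero. The point it shows is that a callback which captures the replica id when it is created will address a dead replica once it finally wakes up.

```python
from collections import deque


class ToyRpc:
    """Hypothetical stand-in for an RPC layer: delivers only to connected replicas."""

    def __init__(self):
        self.inboxes = {}  # replica_id -> list of delivered queries

    def connect(self, replica_id):
        self.inboxes[replica_id] = []

    def disconnect(self, replica_id):
        self.inboxes.pop(replica_id, None)

    def send(self, replica_id, query):
        inbox = self.inboxes.get(replica_id)
        if inbox is None:
            return False  # no recipient: the query is silently dropped
        inbox.append(query)
        return True


def make_waiter(rpc, replica_id):
    """Build a callback that waits for the next query.

    The replica id is captured here, at creation time, which is the
    behavior the issue describes as the cause of the failure.
    """
    def on_query(query):
        return rpc.send(replica_id, query)
    return on_query


def simulate():
    rpc = ToyRpc()
    waiters = deque()  # callbacks waiting on the model request queue
    results = []

    # Replica zero connects and waits for work.
    rpc.connect(0)
    waiters.append(make_waiter(rpc, 0))

    # First query: delivered; replica zero then waits on the queue again.
    results.append(waiters.popleft()("q1"))
    waiters.append(make_waiter(rpc, 0))

    # The container dies and reconnects as replica one (assumed to
    # register its own waiter behind the stale one).
    rpc.disconnect(0)
    rpc.connect(1)
    waiters.append(make_waiter(rpc, 1))

    # Second query wakes the stale waiter and is addressed to replica zero.
    results.append(waiters.popleft()("q2"))
    # Third query reaches the waiter created for replica one.
    results.append(waiters.popleft()("q3"))
    return results


print(simulate())  # [True, False, True]
```

In this model the second query is lost and the third succeeds, which mirrors steps 6 and 7. One possible direction, again only as a suggestion, would be to resolve the target replica when the wait completes rather than when the callback is created, or to cancel outstanding callbacks when a replica disconnects.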

Corey-Zumar, Jan 19 '18 00:01

@withsmilo @simon-mo Not sure if this problem still persists? I will add it to the task first.

rkooo567, Jun 06 '19 04:06

@rkooo567 Let me check it, but I think the problem still happens.

withsmilo, Jun 06 '19 08:06

@withsmilo Thanks a lot! 👍

rkooo567, Jun 06 '19 20:06