deploy/runtime: use a background thread to run GC when interpreters aren't executing the forward pass

Open d4l3k opened this issue 3 years ago • 1 comments

To reduce forward-pass latency, it would be good to schedule GC to run between model executions. This won't improve QPS, since the amortized GC cost stays the same, but it would lower the per-batch latency.

import gc

gc.collect()

We should spin up a background thread that periodically iterates over all of the interpreter threads -- locking each one between executions and running the GC. It may also be worth explicitly disabling GC on the individual interpreter threads so collections never trigger during the forward pass.
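A minimal sketch of that idea, assuming per-interpreter locks (`interpreter_locks` here is a hypothetical stand-in for whatever synchronization the runtime exposes) that worker threads hold while a forward pass executes:

```python
import gc
import threading
import time

# Hypothetical stand-in: one lock per interpreter thread, held for the
# duration of a forward pass.
interpreter_locks = [threading.Lock() for _ in range(4)]

def gc_background_loop(period_s: float = 1.0) -> None:
    """Periodically acquire every interpreter lock, then collect.

    Holding all locks guarantees no interpreter is mid-forward-pass
    when the collection runs.
    """
    while True:
        time.sleep(period_s)
        for lock in interpreter_locks:
            lock.acquire()
        try:
            gc.collect()
        finally:
            for lock in reversed(interpreter_locks):
                lock.release()

def interpreter_worker(lock: threading.Lock) -> None:
    # Disable automatic GC on this thread's interpreter so collections
    # never trigger during the forward pass; only the background thread
    # collects. (In CPython gc.disable() is process-wide; in multipy each
    # sub-interpreter has its own GC state.)
    gc.disable()
    with lock:
        pass  # run the forward pass here

threading.Thread(target=gc_background_loop, daemon=True).start()
```

Grabbing every lock before collecting is the simplest correct policy; a finer-grained variant could collect per interpreter as each one goes idle, at the cost of more bookkeeping.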

Context:

https://fb.workplace.com/notes/538119557964077/

d4l3k avatar Jun 13 '22 17:06 d4l3k

FYI, we actually call gc.freeze() after loading the inference model in our online system to reduce GC latency.

reyoung avatar Jun 27 '22 03:06 reyoung
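For reference, the `gc.freeze()` pattern mentioned above looks roughly like this (the model object here is a hypothetical stand-in for an actual loaded inference model):

```python
import gc

# Stand-in for loading the inference model; in a real service this would
# be the fully initialized model object.
model = {"weights": [0.0] * 1000}

gc.collect()  # first, collect the garbage produced during loading
gc.freeze()   # move all surviving objects into the permanent generation

# Frozen objects are skipped by future collections, so per-collection
# latency drops when most live objects are long-lived model state.
```

`gc.freeze()` (Python 3.7+) is complementary to the background-collection idea: freezing shrinks each collection, while the background thread controls when collections happen.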