[DO NOT MERGE] Python binding / wrapper look and feel
`FIXME` marks the sections where further discussion is desired.
The wrapper is restricted in that much of the interaction with the in-process API is pre-defined (i.e. how released `TRITONSERVER_Request` and `TRITONSERVER_Response` objects are handled). The intention is that the user shouldn't need to manage object lifecycles explicitly. A lower-level binding will also be provided for users who want finer control (e.g. reusing underlying objects to avoid alloc/release overhead).
At the end of `_server.py` is a basic example usage of the wrapper. At the end of `_infer.py` is an example implementation of an allocator.
I was thinking about extracting the model interface as well, but I didn't, because it may reinforce the misconception that Triton model operations are performed with respect to the current object with the specified name/version. That can be confusing to the user in the scenario below:
```
model_0 = Model("simple", <vision model>)
server.load(model_0)
.... # some later time
model_1 = Model("simple", <language model>)
server.load(model_1)
```
where the user may expect `model_0` to be unchanged, since it appears to be a separate entity, when in fact any API usage against `model_0` will now point to the language model.
However, I do agree that the API can be condensed to make preparing a model for loading easier, which is the same as what you have proposed, but the name will be something like `ModelStorage` instead of `Model`, to indicate that this is the static representation of the model. The runtime model API will still have to go through a Triton object.
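To make the "static representation" idea concrete, here is a minimal sketch of what such a `ModelStorage` could look like. The class name comes from the discussion above, but the fields and the `path` parameter are assumptions for illustration; it deliberately performs no Triton operations itself.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class ModelStorage:
    """Static description of a model: where it lives and what it is called.

    This object does nothing by itself; runtime operations (load, infer,
    unload) still go through a Triton server object, e.g.:
        server.load(ModelStorage("simple", "/models/simple"))
    """
    name: str
    path: str  # hypothetical: location of the model repository entry
```

Keeping `ModelStorage` immutable and inert avoids the confusion described above: it never refers to "the currently loaded model", only to the on-disk artifact.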
That being said, I still want to explore the possibility of providing the model abstraction. I think in that case the wrapper will need to encapsulate Triton model management and impose the limitation/assumption that models must be managed through the wrapper (so it can track all model changes). For example, the load API would then return a model handle that is just the name of the loaded model, plus an additional "valid" attribute that hints to the user that the model may have been changed and lets them decide what to do with the "stale" model handle:
```
model_0 = server.load(ModelStore("simple", <vision model>))
assert(model_0.valid)
.... # some later time
model_1 = server.load(ModelStore("simple", <language model>))
assert(not model_0.valid)
```
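A sketch of how the wrapper could implement that invalidation, assuming (as proposed above) that all loads go through the wrapper so it sees every change; the class and method names here are illustrative, not the actual binding API:

```python
class ModelHandle:
    """Hypothetical handle returned by load: a name plus a 'valid' flag
    that the wrapper flips when the same name is loaded again."""

    def __init__(self, name):
        self.name = name
        self.valid = True


class Server:
    """Minimal sketch: tracks the handle issued for each model name so a
    reload under the same name marks the previous handle as stale."""

    def __init__(self):
        self._handles = {}

    def load(self, name):
        old = self._handles.get(name)
        if old is not None:
            old.valid = False  # the old handle now points to a changed model
        handle = ModelHandle(name)
        self._handles[name] = handle
        return handle
```

This only works because the wrapper is the single entry point for model management; a load performed outside the wrapper would leave stale handles marked valid.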
I don't see the necessity of providing a synchronized infer API; the user can read from the response iterator after `async_infer` returns, and the read will block until a response is ready.
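The blocking-iterator behavior described here can be sketched with a thread-safe queue; this is an assumed shape for the response iterator, not the actual binding, with the delivery callback simulated by a worker thread:

```python
import queue
import threading

class ResponseIterator:
    """Hypothetical response iterator: iteration blocks until the next
    response (or an end-of-stream sentinel) arrives from a callback."""

    _END = object()

    def __init__(self):
        self._queue = queue.Queue()

    def _deliver(self, response):
        # Would be invoked from the server's response callback thread.
        self._queue.put(response)

    def _complete(self):
        self._queue.put(self._END)

    def __iter__(self):
        return self

    def __next__(self):
        item = self._queue.get()  # blocks until a response is ready
        if item is self._END:
            raise StopIteration
        return item

# Simulate async_infer returning immediately while a worker delivers later.
it = ResponseIterator()
threading.Thread(target=lambda: (it._deliver("response"), it._complete())).start()
results = list(it)  # blocks here until the response arrives
```

So a separate synchronous API adds little: the consumer simply blocks on the iterator whenever it wants synchronous behavior.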
I think that's a valid argument for the C API, but for Python we need to integrate that with asyncio. Also, I think it is not just the `infer` API; we probably need `async` versions of all the time-consuming Triton APIs (e.g. load/unload). I'm just wondering whether there is a way to avoid ending up with two async infer APIs (one callback-based and one asyncio-based). We could provide only an asyncio-based async infer API, but that would make sync code a bit awkward. For example, they'd have to run `asyncio.run(server.async_infer)`.
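One way to avoid two parallel APIs is to bridge the callback-based call into asyncio with a future, so only the asyncio version is exposed. A sketch under assumed names (`async_infer` below is a stand-in for the callback-based API, not the real binding):

```python
import asyncio
import threading

def async_infer(request, callback):
    """Stand-in for a callback-based async infer: returns immediately and
    invokes `callback` with the result from another thread."""
    threading.Thread(target=lambda: callback(f"result:{request}")).start()

async def infer(request):
    """Bridge the callback API into asyncio so only one async API exists."""
    loop = asyncio.get_running_loop()
    future = loop.create_future()
    # The callback may fire on a Triton-owned thread, so hop back onto the
    # event loop before resolving the future.
    async_infer(request, lambda r: loop.call_soon_threadsafe(future.set_result, r))
    return await future

# Sync callers pay the awkward-but-workable asyncio.run cost:
result = asyncio.run(infer("simple"))
```

The `call_soon_threadsafe` hop is the key detail: resolving the future directly from a foreign thread is not safe in asyncio.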
Regarding having multiple models with the same name, I think we can document that all models with the same name point to the same underlying object, or we could error out if the user tries to load another model with the same name.
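The stricter "error out" alternative could look like the following sketch (class and method names are hypothetical), which refuses to silently replace a loaded model:

```python
class ModelRegistry:
    """Sketch of the stricter policy: loading a second model under an
    already-used name raises instead of replacing the existing one."""

    def __init__(self):
        self._models = {}

    def load(self, name, model):
        if name in self._models:
            raise ValueError(
                f"model '{name}' is already loaded; unload it first")
        self._models[name] = model

    def unload(self, name):
        del self._models[name]
```

This trades flexibility for safety: the stale-handle problem above disappears because a name can never be rebound while it is in use.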