Batch Configuration In Model Archive
Follow up to https://github.com/awslabs/mxnet-model-server/issues/732#issuecomment-470214725 and https://github.com/awslabs/mxnet-model-server/issues/732#issuecomment-470224017
Short Background
Currently, there are 2 ways to register models:
- POSTing the model to the management API
- Writing the model to the model repository on disk (say, `/opt/models`)
Problem
Some model settings, such as batching configuration, are only configurable through the API call but not through the disk repository.
Disk Repository Use Case
Containerized applications should optimally be stateless. This makes them easier to operate and orchestrate. When we manually register models through the management API, the container is no longer stateless, as it requires further configuration upon booting. If the container were to crash, it would boot into an unusable state and we would have to reconfigure it (every model has to be POSTed again with its configuration).
The problem becomes more severe when using an orchestrator, such as Kubernetes, which creates, relocates and evicts containers on the fly. The created containers would boot into an "unhealthy" state every time until configured.
In contrast, when a container reads the models from a predetermined location (the disk repository), it boots directly into the required state without any further, manual configuration.
Proposition
Having both the management API and the model repository is great. However, some parameters (most notably, the batching configuration) are only available through the management API. It would be great if we could bring the two loading methods to feature parity to support use cases similar to the one I described above.
Embedding the additional model settings in the model archive seems like an interesting option, as it will make the settings available to MMS when the model is loaded from the disk.
@vdantu Follow up to https://github.com/awslabs/mxnet-model-server/issues/732#issuecomment-472519392
I think the model section in the model manifest is a good fit for the extra configuration. I'll be happy to open a PR adding the missing batching options.
The way I see it:
- Add optional `--max-batch-size` and `--max-batch-delay` options to `model-archiver`.
- Parse the new attributes in MMS.
- Apply the settings to `Model.java` (rough sketch below).
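If it helps make the last step concrete, here is a rough sketch of how `Model.java` could pick up archived batching values; the `Manifest` accessors and the default values are assumptions for illustration, not existing MMS code:

```java
// Hypothetical manifest accessors; the real parsing would live in model-archiver / the MMS frontend.
interface Manifest {
    Integer getMaxBatchSize();  // null if --max-batch-size was not given at archive time
    Integer getMaxBatchDelay(); // null if --max-batch-delay was not given at archive time
}

public class Model {
    private static final int DEFAULT_BATCH_SIZE = 1;
    private static final int DEFAULT_MAX_BATCH_DELAY_MILLIS = 100;

    private final int batchSize;
    private final int maxBatchDelay;

    public Model(Manifest manifest) {
        // Prefer values embedded in the archive, otherwise fall back to server defaults.
        Integer archivedSize = manifest.getMaxBatchSize();
        Integer archivedDelay = manifest.getMaxBatchDelay();
        this.batchSize = archivedSize != null ? archivedSize : DEFAULT_BATCH_SIZE;
        this.maxBatchDelay = archivedDelay != null ? archivedDelay : DEFAULT_MAX_BATCH_DELAY_MILLIS;
    }

    public int getBatchSize() {
        return batchSize;
    }

    public int getMaxBatchDelay() {
        return maxBatchDelay;
    }
}
```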
One thing remains to be decided - if a model is POSTed to the MMS API with different batch settings than the ones embedded in it, should we use the provided settings or the embedded ones? Should we warn the user in that case in one of the logs?
@erandagan : The proposal looks good. My only concern now is how large the model-archiver CLI is going to get :)
With regards to MGMT API vs. model-archive settings: if the batch-size given through the MGMT API is less than or equal to the model-archive setting, then it should be used. Else we should probably throw a registration failure. It seems like a decent assumption that the max-batch-size given in the model archive is the max batch size that can be handled by the custom service code.
> If the batch-size given through MGMT API is less than or equal to the model-archive settings, then it should be used
Wouldn't that create a double meaning for the batch size argument?
- When `max-batch-size` is provided through the model archive:
  - Serves as the batch size
  - Serves as the max batch size
- When `max-batch-size` is provided through the API:
  - Only serves as the batch size
I'm not entirely sure that's what users will expect when they configure the model through the archive rather than the API: the configuration keys are the same, yet they behave differently.
Another point to consider: if we do impose limits on the API based on the archive settings, should we do the same for the max batch delay? I believe that might be counter-intuitive.
Hey @vdantu, just pinging so we can continue the discussion and decide on an approach :)
@erandagan : Really sorry for the delay in responding. We will try to be more interactive going forward.
We will work on a better way to go through the design process.
Regarding your point, I see this as an order of priority for this configuration value. "max_batch_size" and "max_batch_delay" can come in one of the following two ways:
- Management API - Already exists (Higher priority as it is runtime)
- Model Archive - New (Lower priority as it is static and comes in at model creation time)
I would think the value coming in through the Management API should take precedence over the value coming through the model archive.
For example: let's say we have a model which is written to handle a max batch size of 32. Now, while in service, if the service owner finds that only about one request per second is coming in for this model and waiting for 32 requests is not ideal, wouldn't it be better to have an option to override the batch size through the Management API? If we don't, then the service owners have to repackage the model with a newer configuration in the MANIFEST file, which is a more drawn-out process IMO.
Do share your thoughts on this. With regards to your question, we should have the same behavior for all the options that overlap between the model archive and the Management API.
I completely agree on having the ability to override the archive's packaged values through the Management API.
My concern was related to setting an upper cap to the API's values based on the Model Archive values:
> If the batch-size given through MGMT API is less than or equal to the model-archive settings, then it should be used. Else we should probably throw registration failure.
I believe such a constraint may lead to operational difficulties (e.g. you would have to repackage the model if you want to raise the batch size above the previously packaged maximum).
I think the best approach here would be to make the archive batch size the "default" batch size, while allowing users to override it through the API without any constraints. Reading your latest comment, I think we're on the same page on using that approach.
Let me know what you think!
@erandagan : I see your point. I agree with you. I don't think making assumptions on the customers' behalf is a good idea. It may be better to have a simple order of precedence.
- Priority 1: Config coming through the Management API, since it's runtime.
- Priority 2: Config coming through the MANIFEST file; this is model specific.
- Priority 3: Config coming through the global config file; this is generic, server level.
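A minimal sketch of that order of precedence using plain `java.util.Properties`; the class, method, and property names here are illustrative only, not existing MMS code:

```java
import java.util.Properties;

// Illustrative only: later putAll calls win, so Management API values override
// the manifest, which overrides the server-level config file.
public class BatchConfigPrecedence {

    public static Properties resolve(Properties globalConfig,
                                     Properties manifestConfig,
                                     Properties managementApiConfig) {
        Properties resolved = new Properties();
        resolved.putAll(globalConfig);        // Priority 3: config.properties
        resolved.putAll(manifestConfig);      // Priority 2: MANIFEST in the archive
        resolved.putAll(managementApiConfig); // Priority 1: runtime Management API
        return resolved;
    }

    public static void main(String[] args) {
        Properties global = new Properties();
        global.setProperty("max_batch_size", "1");

        Properties manifest = new Properties();
        manifest.setProperty("max_batch_size", "32");

        Properties api = new Properties();
        api.setProperty("max_batch_size", "8");

        // Prints 8: the Management API value wins over the manifest and the global config.
        System.out.println(resolve(global, manifest, api).getProperty("max_batch_size"));
    }
}
```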
This seems like a good design.
Sounds great, I'll start working on a PR next week.
#778 is ready for review
The discussion involves three different use cases:
- JVM crash (I think this is rare) and recovery. This is a relatively simple case; we can do the same thing Tomcat does to recover HTTP sessions: serialize the model registrations to disk and recover them at startup (a rough sketch of such a snapshot follows this list). But we have to take care of:
  - How to do a clean startup if the user doesn't want to recover.
  - How to differentiate a clean shutdown from a JVM crash.
  - Whether we should clean up registrations on a clean shutdown.
- How to configure per-model MMS serving settings (batch size, etc.) at startup. The correct place is `config.properties`; what we need to figure out is a nice syntax that allows per-model configuration.
- Model-specific configuration (MANIFEST.json in the model archive file). In the context of this issue, I don't think batch size is a good candidate. The main design goals of the model archive are:
  - Portability: a model archive should be able to run on any compatible serving platform.
  - A model archive should NOT be hosting specific: the right batch size really depends on traffic and the hosting machine, and the model developer won't be able to define this value. It can only be decided while hosting the model.
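For the first use case, here is a minimal sketch of what such a registration snapshot could look like; the file name, line format, and class below are hypothetical, not existing MMS code:

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.ArrayList;
import java.util.List;

// Hypothetical sketch: persist registrations as they happen, reload them at startup,
// and delete the snapshot on a clean shutdown so a leftover file signals a crash.
public class RegistrationSnapshot {

    private static final Path SNAPSHOT = Paths.get("model-registrations.snapshot");

    // One line per model, e.g. "<modelName>,<modelUrl>,<batchSize>,<maxBatchDelay>".
    public static void record(List<String> registrationLines) throws IOException {
        Files.write(SNAPSHOT, registrationLines);
    }

    // Called at startup: an existing snapshot means the previous run did not shut
    // down cleanly, so re-register everything listed in it.
    public static List<String> recover() throws IOException {
        if (!Files.exists(SNAPSHOT)) {
            return new ArrayList<>(); // clean startup, nothing to recover
        }
        return Files.readAllLines(SNAPSHOT);
    }

    // Called on a clean shutdown so the next start is a clean one.
    public static void clear() throws IOException {
        Files.deleteIfExists(SNAPSHOT);
    }
}
```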
I think we should start with use cases. I can think of the following:
- Customers don't necessarily own the hosting platform, for example customers coming through SageMaker or other managed hosting solutions.
- Customers may not have proper access to the Management API. There isn't a good solution for customers using ECS EC2/Fargate to reach the management port unless they build a complex system around it.
I think point 1 you mentioned is about the MMS audit feature. Assuming that it is, I am not sure how it relates to this discussion; please feel free to correct me if my understanding is wrong. I think the discussion we would like to have is about the best place for "model specific" configuration.
I think points 2 and 3 that you make are in that regard, but I have a few concerns about them.
- I feel like you are assuming that service owners know what models are loaded at startup and what the behavior of each model should be. This isn't always true. The service owner might be a platform like SageMaker or another managed platform. In that case, the model owners could be the best people to determine the model's batch size. Depending on the latency they can withstand, customers providing the models can configure the batch size and max wait time in the manifest, even if the managed solution doesn't provide an interface for this. This can be very useful for customers using GPU hosts for inference.
- Assuming customers want to register 100 (just some number which looks big) models at runtime, this would just increase the size of config.properties and can be error prone. So the ideal solution would be to have a model-specific configuration file either outside or inside the model itself. Since the model already has the MANIFEST, which describes the runtime etc. of the model and is read by the MMS frontend, it seems like a good place for model-specific information.
I don't think this feature should be restricted to "batch configuration". I think we should also move the "request/response sizes", "response timeouts", "default number of workers", etc. into this configuration. It puts complete control over how many workers to spin up by default in the hands of customers. I foresee this feature working as follows:
    import java.util.Properties;

    public class Model {

        // This is the configuration that needs to be used for the Model.
        private Properties configuration;

        public Model() {
            configuration = getConfig();
        }

        public Properties getConfig() {
            // Get the properties from ConfigManager, override them with the values from
            // the MANIFEST file, then override with the configuration coming from the endpoint.
            Properties config = new Properties();
            // ... merge the three sources here, in that order ...
            return config;
        }
    }
From model-archive's point of view, it is still a portable archive and I don't see how this is hosting specific. Please feel free to point out anything that I might be missing.
NOTE: If no MANIFEST is provided, or no configuration is provided in the MANIFEST, this will just fall back to the configuration coming from config.properties. So this isn't a one-way door.
Using `config.properties` sounds like an interesting idea for statically configuring models without using the archive, with the advantage of zero repackaging and, with #781, the ability to configure models by changing environment variables (which works well for Dockerized apps).
I imagine the syntax (for `config.properties`) would be something similar to:
    model_name.max_batch_size = 30
    model_name.batch_delay_millis = 100
    model_name.timeout_millis = 1000
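A rough sketch of how the frontend could group such prefixed keys into per-model settings; the class and key names below are illustrative, not an existing MMS API:

```java
import java.util.HashMap;
import java.util.Map;
import java.util.Properties;

// Illustrative only: split "model_name.some_key = value" entries from
// config.properties into one Properties object per model.
public class PerModelConfig {

    public static Map<String, Properties> byModel(Properties serverConfig) {
        Map<String, Properties> perModel = new HashMap<>();
        for (String key : serverConfig.stringPropertyNames()) {
            int dot = key.indexOf('.');
            if (dot <= 0) {
                continue; // not a per-model key, e.g. a plain server-level setting
            }
            String modelName = key.substring(0, dot);
            String setting = key.substring(dot + 1); // e.g. "max_batch_size"
            perModel.computeIfAbsent(modelName, n -> new Properties())
                    .setProperty(setting, serverConfig.getProperty(key));
        }
        return perModel;
    }
}
```

For the three example keys above, this would yield a single `model_name` entry carrying `max_batch_size`, `batch_delay_millis`, and `timeout_millis`.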
As @vdantu pointed out, the disadvantage is that this may lead to very large config files. I wonder what the typical number of models users put on a single server is, and whether that's a common use case.
I think it's important to keep in mind that the goal here is to have repeatable deployments that require no configuration upon startup (not necessarily after a crash - it could be more servers spinning up or the MMS workload being migrated between servers).
Bumping this - I'd be happy if we could continue the discussion and find an approach that works well for the majority of use cases.
I am convinced that the MANIFEST is the right place to put this configuration. I am gathering other customer use cases to back this feature.