ml-commons
[FEATURE] Support the deployment of Small Language Models
This is a feature request to support deploying Small Language Models (SLMs) in the 1B-3B parameter range. SLMs are improving quickly and are becoming a good choice for narrow-scope use cases.
Examples include TinyLlama, MiniChat, Phi-2, and several others that are gaining popularity. I expect SLMs to become more popular than LLMs in the next few years, especially since they are cheap to fine-tune, so it would be good to have them as part of the OpenSearch ecosystem.
Eventually, I expect these models to be called through the _predict API of ml-commons to generate the number of tokens specified in the request body, as shown below:
_predict
{
  "prompt": "what is OpenSearch",
  "model_id": "some id",
  "num_of_tokens": 4
}
Response:
{ "response": "OpenSearch is an open" }
I also eventually expect this to be integrated with the neural-search plugin, if it is not already.
@asfoorial Thanks for cutting this issue. We are thinking about something similar to support more models. For your use case, do you see any concern with deploying the model outside of the OpenSearch cluster and running it as an externally hosted model?
That is one possible option; however, OpenSearch seems to be limited to only the OpenAI and Amazon LLM services. Hosting local models inside OpenSearch would be more suitable for small models.
OpenSearch seems to be limited to only OpenAI and Amazon LLM services.
Actually, that's not the case. The connector framework can support any externally hosted model that exposes an HTTP interface, for example the Cohere, Aleph Alpha, and Hugging Face models.
Hosting local models inside of OpenSearch would be more suitable for small models.
Could you provide further elaboration on this?
I might want to create my own on-prem LLM service and host it on servers in my local network. How can I connect it to OpenSearch?
I have a use case as well that involves an externally hosted model that is self-hosted and located within a private network (or, more simply, another use case is an API gateway with a private IP address). However, it seems there is a hard-coded requirement that externally hosted models cannot have a private IP address: https://github.com/opensearch-project/ml-commons/blob/0903d5da4bc9fb8051621de05759dbdd36613972/ml-algorithms/src/main/java/org/opensearch/ml/engine/httpclient/MLHttpClientFactory.java#L79
This seems like an arbitrary check, which I think should either be removed or made configurable.
I encountered the same issue as well. I think the hard-coded part should be a configuration we can control.
@JohnUiterwyk I think that's a fair ask. Could you please cut an issue for this? Thanks
Thanks @dhrubo-os, I've raised an issue for that request specifically: https://github.com/opensearch-project/ml-commons/issues/2142
Just to be more specific, I am talking about quantized GGUF models in the 3B-or-smaller range. Examples: llmware/slim-sql-tool and GeneZC/MiniChat-2-3B.
They are good for their size and suitable for lightweight tasks.
What is the approximate size of a 3B SLM? Is it at the 1 GB level?
It can be more. The vanilla MiniChat on Hugging Face is 6 GB, but I expect this feature to handle quantized models in GGUF format, which also exceed 1 GB, typically in the range of 1-3 GB depending on the level of quantization.
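As a rough back-of-the-envelope check: a 3B-parameter model at 4-bit quantization needs about 3 × 10⁹ parameters × 0.5 bytes ≈ 1.5 GB for the weights alone, and roughly 3 GB at 8-bit, which lines up with the 1-3 GB range above; inference adds some extra memory on top for the KV cache.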
@ylwu-amzn, how can I connect my self-hosted LLMs to OpenSearch? Glad to hear from you ASAP. Thanks.
One possible solution is to create a load balancer in front of your self-hosted LLM server; then you can use the load balancer URL to create a remote model. Let me know if you have any problems with this solution.
Hi @ylwu-amzn, please consider that exposing a self-hosted LLM server to the public internet via an internet-facing load balancer is not a workable solution, given the security risks this would introduce. In many organisations, deploying such a load balancer would violate internal security guidelines and would likely be blocked by service control policies.
In addition to the request raised separately to remove the private IP restriction for remote models, I think the request here is to support serving small language models for text completion from OpenSearch, much the way embedding models can be hosted by the cluster.
One example use case would be a multistage ingest pipeline that first uses a small language model to generate example questions for a given document, and then uses an embedding model to create embeddings for those questions, which would produce a higher-scoring match against a user's query.
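Purely as an illustration of that idea (not a tested configuration), such a pipeline might pair the ml_inference ingest processor for the SLM step with the neural-search text_embedding processor; the field names, model IDs, and the exact input_map/output_map shape below are assumptions:

PUT _ingest/pipeline/question-augmentation-pipeline
{
  "description": "Sketch: generate example questions with an SLM, then embed them",
  "processors": [
    {
      "ml_inference": {
        "model_id": "<slm_model_id>",
        "input_map": [
          { "prompt": "document_text" }
        ],
        "output_map": [
          { "generated_questions": "response" }
        ]
      }
    },
    {
      "text_embedding": {
        "model_id": "<embedding_model_id>",
        "field_map": {
          "generated_questions": "question_embedding"
        }
      }
    }
  ]
}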
@JohnUiterwyk would you be open to producing a PR for this?
@ylwu-amzn, how can I connect my self-hosted LLMs to OpenSearch? Glad to hear from you ASAP. Thanks.
One possible solution is to create a load balancer in front of your self-hosted LLM server; then you can use the load balancer URL to create a remote model. Let me know if you have any problems with this solution.
Thanks for your reply. I've finished all the steps for connecting my self-hosted LLM except the _predict one. The error message is listed below:
{
  "error": {
    "root_cause": [
      {
        "type": "illegal_argument_exception",
        "reason": "192.168.31.22"
      }
    ],
    "type": "illegal_argument_exception",
    "reason": "192.168.31.22"
  },
  "status": 400
}
So, what should I do to make it work? And why do I get this illegal_argument_exception?
Thank you again for your kindness.
I've found the reason (I'm using a private IP address). Should the hasPrivateIpAddress check be a configuration we can control?
@MonkeyKing-KK Did you find a workaround for the private address restriction (which I find ridiculous)?
Does anyone know why it was limited this way?
Thanks
@whittssg It's this code that prevents the use of private IPs: https://github.com/opensearch-project/ml-commons/blob/main/ml-algorithms/src/main/java/org/opensearch/ml/engine/httpclient/MLHttpClientFactory.java#L76
This is mainly a security concern. For example, if you have the OpenSearch service and another private internal service running on the same server, and someone creates a connector pointing at your private service, they can then call your private service.
I suggest creating a model service with an HTTP endpoint and authentication, then creating a connector to that endpoint.
@ylwu-amzn Thanks for the reply. I am just curious: if we are running both the OpenSearch instance and the LLM locally, why is there a security concern (I can understand it for the hosted case)?
Is there some documentation for your suggestion:
"I suggest creating a model service with an HTTP endpoint and authentication, then creating a connector to that endpoint"?
Thank you.
@whittssg, I haven't verified this idea, but it should theoretically work if you have a DNS name or a load balancer set up in front of your model service. As long as you don't use a direct local IP, it should be OK.
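A rough sketch of what such a connector could look like (the endpoint, credential, and request/response fields below are placeholders for your own service rather than a verified blueprint):

POST /_plugins/_ml/connectors/_create
{
  "name": "self-hosted SLM connector",
  "description": "Connector to an internal text-generation service behind a DNS name",
  "version": 1,
  "protocol": "http",
  "credential": {
    "api_key": "<your_api_key>"
  },
  "parameters": {
    "endpoint": "llm.internal.example.com"
  },
  "actions": [
    {
      "action_type": "predict",
      "method": "POST",
      "url": "https://${parameters.endpoint}/v1/completions",
      "headers": {
        "Authorization": "Bearer ${credential.api_key}"
      },
      "request_body": "{ \"prompt\": \"${parameters.prompt}\", \"max_tokens\": ${parameters.max_tokens} }"
    }
  ]
}

A remote model registered with this connector can then be called through _predict, with OpenSearch forwarding the request to the internal endpoint instead of a public provider.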
The local IP is expected to be in the same private network but on a different server. How can we connect to it? I don't see any security concern in that, especially since the OpenSearch admin is the one configuring it, so he/she knows what he/she is doing.
We had a discussion with the security folks, and they are OK with adding a setting for allowing private IPs, so users can control whether to enable it or not. The setting will be disabled by default, and users can enable it if they need it. That should solve the problem.
@ylwu-amzn Glad to hear that. BTW, when will this setting be available to users?
@MonkeyKing-KK, it will be in 2.15: https://opensearch.org/releases.html
+1 on this! Looking forward to being able to host my own OpenAI-compatible models and use them with this.
@ylwu-amzn I haven't seen the changes, and 2.15 is planned to be released this week. Any update?
Just to add to all of this: there is currently a very suitable SLM in ONNX format, which is Phi-3 (https://huggingface.co/microsoft/Phi-3-mini-128k-instruct-onnx and https://github.com/microsoft/onnxruntime-genai/blob/main/examples/python/phi-3-tutorial.md). It runs at a very decent speed on my own CPU-only, five-year-old laptop, and the accuracy is quite reasonable. I highly recommend supporting hosting it and integrating it natively within OpenSearch. I even recommend defining OpenSearch-specific prompts to perform certain tasks. I remember there were talks in the community about hosting NER models. I believe that this, with a few predefined prompts, will do the job very well. Supporting such a model would give OpenSearch an edge over other open-source alternatives.
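For anyone evaluating this, the generation loop from the linked phi-3 tutorial looks roughly like the sketch below (the model path is an assumption, and the onnxruntime-genai API has been changing between versions, so treat this as approximate):

import onnxruntime_genai as og

# Path to a downloaded ONNX variant of Phi-3 mini (assumed folder name)
model = og.Model("cpu_and_mobile/cpu-int4-rtn-block-32-acc-level-4")
tokenizer = og.Tokenizer(model)
tokenizer_stream = tokenizer.create_stream()

# Phi-3 chat template from the tutorial
prompt = "<|user|>\nwhat is OpenSearch <|end|>\n<|assistant|>"
input_tokens = tokenizer.encode(prompt)

params = og.GeneratorParams(model)
params.set_search_options(max_length=256)
params.input_ids = input_tokens

# Stream generated tokens one at a time
generator = og.Generator(model, params)
while not generator.is_done():
    generator.compute_logits()
    generator.generate_next_token()
    new_token = generator.get_next_tokens()[0]
    print(tokenizer_stream.decode(new_token), end="", flush=True)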
@ylwu-amzn I haven't seen the changes, and 2.15 is planned to be released this week. Any update?
The PR is out https://github.com/opensearch-project/ml-commons/pull/2534
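For anyone wanting to try it once it ships, a minimal sketch of how the new setting would likely be enabled (assuming it is exposed as plugins.ml_commons.connector.private_ip_enabled; check the PR for the final name and default):

PUT _cluster/settings
{
  "persistent": {
    "plugins.ml_commons.connector.private_ip_enabled": true
  }
}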
Just to add to all of this: there is currently a very suitable SLM in ONNX format, which is Phi-3 (https://huggingface.co/microsoft/Phi-3-mini-128k-instruct-onnx and https://github.com/microsoft/onnxruntime-genai/blob/main/examples/python/phi-3-tutorial.md). It runs at a very decent speed on my own CPU-only, five-year-old laptop, and the accuracy is quite reasonable. I highly recommend supporting hosting it and integrating it natively within OpenSearch. I even recommend defining OpenSearch-specific prompts to perform certain tasks. I remember there were talks in the community about hosting NER models. I believe that this, with a few predefined prompts, will do the job very well. Supporting such a model would give OpenSearch an edge over other open-source alternatives.
@asfoorial Thanks a lot for sharing this. Sounds doable. @jngz-es Can you do some research?