
[FEATURE] Support the deployment of Small Language Models

Open asfoorial opened this issue 1 year ago • 29 comments

This is a feature request to support deploying Small Language Models (SLMs), e.g. in the 1B–3B range. SLMs are improving quickly and are becoming a good choice for narrow-scope use cases.

Examples include TinyLlama, MiniChat, Phi-2, and several others that are getting popular. I expect SLMs to be more popular than LLMs in the next few years, especially since they are cheap to fine-tune. So it would be good to have them as part of the OpenSearch ecosystem.

Eventually, I expect these models to be callable with the _predict API of ml-commons to generate a number of tokens specified in the body, as shown below:

_predict { "prompt":"what is OpenSearch", "model_id":"some id", "num_of_tokens":4 }

Response:

{ "response": "OpenSearch is an open" }

I also eventually expect this to be integrated with neural-search plugin, if not already.

asfoorial avatar Feb 17 '24 18:02 asfoorial
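For illustration, the proposed request shape above could be built and sent like this. This is a sketch of the *proposed* API only — the `_predict` body fields (`prompt`, `num_of_tokens`) are the feature request's suggestion, not an existing ml-commons contract:

```python
import json

def build_predict_request(prompt: str, model_id: str, num_of_tokens: int) -> dict:
    """Build the hypothetical _predict payload suggested in this feature request."""
    return {
        "prompt": prompt,
        "model_id": model_id,
        "num_of_tokens": num_of_tokens,
    }

body = build_predict_request("what is OpenSearch", "some id", 4)
# A client would POST this JSON to the cluster's _predict endpoint for the model.
print(json.dumps(body))
```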

@asfoorial Thanks for cutting this issue. We are thinking about something similar to support more models. For your use case, do you see any concern with deploying the model outside of the OpenSearch cluster and running it as an externally hosted model?

ylwu-amzn avatar Feb 18 '24 18:02 ylwu-amzn

That is one possible option; however, OpenSearch seems to be limited to only OpenAI and Amazon LLM services. Hosting local models inside OpenSearch would be more suitable for small models.

asfoorial avatar Feb 18 '24 19:02 asfoorial

OpenSearch seems to be limited to only OpenAI and Amazon LLM services.

Actually, that's not the case: the connector framework can support any externally hosted model that exposes an HTTP interface — for example, Cohere, Aleph Alpha, and Hugging Face models.

Hosting local models inside of OpenSearch would be more suitable for small models.

Could you provide further elaboration on this?

ylwu-amzn avatar Feb 19 '24 00:02 ylwu-amzn

I might want to create my own on-prem LLM service and host it on servers in my local network. How can I connect it to OpenSearch?

asfoorial avatar Feb 19 '24 04:02 asfoorial

I have a use case as well that involves an externally hosted model that is self-hosted and located within a private network (or, more simply, an API gateway that has a private IP address). However, there seems to be a hard-coded requirement that externally hosted models cannot have a private IP address: https://github.com/opensearch-project/ml-commons/blob/0903d5da4bc9fb8051621de05759dbdd36613972/ml-algorithms/src/main/java/org/opensearch/ml/engine/httpclient/MLHttpClientFactory.java#L79

This seems like an arbitrary check, which I think should either be removed or made configurable.

JohnUiterwyk avatar Feb 20 '24 11:02 JohnUiterwyk
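For illustration, the kind of private-address check being discussed can be reproduced with Python's stdlib `ipaddress` module. This is a sketch of the general idea only, not the actual Java implementation linked above (which also resolves hostnames before checking):

```python
import ipaddress

def has_private_ip(host: str) -> bool:
    """Return True if the host string parses as a private (RFC 1918) address."""
    try:
        return ipaddress.ip_address(host).is_private
    except ValueError:
        # Not a literal IP (e.g. a hostname); a real check would resolve it first.
        return False

print(has_private_ip("192.168.31.22"))  # True — this is why such a connector is rejected
print(has_private_ip("8.8.8.8"))        # False — a public address passes
```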

I have a use case as well that involves an externally hosted model that is self-hosted and located within a private network (or, more simply, an API gateway that has a private IP address). However, there seems to be a hard-coded requirement that externally hosted models cannot have a private IP address:

https://github.com/opensearch-project/ml-commons/blob/0903d5da4bc9fb8051621de05759dbdd36613972/ml-algorithms/src/main/java/org/opensearch/ml/engine/httpclient/MLHttpClientFactory.java#L79

This seems like an arbitrary check, which I think should either be removed or made configurable.

I encountered the same issue. I think the hard-coded check should be a configuration we can control.

asfoorial avatar Feb 20 '24 11:02 asfoorial

@JohnUiterwyk I think that's a fair ask. Could you please cut an issue for this? Thanks

dhrubo-os avatar Feb 20 '24 18:02 dhrubo-os

Thanks @dhrubo-os, I've raised an issue for that request specifically: https://github.com/opensearch-project/ml-commons/issues/2142

JohnUiterwyk avatar Feb 21 '24 05:02 JohnUiterwyk

Just to be more specific, I am talking about quantized GGUF models in the 3B-or-smaller range — for example, llmware/slim-sql-tool and GeneZC/MiniChat-2-3B.

They are good for their size and suitable for lightweight tasks.

asfoorial avatar Feb 21 '24 16:02 asfoorial

What is the approximate size of a 3B SLM? Is it at the 1GB level?

Zhangxunmt avatar Feb 27 '24 19:02 Zhangxunmt

It can be more. Vanilla MiniChat on Hugging Face is 6GB. But I expect this feature to handle quantized models in GGUF format, which also exceed 1GB — typically in the 1–3GB range, depending on the level of quantization.

asfoorial avatar Feb 27 '24 19:02 asfoorial

@ylwu-amzn , how can I connect my self-hosted LLMs to OpenSearch? glad to hear from you asap. thx.

MonkeyKing-KK avatar Feb 28 '24 15:02 MonkeyKing-KK

@ylwu-amzn , how can I connect my self-hosted LLMs to OpenSearch? glad to hear from you asap. thx.

One possible solution is to create a load balancer in front of your self-hosted LLM server; you can then use the load balancer URL to create a remote model. Let me know if you have any problems with this solution.

ylwu-amzn avatar Feb 28 '24 17:02 ylwu-amzn
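The load-balancer approach above would be wired up through the ml-commons connector API. A hedged sketch of what the connector-creation body might look like — the endpoint hostname (`llm.internal.example.com`) and the `/v1/completions` path are assumptions for a generic OpenAI-compatible server, not values from this thread:

```python
import json

# Hypothetical endpoint fronted by a load balancer (assumed hostname).
connector_body = {
    "name": "self-hosted-llm",
    "description": "Connector to a self-hosted LLM behind a load balancer",
    "version": 1,
    "protocol": "http",
    "parameters": {"endpoint": "http://llm.internal.example.com:8080"},
    "actions": [
        {
            "action_type": "predict",
            "method": "POST",
            "url": "${parameters.endpoint}/v1/completions",
            "headers": {"Content-Type": "application/json"},
            "request_body": "{ \"prompt\": \"${parameters.prompt}\", \"max_tokens\": 64 }",
        }
    ],
}

# POST this JSON to /_plugins/_ml/connectors/_create on the cluster,
# then register and deploy a remote model against the returned connector id.
print(json.dumps(connector_body, indent=2))
```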

Hi @ylwu-amzn, please consider that exposing a self-hosted LLM server to the public internet via an internet-facing load balancer is not a workable solution, given the security risks it would introduce. In many organisations, deploying such a load balancer would violate internal security guidelines and would likely be blocked by service control policies.

In addition to the request raised separately to remove the private IP restriction for remote models, I think the request here is to support serving small language models for text completion from OpenSearch, much the way embedding models can already be hosted by the cluster.

One example use case would be a multistage ingest pipeline that first uses a small language model to generate example questions for a given document, and then uses an embedding model to create embeddings for those questions, which would produce a higher-scoring match against a user's query.

JohnUiterwyk avatar Feb 29 '24 06:02 JohnUiterwyk
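The two-stage ingest idea described above can be sketched as follows. `generate_questions` and `embed` are hypothetical stand-ins for calls to a hosted SLM and an embedding model; the stub bodies exist only to make the pipeline shape concrete:

```python
def generate_questions(doc_text: str) -> list[str]:
    """Stand-in for a _predict call to a small language model."""
    # A real implementation would prompt the SLM; here we fabricate two questions.
    return [f"What does this document say about {w}?" for w in doc_text.split()[:2]]

def embed(text: str) -> list[float]:
    """Stand-in for an embedding-model call; returns a fixed-size dummy vector."""
    return [float(len(text) % 7)] * 4

def enrich(doc_text: str) -> dict:
    """Stage 1: generate questions; stage 2: embed them for higher-recall matching."""
    questions = generate_questions(doc_text)
    return {
        "text": doc_text,
        "questions": questions,
        "question_embeddings": [embed(q) for q in questions],
    }

doc = enrich("OpenSearch supports vector search")
print(len(doc["question_embeddings"]))  # 2
```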

@JohnUiterwyk would you be open to producing a PR for this?

austintlee avatar Feb 29 '24 19:02 austintlee

@ylwu-amzn , how can I connect my self-hosted LLMs to OpenSearch? glad to hear from you asap. thx.

One possible solution is to create a load balancer in front of your self-hosted LLM server; you can then use the load balancer URL to create a remote model. Let me know if you have any problems with this solution.

Thanks for your reply. I've finished all the steps for connecting my self-hosted LLMs except the _predict one. The error message is listed below:

{
  "error": {
    "root_cause": [
      {
        "type": "illegal_argument_exception",
        "reason": "192.168.31.22"
      }
    ],
    "type": "illegal_argument_exception",
    "reason": "192.168.31.22"
  },
  "status": 400
}

So, what should I do to make it work, and why do I get this illegal_argument_exception? Thank you again for your kindness.

I've found the reason (I'm using a private IP address). The hasPrivateIpAddress check should be a configuration we can control.

MonkeyKing-KK avatar Mar 05 '24 07:03 MonkeyKing-KK

@MonkeyKing-KK Did you find a workaround for the private address restriction (which I find ridiculous)?

Does anyone know why it was limited this way?

Thanks

whittssg avatar May 01 '24 18:05 whittssg

@whittssg It's this code that prevents us from using a private IP: https://github.com/opensearch-project/ml-commons/blob/main/ml-algorithms/src/main/java/org/opensearch/ml/engine/httpclient/MLHttpClientFactory.java#L76

This is mainly a security concern. For example, if your OpenSearch service and some other private internal service run on the same server, and someone creates a connector pointing at your private service, then they can call your private service.

I suggest creating a model service with an HTTP endpoint and authentication, then creating a connector to that endpoint.

ylwu-amzn avatar May 01 '24 19:05 ylwu-amzn
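The "model service with an HTTP endpoint and authentication" suggestion above can be sketched with Python's stdlib HTTP server. This is a minimal illustration under assumptions — the bearer-token scheme, port, and stubbed completion are all invented for the sketch; a real service would run the SLM and use proper credential management:

```python
import json
from http.server import BaseHTTPRequestHandler, HTTPServer

API_TOKEN = "change-me"  # assumption: simple bearer-token auth for the sketch

def authorized(auth_header) -> bool:
    """Accept requests carrying the expected bearer token."""
    return auth_header == f"Bearer {API_TOKEN}"

class PredictHandler(BaseHTTPRequestHandler):
    def do_POST(self):
        if not authorized(self.headers.get("Authorization")):
            self.send_response(401)
            self.end_headers()
            return
        length = int(self.headers.get("Content-Length", 0))
        prompt = json.loads(self.rfile.read(length)).get("prompt", "")
        # A real service would run the SLM here; we return a stub completion.
        body = json.dumps({"response": f"stub completion for: {prompt}"}).encode()
        self.send_response(200)
        self.send_header("Content-Type", "application/json")
        self.end_headers()
        self.wfile.write(body)

# To serve: HTTPServer(("0.0.0.0", 8080), PredictHandler).serve_forever()
```

An ml-commons connector would then point at this endpoint with the matching `Authorization` header in its action definition.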

@ylwu-amzn thanks for the reply. I am just curious: if we are running both the OpenSearch instance and the LLM locally, why is there a security concern (I can understand it for the hosted stuff)?

Is there some documentation for your suggestion:

"Suggest create a model service with a HTTP endpoint and authentication, then create a connector with the endpoint"

Thank you.

whittssg avatar May 01 '24 19:05 whittssg

@whittssg, I haven't verified this idea, but it should theoretically work if you have DNS or a load balancer set up in front of your model service. Avoiding a direct local IP should be fine.

ylwu-amzn avatar May 01 '24 20:05 ylwu-amzn

The local IP is expected to be in the same private network but on a different server. How can we connect to it? I don't see any security concern there, especially since the OpenSearch admin is the one configuring it, so they know what they are doing.

asfoorial avatar May 02 '24 13:05 asfoorial

We had a discussion with the security folks, and they are OK with adding a setting to allow private IPs, so users can control whether to enable it. The setting will be disabled by default; users can enable it if they need it. That should solve the problem.

ylwu-amzn avatar May 02 '24 16:05 ylwu-amzn
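Once such a setting ships, enabling it would be a one-off cluster-settings update. A hedged sketch of the request body — the setting name `plugins.ml_commons.connector.private_ip_enabled` is an assumption based on this discussion; check the release notes for the actual key:

```python
import json

# Assumed setting name; verify against the ml-commons release documentation.
settings_body = {
    "persistent": {
        "plugins.ml_commons.connector.private_ip_enabled": True
    }
}

# PUT this JSON to _cluster/settings on a cluster running the release
# that includes the change.
print(json.dumps(settings_body, indent=2))
```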

@ylwu-amzn glad to hear that. BTW, when will this setting be open to users?

MonkeyKing-KK avatar May 06 '24 09:05 MonkeyKing-KK

@MonkeyKing-KK , It will be in 2.15. https://opensearch.org/releases.html

ylwu-amzn avatar May 06 '24 19:05 ylwu-amzn

+1 on this! Looking forward to being able to host my own openai compatible models and use with this.

dtaivpp avatar Jun 01 '24 02:06 dtaivpp

@ylwu-amzn I haven't seen the changes, and 2.15 is planned to be released this week. Any update?

manzke avatar Jun 11 '24 10:06 manzke

Just to add to all of this: there is currently a very suitable SLM in ONNX format, Phi-3 (https://huggingface.co/microsoft/Phi-3-mini-128k-instruct-onnx and https://github.com/microsoft/onnxruntime-genai/blob/main/examples/python/phi-3-tutorial.md). It runs at a very decent speed on my own CPU-only, five-year-old laptop, and the accuracy is quite reasonable. I highly recommend supporting hosting and integrating it natively within OpenSearch. I would even recommend defining OpenSearch-specific prompts to perform certain tasks. I remember there were talks in the community about hosting NER models; I believe this model, with a few predefined prompts, would do the job very well. Supporting such a model would give OpenSearch an edge over other open-source alternatives.

asfoorial avatar Jun 11 '24 11:06 asfoorial

@ylwu-amzn I haven't seen the changes, and 2.15 is planned to be released this week. Any update?

The PR is out https://github.com/opensearch-project/ml-commons/pull/2534

ylwu-amzn avatar Jun 11 '24 22:06 ylwu-amzn

Just to add to all of this: there is currently a very suitable SLM in ONNX format, Phi-3 (https://huggingface.co/microsoft/Phi-3-mini-128k-instruct-onnx and https://github.com/microsoft/onnxruntime-genai/blob/main/examples/python/phi-3-tutorial.md). It runs at a very decent speed on my own CPU-only, five-year-old laptop, and the accuracy is quite reasonable. I highly recommend supporting hosting and integrating it natively within OpenSearch. I would even recommend defining OpenSearch-specific prompts to perform certain tasks. I remember there were talks in the community about hosting NER models; I believe this model, with a few predefined prompts, would do the job very well. Supporting such a model would give OpenSearch an edge over other open-source alternatives.

@asfoorial Thanks a lot for sharing this. It sounds doable. @jngz-es, can you do some research?

ylwu-amzn avatar Jun 11 '24 22:06 ylwu-amzn