feat: add support for custom ai models
Since @nyannyacha made a code refactor, the original description may no longer reflect the architecture changes, but you can still read it below:
Original description
What kind of change does this PR introduce?
Feature, Enhancement
What is the current behavior?
As described in the following discussion, edge-runtime currently only supports the bundled gte-small model for feature extraction.
What is the new behavior?
This PR introduces a new Rust implementation of a transformers-like API for edge-runtime. It allows installing different ONNX models at the same time and contains the base API definitions to support other kinds of pipelines beyond the feature-extraction task.
This new API introduces the Pipeline class, which is intended to be similar to xenova/transformers but backed by Rust. The Pipeline class aims to be an evolution of the current Session class, which only supports gte-small for feature extraction and ollama models. The new class defines APIs that allow:
- Different kinds of tasks: feature-extraction (implemented), token-classification, sentiment-analysis, and others...
- Usage of pre-installed models
- GPU support provided by ort
GPU support was added in the ort-gpu-runtime branch, see also ort - Execution Providers
Docker Image:
You can get a docker image of this from docker hub:
# default runtime
docker pull kallebysantos/edge-runtime:latest
# gpu with cuda provider
docker pull kallebysantos/edge-runtime:latest-cuda
Pipeline usage:
Using the new Pipeline class is very simple and looks very similar to Session:
Using default gte-small, pre-installed by Supabase team:
// By default `feature-extraction` will map to `gte-small`
const pipe = new Supabase.ai.Pipeline('feature-extraction');
Deno.serve(async (req: Request) => {
const embeddings = await pipe.run("Hello World", { mean_pool: true, normalize: true });
return new Response(JSON.stringify(embeddings), { status: 200 });
});
Using a different model:
// Just set the model name, similar to `xenova/transformers` or python `transformers`
const pipe = new Supabase.ai.Pipeline(
'feature-extraction',
'paraphrase-multilingual-MiniLM-L12-v2',
);
Deno.serve(async (req: Request) => {
const embeddings = await pipe.run("Hello World", { mean_pool: true, normalize: true });
return new Response(JSON.stringify(embeddings), { status: 200 });
});
Note: Custom models must be pre-installed inside the edge-runtime container. The Pipeline class will not download them automatically.
Models folder architecture:
The models folder has changed a little bit to support more than one model/task installed at the same time. This new structure aims to be more organized and robust, based on the following scheme:
models
├── defaults
│ ├── feature-extraction -> ../gte-small
│ └── token-classification -> ../ner-bert-large-cased-pt-lenerbr-onnx
├── gte-small
│ ├── model.onnx
│ └── tokenizer.json
├── ner-bert-large-cased-pt-lenerbr-onnx
│ └── ...
└── paraphrase-multilingual-MiniLM-L12-v2
└── ...
Models are installed in the root and then referenced as symbolic links in the defaults folder. The defaults folder maps the current default model to a specific task. Symbolic links are used to reduce disk space and file duplication.
The task name and the model name must match the folder names, to simplify model loading by the runtime.
So, in order to install a new model, you just need to download it into the models folder.
To change the default model of a specific task, you just need to change the symbolic link.
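For illustration, resolving a model folder under this layout could look roughly like the sketch below (the helper name and signature are assumptions for this example, not the PR's actual loader code):

use std::path::{Path, PathBuf};

// Hypothetical sketch: resolve the on-disk model folder for a task,
// assuming the layout above (a models root with a defaults/ folder of symlinks).
fn resolve_model_dir(models_dir: &Path, task: &str, model: Option<&str>) -> Option<PathBuf> {
    let candidate = match model {
        // Explicit model name: look it up directly in the models root.
        Some(name) => models_dir.join(name),
        // No model given: follow the defaults/<task> symlink.
        None => models_dir.join("defaults").join(task),
    };
    // canonicalize follows the symlink and fails if the folder is missing.
    std::fs::canonicalize(candidate).ok()
}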
To simplify model installation, the download_models.sh script has been changed a little and now supports the model name as an input argument:
./scripts/download_models.sh "Supabase/gte-small" "feature-extraction"
Setting feature-extraction defaults to gte-small
./scripts/download_models.sh "hf-user/model-name"
Custom model, for example Xenova/paraphrase-multilingual-MiniLM-L12-v2
The Pipeline API:
The pipeline mod allows developers to create new kinds of tasks - currently only feature-extraction has been implemented. This architecture defines the building blocks to extend it and add more AI support.
In order to create new pipelines, developers must impl the Pipeline trait like the following:
Feature Extraction implementation
use anyhow::Error;
use ndarray::{Array1, Axis, Ix3};
use ndarray_linalg::norm::{normalize, NormalizeAxis};
use ort::{inputs, Session};
use std::sync::Arc;
use tokenizers::Tokenizer;
use tokio::sync::mpsc::{self, UnboundedReceiver, UnboundedSender};
use tokio::sync::Mutex;
use crate::tensor_ops::mean_pool;
use super::{Pipeline, PipelineInput, PipelineRequest};
pub type FeatureExtractionResult = Vec<f32>;
#[derive(Debug)]
pub(crate) struct FeatureExtractionPipelineInput {
pub prompt: String,
pub mean_pool: bool,
pub normalize: bool,
}
impl PipelineInput for FeatureExtractionPipelineInput {}
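// Requests are queued on an unbounded mpsc channel; the receiver is shared behind
// Arc<Mutex<..>> so a single consumer at a time can drain it and run the queued
// inputs against the loaded Session.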
#[derive(Debug)]
pub struct FeatureExtractionPipeline {
sender:
UnboundedSender<PipelineRequest<FeatureExtractionPipelineInput, FeatureExtractionResult>>,
receiver: Arc<
Mutex<
UnboundedReceiver<
PipelineRequest<FeatureExtractionPipelineInput, FeatureExtractionResult>,
>,
>,
>,
}
impl FeatureExtractionPipeline {
pub fn init() -> Self {
let (sender, receiver) = mpsc::unbounded_channel::<
PipelineRequest<FeatureExtractionPipelineInput, FeatureExtractionResult>,
>();
Self {
sender,
receiver: Arc::new(Mutex::new(receiver)),
}
}
}
impl Pipeline<FeatureExtractionPipelineInput, FeatureExtractionResult>
for FeatureExtractionPipeline
{
fn get_sender(
&self,
) -> UnboundedSender<PipelineRequest<FeatureExtractionPipelineInput, FeatureExtractionResult>>
{
self.sender.to_owned()
}
fn get_receiver(
&self,
) -> Arc<
Mutex<
UnboundedReceiver<
PipelineRequest<FeatureExtractionPipelineInput, FeatureExtractionResult>,
>,
>,
> {
Arc::clone(&self.receiver)
}
fn run(
&self,
session: &Session,
tokenizer: &Tokenizer,
input: &FeatureExtractionPipelineInput,
) -> Result<FeatureExtractionResult, Error> {
let encoded_prompt = tokenizer
.encode(input.prompt.to_owned(), true)
.map_err(anyhow::Error::msg)?;
let input_ids = encoded_prompt
.get_ids()
.iter()
.map(|i| *i as i64)
.collect::<Vec<_>>();
let attention_mask = encoded_prompt
.get_attention_mask()
.iter()
.map(|i| *i as i64)
.collect::<Vec<_>>();
let token_type_ids = encoded_prompt
.get_type_ids()
.iter()
.map(|i| *i as i64)
.collect::<Vec<_>>();
let input_ids_array = Array1::from_iter(input_ids.iter().cloned());
let input_ids_array = input_ids_array.view().insert_axis(Axis(0));
let attention_mask_array = Array1::from_iter(attention_mask.iter().cloned());
let attention_mask_array = attention_mask_array.view().insert_axis(Axis(0));
let token_type_ids_array = Array1::from_iter(token_type_ids.iter().cloned());
let token_type_ids_array = token_type_ids_array.view().insert_axis(Axis(0));
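// Run the ONNX session; the input names must match the model's graph inputs.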
let outputs = session.run(inputs! {
"input_ids" => input_ids_array,
"token_type_ids" => token_type_ids_array,
"attention_mask" => attention_mask_array,
}?)?;
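// "last_hidden_state" has shape (batch, tokens, hidden); it is reduced to the final embedding below.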
let embeddings = outputs["last_hidden_state"].extract_tensor()?;
let embeddings = embeddings.into_dimensionality::<Ix3>()?;
let result = if input.mean_pool {
mean_pool(embeddings, attention_mask_array.insert_axis(Axis(2)))
} else {
embeddings.into_owned().remove_axis(Axis(0))
};
let result = if input.normalize {
let (normalized, _) = normalize(result, NormalizeAxis::Row);
normalized
} else {
result
};
Ok(result.view().to_slice().unwrap().to_vec())
}
}
GPU Support:
GPU support allows pipeline inference on specialized hardware and is backed by CUDA. There is no configuration required from the end user; just call the Pipeline as described before. But to enable GPU inference, the Dockerfile was changed a little and now has two main build stages (which must be explicitly specified during docker build):
edge-runtime (CPU only):
This stage builds the default edge-runtime, where ort::Sessions are loaded using the CPU.
docker build --target "edge-runtime" .
edge-runtime-cuda (GPU/CPU):
This stage builds the default edge-runtime on an nvidia/cuda base image, which allows loading sessions on GPU or CPU (as fallback).
docker build --target "edge-runtime-cuda" .
Each stage needs to install the appropriate onnxruntime. For that, install_onnx.sh has been updated with a 4th parameter flag --gpu, which will download a cuda build from the official microsoft/onnxruntime repository.
Using GPU image:
In order to use the GPU image, the docker-compose file must include the following properties for the functions service:
services:
functions:
image: supabase/edge-runtime:latest-cuda
# Built as described before, or directly by compose
build:
context: .
dockerfile: Dockerfile
target: edge-runtime-cuda
# Required to use gpu inside the container
deploy:
resources:
reservations:
devices:
- driver: nvidia
count: 1 # Change here if more devices are installed
capabilities: [gpu]
IMPORTANT NOTE: The target infrastructure must have the NVIDIA Container Toolkit installed to allow GPU support inside Docker.
Final considerations:
There is a lot of work to do in order to support more kinds of tasks, as well as improve the Pipeline API to avoid code duplication. But I think this PR brings a standard pattern to improve AI support in edge-runtime, reducing container warm-up costs and giving more flexibility to change and test different models.
Updates and tests:
Inference Performance
See the answer by @nyannyacha below, which describes some solutions.
Since I made this PR, I've been testing it with my custom docker image and I found some performance issues.
In my create_session implementation I tried to follow the same structure as the original code from the Supabase team, but I found that sometimes the container crashes 🔥.
My theory is that Session::builder() is creating multiple Session instances instead of reusing them, but I don't know if it was supposed to do that. For an edge environment I think that's ok, since every request runs in a separate cold-start container. But for self-hosting it has a huge performance impact due to CPU utilization issues. I would appreciate it if someone could help me with that.
GPU Support (check ort-gpu-runtime )
Already added to this PR, check the GPU Support section above.
In the previous section I reported some CPU performance issues, so I started looking into GPU utilization.
GPU inference setup
So I moved on with the following changes in the create_session function:
let builder = Session::builder()?
.with_optimization_level(GraphOptimizationLevel::Level3)?
.with_intra_threads(orm_threads)?;
let cuda = CUDAExecutionProvider::default();
println!("CUDA: {:?}", cuda.is_available());
if let Err(error) = cuda.register(&builder) {
eprintln!("Failed to register CUDA!: {:?}", error);
std::process::exit(1); // Forcing CUDA just for testing, in production I recommend to use provider fallbacks
}
let session = builder.commit_from_file(model_file_path)?;
Ok(session)
Then I updated the docker image with CUDA support:
Dockerfile code
# Docker image with rust changes for GPU support
FROM kallebysantos/edge-runtime:latest-cuda AS build
RUN apt update && apt install curl wget -y
WORKDIR /etc/sb_ai
# Custom model downloading
RUN curl -L -o ./download_models.sh https://raw.githubusercontent.com/kallebysantos/edge-runtime/main/scripts/download_models.sh
RUN chmod +x ./download_models.sh
RUN ./download_models.sh "Xenova/paraphrase-multilingual-MiniLM-L12-v2" "feature-extraction" --quantized
# Getting onnxruntime with cuda support
RUN wget -qO- 'https://github.com/microsoft/onnxruntime/releases/download/v1.18.1/onnxruntime-linux-x64-gpu-1.18.1.tgz' | tar zxv
FROM nvidia/cuda:11.8.0-cudnn8-runtime-ubuntu22.04 AS run
COPY --from=build /etc/sb_ai/models /etc/sb_ai/models
COPY --from=build /usr/local/bin/edge-runtime /usr/local/bin/edge-runtime
COPY --from=build /etc/sb_ai/onnxruntime-linux-x64-gpu-1.18.1 /usr/local/bin/onnxruntime
ENV SB_AI_MODELS_DIR=/etc/sb_ai/models
ENV ORT_DYLIB_PATH=/usr/local/bin/onnxruntime/lib/libonnxruntime.so
ENV NVIDIA_VISIBLE_DEVICES=all
ENV NVIDIA_DRIVER_CAPABILITIES=compute,utility
ENTRYPOINT ["edge-runtime"]
Embedding from DB
With the GPU container started, I tried to embed all the missing document_sections of my database with the following function:
code
-- Function that can be triggered manually
create or replace function private.apply_embed_document_sections(
batch_size int default 5,
timeout_milliseconds int default 5 * 60 * 1000
)
returns void
language plpgsql
security definer
as $$
declare
batch_count int = ceiling((
select count(*) from document_sections where (ai).embedding is null
) / batch_size::float);
begin
SET search_path = 'public','extensions';
-- Loop through each batch and invoke an edge function to handle the embedding generation
for i in 0 .. (batch_count-1) loop
perform
net.http_request(
method := 'POST',
url := vault.get('supabase_url') || '/functions/v1/document_sections/embed',
body := jsonb_build_object(
'ids', (select json_agg(id) from (
select id from document_sections
where (ai).embedding is null
order by id limit batch_size offset i*batch_size
) as ds)
),
headers := jsonb_build_object(
'Content-Type', 'application/json',
'Authorization','Bearer ' || vault.get('supabase_anon_key')
),
timeout_milliseconds := timeout_milliseconds
);
end loop;
end;
$$;
GPU performance
Then I confirmed that my theory was true: watching nvidia-smi, I could see that GPU memory was being overflowed by multiple model instances, and after some time I needed to restart the container.
logs
GPU overflow. Note: the model should only use ~200 MB.
What kind of change does this PR introduce?
Feature, Enhancement
What is the current behavior?
As described in the following discussion, edge-runtime currently only supports the bundled gte-small model for feature extraction.
What is the new behavior?
This PR introduces a new Rust implementation of a transformers-like API for edge-runtime. It allows installing different ONNX models at the same time and contains the base API definitions to support other kinds of pipelines beyond the feature-extraction task.
This new API introduces the Pipeline class, which is intended to be similar to xenova/transformers but backed by Rust. The Pipeline class aims to be an evolution of the current Session class, which only supports gte-small for feature extraction and ollama models. The new class defines APIs that allow:
- Different kinds of tasks like feature-extraction, sentiment-analysis, and others...
- Usage of pre-loaded models (build time).
- Model fetching and cache (runtime).
- GPU support provided by ort.
- Lazy loading of multiple ort sessions at the same time.
👷 It is also a foundation for something even bigger, like a hosting platform for custom models, aka MLOps.
Tester docker image:
You can get a docker image of this PR from docker hub:
# default runtime
docker pull kallebysantos/edge-runtime:latest
# gpu with cuda provider
docker pull kallebysantos/edge-runtime:latest-cuda
Pipeline usage:
Using the new Pipeline class is very simple and looks very similar to Session:
Using default gte-small, pre-installed by Supabase team:
Typescript support provided by add type definition for Pipeline API #91
const pipe = new Supabase.ai.Pipeline('gte-small'); // 'gte-small' or 'supabase-gte'
// 'supabase-gte' | 'gte-small' are only kept for backward compatibility with the 'feature-extraction' default variant.
Deno.serve(async (req: Request) => {
const embeddings = await pipe.run("Hello World");
return new Response(JSON.stringify(embeddings), { status: 200 });
});
Using a different model:
Custom models can be used as a variant of a task. The pipeline will auto-fetch them at runtime. But for this to work, the download URLs must be provided as environment variables:
# export SB_AI_MODEL_URL_<task>_<variant_name>="...hf.co/../model.onnx"
# export SB_AI_TOKENIZER_URL_<task>_<variant_name>="...hf.co/../tokenizer.json"
# export SB_AI_CONFIG_URL_<task>_<variant_name>="...hf.co/../config.json"
export SB_AI_MODEL_URL_FEATURE_EXTRACTION_MULTILINGUAL_MINI_LM='https://huggingface.co/Xenova/paraphrase-multilingual-MiniLM-L12-v2/resolve/main/onnx/model_quantized.onnx?download=true'
export SB_AI_TOKENIZER_URL_FEATURE_EXTRACTION_MULTILINGUAL_MINI_LM='https://huggingface.co/Xenova/paraphrase-multilingual-MiniLM-L12-v2/resolve/main/tokenizer.json?download=true'
export SB_AI_CONFIG_URL_FEATURE_EXTRACTION_MULTILINGUAL_MINI_LM='https://huggingface.co/Xenova/paraphrase-multilingual-MiniLM-L12-v2/resolve/main/config.json?download=true'
./scripts/run.sh
Then use it as a task variant:
// Just set the variant name, similar to `xenova/transformers` or python `transformers`
const pipe = new Supabase.ai.Pipeline('feature-extraction', 'multilingual-Mini-LM');
Deno.serve(async (req: Request) => {
const embeddings = await pipe.run("Hello World");
return new Response(JSON.stringify(embeddings), { status: 200 });
});
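For reference, the env var lookup described above could be implemented roughly like this (a sketch with an illustrative helper name, not necessarily the PR's actual code):

use std::env;

// Hypothetical sketch: build the SB_AI_MODEL_URL_<TASK>_<VARIANT> key from the
// task and variant names, following the naming scheme shown above.
fn model_url_from_env(task: &str, variant: &str) -> Option<String> {
    let normalize = |s: &str| s.replace('-', "_").to_uppercase();
    let key = format!("SB_AI_MODEL_URL_{}_{}", normalize(task), normalize(variant));
    env::var(key).ok()
}

// e.g. model_url_from_env("feature-extraction", "multilingual-Mini-LM") reads
// SB_AI_MODEL_URL_FEATURE_EXTRACTION_MULTILINGUAL_MINI_LM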
Available tasks:
At this point there are 3 kinds of tasks implemented in this PR; many more will come later.
Feature extraction, see details here
The default variant model for this task is Supabase/gte-small.
Also, supabase-gte and gte-small are aliases for feature-extraction with its default variant.
Example 1: Instantiate pipeline using the Pipeline class.
const extractor = new Supabase.ai.Pipeline('feature-extraction');
const output = await extractor('This is a simple test.');
// output: [0.05939, 0.02165, ...]
Example 2: Batch inference, processing multiple inputs in parallel
const extractor = new Supabase.ai.Pipeline('feature-extraction');
const output = await extractor(["I'd use Supabase in all of my projects", "Just a test for embedding"]);
// output: [[0.07399, 0.01462, ...], [-0.08963, 0.01234, ...]]
Text classification, see details here
The default variant model for this task is Xenova/distilbert-base-uncased-finetuned-sst-2-english.
Also, sentiment-analysis is an alias for text-classification.
Example 1: Instantiate pipeline using the Pipeline class.
const classifier = new Supabase.ai.Pipeline('text-classification');
const output = await classifier('I love Supabase!');
// output: {label: 'POSITIVE', score: 1.00}
Example 2: Batch inference, processing multiple inputs in parallel
const classifier = new Supabase.ai.Pipeline('sentiment-analysis');
const output = await classifier(['Cats are fun', 'Java is annoying']);
// output: [{label: 'POSITIVE', score: 0.99 }, {label: 'NEGATIVE', score: 0.97}]
Zero shot classification, see details here
The default variant model for this task is Xenova/distilbert-base-uncased-mnli.
Example 1: Instantiate pipeline using the Pipeline class.
const classifier = new Supabase.ai.Pipeline('zero-shot-classification');
const output = await classifier({
text: 'one day I will see the world',
candidate_labels: ['travel', 'cooking', 'exploration']
});
// output: [{label: "travel", score: 0.797}, {label: "exploration", score: 0.199}, {label: "cooking", score: 0.002}]
Example 2: Handling multiple correct labels
const classifier = new Supabase.ai.Pipeline('zero-shot-classification');
const input = {
text: 'one day I will see the world',
candidate_labels: ['travel', 'cooking', 'exploration']
};
const output = await classifier(input, { multi_label: true });
// output: [{label: "travel", score: 0.994}, {label: "exploration", score: 0.938}, {label: "cooking", score: 0.001}]
Example 3: Custom hypothesis template
const classifier = new Supabase.ai.Pipeline('zero-shot-classification');
const input = {
text: 'one day I will see the world',
candidate_labels: ['travel', 'cooking', 'exploration']
};
const output = await classifier(input, { hypothesis_template: "This example is NOT about {}" });
// output: [{label: "cooking", score: 0.47}, {label: "exploration", score: 0.26}, {label: "travel", score: 0.26}]
The cache folder:
By default all task assets will be cached to ort-sys::internal::dirs::cache_dir() using the following folder structure:
.cache
├── dfbin
│ └── <ort-runtime-lib>
├── models
│ ├── <hash of the model url from environment variable>
│ │ ├── model.onnx
│ │ └── model.onnx.checksum
│ └── ...
├── configs
│ ├── <hash of the config url from environment variable>
│ │ ├── config.json
│ │ └── config.json.checksum
│ └── ...
└── tokenizers
├── <hash of the tokenizer url from environment variable>
│ ├── tokenizer.json
│ └── tokenizer.json.checksum
└── ...
Since cache folders are resolved based on the download URLs, if a URL changes the pipeline will refetch it. To keep docker container cold starts fast, consider mapping this folder to an external volume. Assets can also be pre-loaded during docker build using the preload_defs bin.
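As a rough sketch of how a cache path could be derived from a URL (the hash function here is an assumption for illustration; the real implementation may differ):

use std::collections::hash_map::DefaultHasher;
use std::hash::{Hash, Hasher};
use std::path::{Path, PathBuf};

// Hypothetical sketch: pick the per-model cache folder by hashing the download URL,
// mirroring the .cache/models/<hash>/model.onnx layout shown above.
fn cached_model_path(cache_dir: &Path, model_url: &str) -> PathBuf {
    let mut hasher = DefaultHasher::new();
    model_url.hash(&mut hasher);
    cache_dir
        .join("models")
        .join(format!("{:x}", hasher.finish()))
        .join("model.onnx")
}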
GPU Support:
GPU support allows pipeline inference on specialized hardware and is backed by CUDA. There is no configuration required from the end user; just call the Pipeline as described before. But to enable GPU inference, the Dockerfile now has two main build stages (which must be explicitly specified during docker build):
edge-runtime (CPU only):
This stage builds the default edge-runtime, where ort::Sessions are loaded using the CPU.
docker build --target "edge-runtime" .
Resulting image is ~150 MB
edge-runtime-cuda (GPU/CPU):
This stage builds the default edge-runtime on an nvidia/cuda base image, which allows loading sessions on GPU or CPU (as fallback).
docker build --target "edge-runtime-cuda" .
Resulting image is ~2.20 GB
Each stage needs to install the appropriate onnxruntime. For that, install_onnx.sh has been updated with a 4th parameter flag --gpu, which will download a cuda build from the official microsoft/onnxruntime repository.
Using GPU image:
In order to use the GPU image, the docker-compose file must include the following properties for the functions service:
services:
functions:
# Built as described before
image: supabase/edge-runtime:latest-cuda
# or directly by compose
build:
context: .
dockerfile: Dockerfile
target: edge-runtime-cuda
# Required to use gpu inside the container
deploy:
resources:
reservations:
devices:
- driver: nvidia
count: 1 # Change here if more devices are installed
capabilities: [gpu]
IMPORTANT NOTE: The target infrastructure must have the NVIDIA Container Toolkit installed to allow GPU support inside Docker.
Final considerations:
There is a lot of work to do in order to support more kinds of tasks, as well as improve the Pipeline API. But I think this PR brings a standard pattern to improve AI support in edge-runtime, reducing container warm-up costs and giving more flexibility to change and test different models.
Finally, thanks to @nyannyacha, who helped me a lot 🙏
cc @laktek
This is amazing! Thanks for contributing @kallebysantos.
Please allow us couple of days to go through the changes and get back to you on how to proceed.
Hi guys!!
I've pushed a docker image with the current changes: kallebysantos/edge-runtime:latest
@kallebysantos I started experimenting with your branch locally and got few questions would like to discuss with you (more product related than the code). Can you please drop me an email lakshan [at] supabase [dot] com?
Yes, sure 🙏
@kallebysantos Just wanted to let you know I'll be on leave for 4 weeks starting from next week. We'll make a decision on how to proceed with this after I return. Meanwhile, other contributors will review and test the PR.
Hello @kallebysantos 😄
Since @laktek is on leave, I'm going to help with the review of this PR.
(If it's okay with you, I hope you will allow me to push changes to PR to speed up reviews. 😋)
Hi @nyannyacha, that sounds ok to me, feel free to help 🤗
I reported some "issues/questions" in the Updates and tests section of my PR, could you have a look at that? I know that the session instance could maybe use a singleton pattern, which should solve it, but I don't know if it was supposed to work that way, since the Supabase team aims to host it on the edge and I'm not sure about the impacts of a singleton instance there. I was thinking of something like a HashMap of (model_name, session), but I need to study the Deno runtime more to know where global/shared things should be stored. Also some way to terminate the session after a while, especially when using the CUDA provider; on CPU I noticed that workers already terminate the session.
@kallebysantos
Since I made this PR, I've been testing it with my custom docker image and I found some performance issues. In my create_session implementation I tried to follow the same structure as the original code from the Supabase team, but I found that sometimes the container crashes 🔥.
I don't know what error code your container terminated with, but we've seen containers terminate due to SIGSEGV in addition to containers terminating due to system memory or GPU memory exhaustion.
If it also happened to your container, it was probably a problem with get_onnx_env in auto_pipeline.rs.
pub(crate) fn get_onnx_env() -> Lazy<Option<ort::Error>> {
Lazy::new(|| {
// Create the ONNX Runtime environment, for all sessions created in this process.
// TODO: Add CUDA execution provider
if let Err(err) = ort::init().with_name("SB_AI_ONNX").commit() {
error!("sb_ai: failed to create environment - {}", err);
return Some(err);
}
None
})
}
If I remember correctly, ort::init()...commit() assigns ort::Environment to a global static variable (G_ENV), so there shouldn't be any further assignments to that variable after this.
However, since get_onnx_env transfers ownership of the return value to a local variable on the caller's stack rather than assigning it to a static variable, we can expect ort::init()...commit() to be called every time this function is called.
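A rough sketch of that direction (not the exact commit) would be to commit the environment from a true static, so the initialization runs at most once per process:

use once_cell::sync::Lazy;

// Sketch only: storing the init result in a static means ort::init()...commit()
// runs at most once, instead of on every get_onnx_env() call.
static ONNX_ENV_INIT: Lazy<Option<ort::Error>> = Lazy::new(|| {
    // Create the ONNX Runtime environment once for all sessions in this process.
    if let Err(err) = ort::init().with_name("SB_AI_ONNX").commit() {
        return Some(err);
    }
    None
});

pub(crate) fn get_onnx_env() -> &'static Option<ort::Error> {
    Lazy::force(&ONNX_ENV_INIT)
}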
I know that the session instance could maybe use a singleton pattern, which should solve it, but I don't know if it was supposed to work that way, since the Supabase team aims to host it on the edge and I'm not sure about the impacts of a singleton instance there.
I've had similar thoughts to yours before, and I've pushed some commits for this to my fork before this review. (This fork branch also contains commits for a few individual things that seem to need improvement in the scope of the sb_ai crate.)
Could you take a look?
https://github.com/nyannyacha/edge-runtime/commits/ai https://github.com/nyannyacha/edge-runtime/commit/bc28819a6d941f143878d84c8b85e1c077b76360 https://github.com/nyannyacha/edge-runtime/commit/444790335a2d74282b02ad22dcce63f99f2fb195
@nyannyacha
I've had similar thoughts to yours before, and I've pushed some commits for this to my fork before this review. (This fork branch also contains commits for a few individual things that seem to need improvement in the scope of the sb_ai crate.) Could you take a look?
https://github.com/nyannyacha/edge-runtime/commits/ai nyannyacha@bc28819 nyannyacha@4447903
I replicated your ai branch in my ai branch to do some tests with your changes to create_session. But now I can't figure out why that branch just doesn't compile. I think it's something related to the base crate and this snapshot thing; I assume it's trying to execute the runtime during compilation.
@kallebysantos
That failure is related to the last commit I pushed to this PR.
The GitHub Test action does not have the onnx runtime library, so the panic is caused by the ctor macro trying to initialize the onnx environment.
Hi @nyannyacha, I followed your code changes in order to implement the session optimization for Pipeline.
During my tests (processing 10k embeddings in batches of 10), the combination of my changes with yours (bc28819a6d941f143878d84c8b85e1c077b76360) solved the memory issues and container crashes.
- Container crashing: Before your fixes, my container blew up on every first request (I thought it was probably the deno caching, which should run once but was being called twice due to parallel requests at the same time).
- GPU: I could see the reuse of ort::Session from my changes in create_session, which results in just ~600 MiB of GPU RAM.
Before, my container was crashing a lot; now it just crashes once or twice. But I assume that's because the machine is weak and was processing a lot of requests at the same time; looking at htop I could see my CPU at max.
I had to implement a HashMap over the Lazy/static thing, because the Pipelines should handle models in a generic way, so it's impossible to know the model path as a constant value. You can check it in these commits:
- 122bd30ff7766ecd6266e6e70faacd0d3806b026
- 24a3028dfa82adfa5f838de3d0d2409d45f3080e
I think the next step is to look for some way to drop unused Sessions after a while. Maybe some smart pointer could help with that. But for my company's use case, these optimizations should be enough.
Also, I'll need to move further and add more kinds of pipelines like ner/token-classification. I saw the edge-transformers and rust-bert repositories; maybe we can follow their implementations.
Hello @kallebysantos 😉
I followed your code changes in order to implement the session optimization for Pipeline.
When you say the session optimization, do you mean this commit?
the combination of my changes with yours (https://github.com/supabase/edge-runtime/commit/bc28819a6d941f143878d84c8b85e1c077b76360) solved the memory issues and container crashes.
Glad to hear that the commit worked for you. 😁
Before, my container was crashing a lot; now it just crashes once or twice.
Hmm... this is a little weird, can you find out what error code crashed the container? If it was caused by an error like SIGSEGV, that's serious and we should definitely look for it. Conversely, if it was caused by an OOM (out of memory), that makes sense since you said your machine is weak.
...However, it's hard to think of a case where the edge runtime crashes simply because CPU utilization reaches max. 🧐
I had to implement a HashMap over the Lazy/static thing, because the Pipelines should handle models in a generic way, so it's impossible to know the model path as a constant value. You can check it in these commits:
Yeah, this is a good idea. I looked at the two commits you linked to and they look fine. I'd like you to add that change to this PR. 😊
I think the next step is to look for some way to drop unused Sessions after a while. Maybe some smart pointer could help with that. But for my company's use case, these optimizations should be enough.
Since the Session is wrapped in an Arc<T>, we should be able to see a strong_count. This should allow us to write simple LRU or timer-based cleanup logic.
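As a rough sketch of that idea (illustrative types and names, not the actual PR code):

use std::collections::HashMap;
use std::sync::{Arc, Mutex};

// Sketch only: cache sessions by model name and periodically drop the ones
// nobody else references (strong_count == 1 means only the cache holds the Arc).
struct SessionCache<T> {
    inner: Mutex<HashMap<String, Arc<T>>>,
}

impl<T> SessionCache<T> {
    fn new() -> Self {
        Self { inner: Mutex::new(HashMap::new()) }
    }

    // Return the cached session for `key`, or build one with `init` and remember it.
    fn get_or_init(&self, key: &str, init: impl FnOnce() -> T) -> Arc<T> {
        let mut map = self.inner.lock().unwrap();
        map.entry(key.to_owned())
            .or_insert_with(|| Arc::new(init()))
            .clone()
    }

    // Drop every session that no pipeline is currently using.
    fn sweep_unused(&self) {
        let mut map = self.inner.lock().unwrap();
        map.retain(|_, session| Arc::strong_count(session) > 1);
    }
}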
Also, I'll need to move further and add more kinds of pipelines like ner/token-classification. I saw the edge-transformers and rust-bert repositories; maybe we can follow their implementations.
I don't have any immediate thoughts on this, but I think it would take a lot of time to merge if we add a lot of changes to this PR, so it might not be a bad idea to start this on another PR after this PR merge. 😁
Hello @nyannyacha, thanks for your considerations 🙏.
I've pushed the optimize session commit to the main branch.
Hmm... this is a little weird, can you find out what error code crashed the container? If it was caused by an error like SIGSEGV, that's serious and we should definitely look for it. Conversely, if it was caused by an OOM (out of memory), that makes sense since you said your machine is weak.
I can't tell what error it is; the only thing that happens is supabase-edge-functions exited with code 0. I think it's related to the weak hardware; like you said before, the RAM just flies to the moon 🚀 as you can see from my htop logs below.
Memory blowing up 🚀🔥
For a production environment I'm planning to run it on a better machine with Docker Swarm, using replicas and resource limits.
Container start
Start processing pending sections
select count(*) from document_sections
where (ai).embedding is null;
-- Pending sections count: 17,771
select count(*) from net.failed_requests
-- Nothing here
select private.apply_embed_document_sections();
-- Start performing requests in batches of 5
Server handling a lot of requests
As you can see, on this machine the CPU goes up to 100%. We can also see that just 1 session is initialized and then reused. 🤗
Container crash
Then, after a while, the container just exits with code 0.
These were the final results, after exiting:
select count(*) from document_sections
where (ai).embedding is null;
-- Pending sections count: 14,578
select count(*) from net.failed_requests
-- Failed requests: 1,298
Since my compose will always restart the container, it keeps processing after a reboot. And I have some pg_cron jobs to retry failed requests. My example is very heavy since I triggered the embedding manually for all pending sections, but in prod it will be called automatically by pg triggers after inserting a document, so the quantity of embeddings to process will be very small.
@kallebysantos
I'm confused that the container has an exit code of 0 😵💫
That exit code would indicate that the edge runtime process was gracefully terminated, but your screenshot describes the exact opposite.
Can you try the command below immediately after the container crashes to see if it was OOMKilled?
docker container inspect <edge functions container id>
Example output
...
"State": {
"Status": "running",
"Running": true,
"Paused": false,
"Restarting": false,
"OOMKilled": false, // 👈
...
Hey @kallebysantos
I added a few more commits.
Previously, the error code was returned as 0 because edge runtime consumed the received unix signal and did not expose it out of process.
Please test it in your environment and let me know how it goes. 😋
Note:
This PR includes improvements to some bottlenecks for the inference task.
Load Testing (main vs PR-368)
Hardware Information
Hardware:
Hardware Overview:
Model Name: MacBook Air
Model Identifier: MacBookAir10,1
Chip: Apple M1
Total Number of Cores: 8 (4 performance and 4 efficiency)
Memory: 16 GB
Run # 1 (cold-start)
The RSS metric shown in htop in the screenshots below is the result after malloc_trim(0) is called.
main
PR-368
Run # 2 (warm-up)
k6 only
main
vscode ➜ /workspaces/edge-runtime (main-k6) $ k6 run --summary-trend-stats "min,avg,med,max,p(95),p(99),p(99.99)" ./k6/dist/specs/gte.js
/\ |‾‾| /‾‾/ /‾‾/
/\ / \ | |/ / / /
/ \/ \ | ( / ‾‾\
/ \ | |\ \ | (‾) |
/ __________ \ |__| \__\ \_____/ .io
execution: local
script: ./k6/dist/specs/gte.js
output: -
scenarios: (100.00%) 1 scenario, 12 max VUs, 3m30s max duration (incl. graceful stop):
* simple: 12 looping VUs for 3m0s (gracefulStop: 30s)
✗ status is 200
↳ 98% — ✓ 4832 / ✗ 95
✗ request cancelled
↳ 1% — ✓ 95 / ✗ 4832
checks.........................: 50.00% ✓ 4927 ✗ 4927
data_received..................: 724 kB 3.8 kB/s
data_sent......................: 1.7 MB 8.7 kB/s
http_req_blocked...............: min=416ns avg=15.44µs med=2.2µs max=3.12ms p(95)=6.66µs p(99)=411.61µs p(99.99)=2.9ms
http_req_connecting............: min=0s avg=10.82µs med=0s max=3.04ms p(95)=0s p(99)=325.58µs p(99.99)=2.84ms
http_req_duration..............: min=84.68ms avg=439.51ms med=401.96ms max=1.47s p(95)=869.7ms p(99)=1.21s p(99.99)=1.47s
{ expected_response:true }...: min=84.68ms avg=427.74ms med=397.04ms max=1.47s p(95)=796.38ms p(99)=1.13s p(99.99)=1.47s
http_req_failed................: 1.92% ✓ 95 ✗ 4832
http_req_receiving.............: min=5.66µs avg=76.7µs med=50.45µs max=2.75ms p(95)=238.18µs p(99)=685.38µs p(99.99)=2.69ms
http_req_sending...............: min=2.37µs avg=18.9µs med=14.33µs max=6.3ms p(95)=33.61µs p(99)=76.57µs p(99.99)=3.78ms
http_req_tls_handshaking.......: min=0s avg=0s med=0s max=0s p(95)=0s p(99)=0s p(99.99)=0s
http_req_waiting...............: min=84.59ms avg=439.42ms med=401.92ms max=1.47s p(95)=869.66ms p(99)=1.21s p(99.99)=1.47s
http_reqs......................: 4927 25.568987/s
iteration_duration.............: min=84.87ms avg=441.96ms med=402.21ms max=11.39s p(95)=870.54ms p(99)=1.21s p(99.99)=6.5s
iterations.....................: 4927 25.568987/s
vus............................: 12 min=0 max=12
vus_max........................: 12 min=0 max=12
running (3m12.7s), 00/12 VUs, 4927 complete and 0 interrupted iterations
simple ✓ [======================================] 12 VUs 3m0s
PR-368
vscode ➜ /workspaces/edge-runtime (ai) $ k6 run --summary-trend-stats "min,avg,med,max,p(95),p(99),p(99.99)" ./k6/dist/specs/gte.js
/\ |‾‾| /‾‾/ /‾‾/
/\ / \ | |/ / / /
/ \/ \ | ( / ‾‾\
/ \ | |\ \ | (‾) |
/ __________ \ |__| \__\ \_____/ .io
execution: local
script: ./k6/dist/specs/gte.js
output: -
scenarios: (100.00%) 1 scenario, 12 max VUs, 3m30s max duration (incl. graceful stop):
* simple: 12 looping VUs for 3m0s (gracefulStop: 30s)
✗ status is 200
↳ 97% — ✓ 8252 / ✗ 180
✗ request cancelled
↳ 2% — ✓ 180 / ✗ 8252
checks.........................: 50.00% ✓ 8432 ✗ 8432
data_received..................: 1.2 MB 6.5 kB/s
data_sent......................: 2.8 MB 15 kB/s
http_req_blocked...............: min=500ns avg=12.4µs med=2.75µs max=7.95ms p(95)=7.06µs p(99)=198.65µs p(99.99)=3.91ms
http_req_connecting............: min=0s avg=7.61µs med=0s max=7.85ms p(95)=0s p(99)=124.92µs p(99.99)=3.84ms
http_req_duration..............: min=109.28ms avg=256ms med=245.92ms max=596.69ms p(95)=376.07ms p(99)=441.57ms p(99.99)=588.84ms
{ expected_response:true }...: min=109.28ms avg=256.11ms med=245.8ms max=596.69ms p(95)=376.26ms p(99)=441.66ms p(99.99)=589.01ms
http_req_failed................: 2.13% ✓ 180 ✗ 8252
http_req_receiving.............: min=8.54µs avg=86.39µs med=62.56µs max=10.49ms p(95)=134.2µs p(99)=463.11µs p(99.99)=5.89ms
http_req_sending...............: min=3.2µs avg=24.54µs med=16.5µs max=5.35ms p(95)=36.58µs p(99)=77.53µs p(99.99)=4.28ms
http_req_tls_handshaking.......: min=0s avg=0s med=0s max=0s p(95)=0s p(99)=0s p(99.99)=0s
http_req_waiting...............: min=109.21ms avg=255.89ms med=245.83ms max=596.62ms p(95)=375.97ms p(99)=441.43ms p(99.99)=588.74ms
http_reqs......................: 8432 44.0193/s
iteration_duration.............: min=109.44ms avg=257.51ms med=246.24ms max=10.78s p(95)=376.32ms p(99)=441.92ms p(99.99)=2.19s
iterations.....................: 8432 44.0193/s
vus............................: 12 min=0 max=12
vus_max........................: 12 min=0 max=12
running (3m11.6s), 00/12 VUs, 8432 complete and 0 interrupted iterations
simple ✓ [======================================] 12 VUs 3m0s