aibrix [RFC] Model management in Runtime

Summary

The current runtime provides the ability to download model files from different sources, but lacks management capabilities for model files.

Motivation

This issue aims to add model file management capabilities to the runtime. The work covers the following aspects:

Maintain models via a RESTful API: We can instruct the runtime to download or clean up model files. Related issues: #196, #49
Visibility of all models:
1. Manage model files in different paths: Model files may be downloaded to different directories, and we need to track them across all of those directories.
2. Distinguish the status of model files: A model file can be in different states, such as downloaded, downloading, or deleting, and the runtime can act on these states. Related issues: #454
Observability: The performance of the download phase should be monitored.

Proposed Change

No response

Alternatives Considered

No response

Dec 11 '24 10:12 brosoul

Alternatives Considered

Distinguish the status of model files

Currently, it is possible to distinguish whether a file has been downloaded through the metadata file under .cache/. However, it is impossible to distinguish whether the file is in the downloading state or not. Is it possible to use a tool similar to lsof (as shown in the following figure) to determine whether files in the download directory are held open by other processes (that is, still being downloaded and without a metadata file under .cache/)?
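To make the lsof idea concrete, here is a minimal sketch of the same check in Python. It assumes a Linux host where /proc is readable; the function name `pids_holding_file` is hypothetical and not part of the runtime. It only sees processes whose file descriptors the caller is allowed to inspect, so it is a hint rather than a guarantee.

```python
import os


def pids_holding_file(path):
    """Return PIDs that currently have `path` open (Linux only, best effort).

    Mimics what `lsof <path>` reports by scanning /proc/<pid>/fd symlinks.
    """
    if not os.path.isdir("/proc"):
        return []  # assumption: non-Linux hosts simply report nothing
    target = os.path.realpath(path)
    holders = []
    for entry in os.listdir("/proc"):
        if not entry.isdigit():
            continue  # skip non-process entries such as /proc/meminfo
        fd_dir = os.path.join("/proc", entry, "fd")
        try:
            fds = os.listdir(fd_dir)
        except OSError:
            continue  # process exited or we lack permission
        for fd in fds:
            try:
                if os.readlink(os.path.join(fd_dir, fd)) == target:
                    holders.append(int(entry))
                    break
            except OSError:
                continue  # descriptor was closed while we were scanning
    return holders
```

A file in the download directory with no metadata file under .cache/ and a non-empty result from this function would then be treated as "downloading".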

@Jeffwan Please help check if this plan is suitable, or if there is a better solution available?

Dec 13 '24 03:12 brosoul

Currently, it is possible to distinguish whether a file has been downloaded through the metadata file under .cache/. However, it is impossible to distinguish whether the file is in the downloading state or not.

I didn't quite understand this part. What's the state machine? The 1st statement makes sense. Do you mean model or file? What's the difference between the 1st and 2nd statements?

Dec 16 '24 19:12 Jeffwan

I feel there are a few ways; let's explore them together with lsof:

1. Create a .tmp file and rename it once the download has finished.
2. Create a separate xxx.lock file and remove it once the download has finished.
3. Checksum. We already talked about it, and it's expensive.
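A minimal sketch of how the first two options could surface a file state, assuming the marker suffixes `.tmp` and `.lock` and the hypothetical names `FileState`, `download_atomically`, and `file_state` (none of these exist in the runtime today):

```python
import os
from enum import Enum


class FileState(Enum):
    MISSING = "missing"
    DOWNLOADING = "downloading"
    DOWNLOADED = "downloaded"


def download_atomically(dest, chunks):
    """Write chunks to dest + '.tmp', then rename to dest when complete."""
    tmp = dest + ".tmp"
    with open(tmp, "wb") as f:
        for chunk in chunks:
            f.write(chunk)
        f.flush()
        os.fsync(f.fileno())  # make sure the bytes are on disk before renaming
    os.replace(tmp, dest)  # atomic when tmp and dest are on the same filesystem


def file_state(dest):
    """Infer the state of dest from marker files next to it."""
    if os.path.exists(dest + ".tmp") or os.path.exists(dest + ".lock"):
        return FileState.DOWNLOADING
    if os.path.exists(dest):
        return FileState.DOWNLOADED
    return FileState.MISSING
```

One caveat with both options: a download that crashes leaves a stale .tmp or .lock file behind, so the marker alone cannot tell "downloading" apart from "interrupted".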

Let's have a short discussion offline

Dec 16 '24 19:12 Jeffwan

It seems that Hugging Face uses an xxx.lock file with FileLock from the filelock library to control locking during the download process. cc @Jeffwan refs: huggingface source code
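If we rely on those lock files, the runtime could probe whether a lock is currently held instead of only checking that the file exists. Below is a rough sketch using only the standard library; as far as I know, filelock is built on `fcntl.flock` on Unix, but that is an assumption worth verifying, and `is_lock_held` is a hypothetical helper name.

```python
import fcntl
import os


def is_lock_held(lock_path):
    """Return True if someone else holds an exclusive flock on lock_path (Unix only)."""
    try:
        fd = os.open(lock_path, os.O_RDWR)
    except FileNotFoundError:
        return False  # no lock file at all, so nothing is downloading
    try:
        # Non-blocking attempt: fails immediately if the lock is already taken.
        fcntl.flock(fd, fcntl.LOCK_EX | fcntl.LOCK_NB)
    except BlockingIOError:
        return True
    else:
        fcntl.flock(fd, fcntl.LOCK_UN)  # we only probed, so release right away
        return False
    finally:
        os.close(fd)
```

A held lock is a stronger signal than the mere presence of xxx.lock, because the lock file can outlive a crashed process while the lock itself is released when the process exits.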

Dec 21 '24 05:12 brosoul