[RFC] Model management in Runtime

Open brosoul opened this issue 1 year ago • 4 comments

Summary

The current runtime provides the ability to download model files from different sources, but lacks management capabilities for model files.

Motivation

This issue aims to add model file management capabilities to the runtime. The work covers the following aspects:

  1. Manage models through a RESTful API: We can instruct the runtime to download or clean up model files. Related issues: #196, #49
  2. Visibility of all models:
    1. Manage model files in different paths: Model files may be downloaded to different directories, so we need to track them across all of those directories.
    2. Distinguish the status of model files: A model file can be in different states, such as downloaded, downloading, or deleting. The runtime can act on these states, for example to avoid using a file that is still being downloaded. Related issues: #454
  3. Observability: It would be useful to monitor the performance of the download phase.
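To make item 2 concrete, here is a minimal sketch of how the runtime might track model files and their states across several directories. It is only an illustration: `ModelStatus`, `ModelRecord`, and `ModelRegistry` are hypothetical names, not existing runtime APIs, and the set of states is taken from the examples listed above.

```python
from __future__ import annotations

from dataclasses import dataclass
from enum import Enum
from pathlib import Path
from typing import Optional


class ModelStatus(Enum):
    # States named in the RFC; a real implementation may need more.
    DOWNLOADING = "downloading"
    DOWNLOADED = "downloaded"
    DELETING = "deleting"


@dataclass
class ModelRecord:
    name: str
    path: Path
    status: ModelStatus


class ModelRegistry:
    """Hypothetical in-memory index of model files, keyed by (name, directory)."""

    def __init__(self) -> None:
        self._records: dict[tuple[str, Path], ModelRecord] = {}

    def set_status(self, name: str, path: Path, status: ModelStatus) -> None:
        # The same model name may live in several directories, so the
        # directory is part of the key.
        self._records[(name, path)] = ModelRecord(name, path, status)

    def remove(self, name: str, path: Path) -> None:
        self._records.pop((name, path), None)

    def list_models(self, status: Optional[ModelStatus] = None) -> list[ModelRecord]:
        # Return every record, or only those in the requested state.
        return [
            r for r in self._records.values()
            if status is None or r.status is status
        ]
```

A RESTful "list models" endpoint could then simply serialize `list_models()`, optionally filtered by status.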

Proposed Change

No response

Alternatives Considered

No response

brosoul avatar Dec 11 '24 10:12 brosoul

Alternatives Considered

  1. Distinguish the status of model files: Currently, it is possible to distinguish whether a file has been downloaded through the metadata file under .cache/. However, it is impossible to distinguish whether the file is in the downloading state or not. Would it be possible to use a tool similar to lsof (as shown in the attached image) to check whether files in the download directory are held open by another process, that is, still being downloaded and not yet accompanied by a metadata file under .cache/?
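As a rough illustration of the lsof-style check, the sketch below scans `/proc/<pid>/fd` to find processes that currently hold a file open. It assumes a Linux host and sufficient permissions to read other processes' descriptor tables; `pids_holding` is a hypothetical helper name, not part of the runtime.

```python
import os


def pids_holding(path: str) -> set[int]:
    """Return PIDs that have `path` open, by scanning /proc/<pid>/fd (Linux only)."""
    target = os.path.realpath(path)
    holders: set[int] = set()
    for entry in os.listdir("/proc"):
        if not entry.isdigit():
            continue  # not a process directory
        fd_dir = f"/proc/{entry}/fd"
        try:
            fds = os.listdir(fd_dir)
        except OSError:
            continue  # process exited, or we may not inspect it
        for fd in fds:
            try:
                # Each entry is a symlink to the file the descriptor refers to.
                if os.readlink(os.path.join(fd_dir, fd)) == target:
                    holders.add(int(entry))
                    break
            except OSError:
                continue  # descriptor was closed while we were scanning
    return holders
```

A file in the download directory with a non-empty `pids_holding(...)` result and no metadata file under .cache/ could then be reported as downloading. Note that the scan walks every process, so it may be slow on busy hosts, and it cannot see processes in other PID namespaces.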

@Jeffwan Please help check if this plan is suitable, or if there is a better solution available?

brosoul avatar Dec 13 '24 03:12 brosoul

Currently, it is possible to distinguish whether a file has been downloaded through the metadata file under .cache/. However, it is impossible to distinguish whether the file is in the downloading state or not.

I didn't quite understand this part. What's the state machine? The 1st statement makes sense, but do you mean model or file? What's the difference between the 1st and 2nd statements?

Jeffwan avatar Dec 16 '24 19:12 Jeffwan

I feel there are a few ways. Let's explore them together with lsof:

    1. Create a .tmp file and rename it once the download has finished.
    2. Create a separate xxx.lock file and remove it once the download has finished.
    3. Checksum. We already talked about it, and it's expensive.
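The first option (write to a .tmp file, rename when finished) could look roughly like the sketch below. The `.tmp` suffix and the helper names `write_atomically` and `file_state` are assumptions made for illustration only.

```python
import os
from pathlib import Path
from typing import Iterable

TMP_SUFFIX = ".tmp"  # assumed naming convention, not an existing runtime setting


def write_atomically(chunks: Iterable[bytes], dest: Path) -> None:
    """Write chunks to <dest>.tmp, then rename to dest once complete."""
    tmp = dest.with_name(dest.name + TMP_SUFFIX)
    with open(tmp, "wb") as f:
        for chunk in chunks:
            f.write(chunk)
        f.flush()
        os.fsync(f.fileno())  # make sure the data is on disk before renaming
    # os.replace is atomic when tmp and dest are on the same filesystem,
    # so readers see either no file or the complete file.
    os.replace(tmp, dest)


def file_state(dest: Path) -> str:
    """Infer the state of a model file from what exists on disk."""
    if dest.exists():
        return "downloaded"
    if dest.with_name(dest.name + TMP_SUFFIX).exists():
        return "downloading"
    return "absent"
```

One limitation: if the downloader crashes, the stale .tmp file stays behind and still looks like "downloading", so this option probably needs a cleanup step or a liveness check on top.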

Let's have a short discussion offline

Jeffwan avatar Dec 16 '24 19:12 Jeffwan

It seems that Hugging Face uses an xxx.lock file together with FileLock from the filelock library to control locking during the download process. cc @Jeffwan refs: huggingface source code
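For discussion, here is a minimal sketch of the lock-file idea. It deliberately uses `fcntl.flock` from the standard library as a stand-in for the filelock package (so it is Unix only), and `download_lock` and `is_downloading` are hypothetical helper names. It is not a copy of the Hugging Face implementation.

```python
import fcntl
import os
from contextlib import contextmanager


@contextmanager
def download_lock(dest: str):
    """Hold an exclusive advisory lock on <dest>.lock for the duration of a download."""
    lock_path = dest + ".lock"
    fd = os.open(lock_path, os.O_CREAT | os.O_RDWR, 0o644)
    try:
        fcntl.flock(fd, fcntl.LOCK_EX)  # blocks until the lock is free
        yield lock_path
    finally:
        fcntl.flock(fd, fcntl.LOCK_UN)
        os.close(fd)


def is_downloading(dest: str) -> bool:
    """Return True if some holder currently has <dest>.lock locked."""
    lock_path = dest + ".lock"
    try:
        fd = os.open(lock_path, os.O_RDWR)
    except FileNotFoundError:
        return False  # no lock file, so nobody is downloading
    try:
        # A non-blocking attempt fails immediately if the lock is held.
        fcntl.flock(fd, fcntl.LOCK_EX | fcntl.LOCK_NB)
    except BlockingIOError:
        return True
    else:
        fcntl.flock(fd, fcntl.LOCK_UN)
        return False
    finally:
        os.close(fd)
```

Compared with a bare marker file, a leftover .lock file from a crashed downloader does not read as "downloading" here, because the kernel releases the lock when the owning process exits.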

brosoul avatar Dec 21 '24 05:12 brosoul