[RFC] Model management in Runtime
Summary
The current runtime provides the ability to download model files from different sources, but lacks management capabilities for model files.
Motivation
This RFC aims to add model file management capabilities to the runtime. The work covers the following aspects:
- Maintain models through a RESTful API: The runtime can be instructed to download or clean up model files. Related issues: #196, #49
- Visibility of all models:
  - Manage model files in different paths: Model files may be downloaded to different directories, and we need to track them across those directories.
  - Distinguish the status of model files: A model file can be in different states, such as downloaded, downloading, deleting, etc. We can use these states to drive further actions. Related issues: #454
- Observability: The performance of the download phase should be monitored.
Proposed Change
No response
Alternatives Considered
No response
- Distinguish the status of model files
Currently, it is possible to tell whether a file has been downloaded by checking for its metadata file under .cache/. However, it is not possible to tell whether a file is still being downloaded. Could we use a tool similar to lsof (as shown in the following figure) to determine whether files in the download directory are held open by another process, i.e. are still being downloaded and do not yet have a metadata file under .cache/?
@Jeffwan Could you check whether this plan is suitable, or whether there is a better solution?
Currently, it is possible to distinguish whether a file has been downloaded through the metadata file under .cache/. However, it is impossible to distinguish whether the file is in the downloading state or not.
I didn't quite understand this part. What's the state machine? The first statement makes sense. Do you mean model or file? What's the difference between the first and second statements?
I feel there are a few ways; let's explore them together with lsof:
- Create a .tmp file and rename it once the download has finished.
- Create a separate xxx.lock file and remove it once the download has finished.
- Checksum. We already talked about it, and it's expensive.
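To make the first option concrete, here is a minimal sketch of the .tmp-then-rename idea, assuming the download and the final path live on the same filesystem so that the rename is atomic. The names `download_atomically` and `file_status`, and the status strings, are hypothetical and not part of the current runtime:

```python
import os
from pathlib import Path


def download_atomically(dest: Path, chunks) -> None:
    """Write chunks to '<dest>.tmp', then rename onto dest once complete."""
    tmp = dest.with_name(dest.name + ".tmp")
    with open(tmp, "wb") as f:
        for chunk in chunks:
            f.write(chunk)
        f.flush()
        os.fsync(f.fileno())  # make sure the bytes are on disk before renaming
    # os.replace is atomic when tmp and dest are on the same filesystem,
    # so readers never observe a partially written dest.
    os.replace(tmp, dest)


def file_status(dest: Path) -> str:
    """Infer a (hypothetical) status from which files exist on disk."""
    tmp = dest.with_name(dest.name + ".tmp")
    if tmp.exists():
        return "downloading"
    if dest.exists():
        return "downloaded"
    return "absent"
```

One caveat with this approach: a .tmp file left behind by a crashed download would still be reported as downloading, so some cleanup or staleness check would likely be needed.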
Let's have a short discussion offline
It seems that Huggingface uses an xxx.lock file, with FileLock from the filelock library, to control locking during the download process. cc @Jeffwan
refs: huggingface source code
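For discussion, a rough standard-library sketch of the lock-file idea follows. This is not how filelock or Huggingface implement it; it only illustrates using the existence of an xxx.lock file as the "downloading" signal. `download_lock` and `is_downloading` are hypothetical names:

```python
import os
from contextlib import contextmanager
from pathlib import Path


@contextmanager
def download_lock(dest: Path):
    """Hold '<dest>.lock' for the duration of a download (illustrative only)."""
    lock = dest.with_name(dest.name + ".lock")
    # O_CREAT | O_EXCL makes creation fail with FileExistsError if the lock
    # file already exists, so only one downloader can hold it at a time.
    fd = os.open(lock, os.O_CREAT | os.O_EXCL | os.O_WRONLY)
    try:
        os.write(fd, str(os.getpid()).encode())  # record the owner for debugging
        yield lock
    finally:
        os.close(fd)
        os.unlink(lock)


def is_downloading(dest: Path) -> bool:
    """A file counts as downloading while its lock file exists."""
    return dest.with_name(dest.name + ".lock").exists()
```

Usage would look like `with download_lock(path): ...` around the actual download. As with the .tmp approach, a lock file left behind by a killed process would need separate handling, for example by checking whether the recorded PID is still alive.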