Large Language Model Edge Benchmark Suite: Implementation on KubeEdge-Ianvs
What would you like to be added/modified: A benchmark suite for large language models deployed at the edge using KubeEdge-Ianvs:
- Interface Design and Usage Guidelines Document;
- Implementation of NLP Large Language Models (LLMs) Benchmark Suite Based on Ianvs:
  - 2.1 Extensive support for mainstream industry benchmark dataset formats such as MMLU, CMMLU, and other open-source datasets.
  - 2.2 Visualization of the LLMs invocation process, including console output, logging of task execution, monitoring, etc.
- Generation of Benchmark Testing Reports Based on Ianvs:
  - 3.1 Test at least three types of LLMs.
  - 3.2 Present computed performance metrics such as ACC, Recall, F1, latency, bandwidth, etc., with metric dimensions referencing the national standard "Artificial Intelligence - Pretrained Models Part 2: Evaluation Metrics and Methods" (a minimal evaluation sketch is given after this list).
- (Advanced) Efficient Evaluation: Concurrent execution of tasks, automatic request and result collection.
- (Advanced) Integration of Benchmark Testing Suite into the LLMs Training Process.
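For concreteness, here is a minimal sketch (not part of the issue requirements) of how the suite might score an edge-deployed LLM on an MMLU-style multiple-choice set while also recording latency. The `generate` callable, the `MMLUItem` dataclass, and all field names are illustrative assumptions rather than an existing Ianvs API:

```python
import time
from dataclasses import dataclass
from typing import Callable, List

@dataclass
class MMLUItem:
    """One multiple-choice question in the MMLU/CMMLU format."""
    question: str
    choices: List[str]   # four options in A/B/C/D order
    answer: str          # gold label, e.g. "C"

def format_prompt(item: MMLUItem) -> str:
    """Render a question as a zero-shot multiple-choice prompt."""
    options = "\n".join(f"{label}. {text}"
                        for label, text in zip("ABCD", item.choices))
    return f"{item.question}\n{options}\nAnswer with a single letter (A/B/C/D):"

def evaluate(items: List[MMLUItem], generate: Callable[[str], str]) -> dict:
    """Run the model over all items, collecting accuracy and latency."""
    correct, latencies = 0, []
    for item in items:
        start = time.perf_counter()
        reply = generate(format_prompt(item))      # model under test
        latencies.append(time.perf_counter() - start)
        # Take the first A-D letter in the reply as the predicted choice.
        pred = next((c for c in reply.upper() if c in "ABCD"), "")
        correct += int(pred == item.answer)
    return {
        "acc": correct / len(items),
        "avg_latency_s": sum(latencies) / len(latencies),
        "p95_latency_s": sorted(latencies)[int(0.95 * (len(latencies) - 1))],
    }
```

In practice, such a function could be wrapped as an Ianvs test algorithm, with `generate` pointing at the inference endpoint of the edge-deployed model.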
Why is this needed: Due to the size of models and data, Large Language Models (LLMs) are usually trained in the cloud. At the same time, concerns about commercial confidentiality and user privacy when using LLMs have made deploying them on edge devices a research hotspot. Quantization techniques are enabling edge-side LLM inference; however, the limited resources of edge devices affect inference latency and accuracy compared with running these models in the cloud. Ianvs aims to benchmark the edge-side deployment of cloud-trained LLMs, leveraging its container resource management capabilities and edge-cloud synergy.
Recommended Skills: TensorFlow/PyTorch, LLMs, Docker
Useful links:
- KubeEdge-Ianvs
- KubeEdge-Ianvs Benchmark Test Cases
- Building Edge-Cloud Synergy Simulation Environment with KubeEdge-Ianvs
- Artificial Intelligence - Pretrained Models Part 2: Evaluation Metrics and Methods
- Example LLMs Benchmark List
- Docker Resource Management
If anyone has questions regarding this issue, please feel free to leave a message here. We would also appreciate it if new members could introduce themselves to the community.
To complete this issue, do I need corresponding GPU resources to run large models for project debugging? Additionally, I am aware of an outstanding project called OpenCompass that evaluates LLMs, but it builds on InternLM's mmengine project. For this issue, is it preferred to write one's own framework rather than importing libraries from other projects?
Yes, given CUDA's acceleration of neural network training and inference, I think you need basic NVIDIA GPU resources. However, we are thinking about simulating LLM inference on edge nodes (e.g. smartphones), so I don't think you need A100-class GPU resources. Nowadays, the SoC unified memory of a typical smartphone is 8GB or 16GB, so any GPU in this range can serve as the simulation environment. It is normal for LLMs to run out of memory (OOM) in this setting, and that is exactly what we want to explore: what size of LLM is best suited for edge devices. A rough sketch of how such a memory budget could be simulated is given below.
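For illustration only, here is a sketch of simulating a smartphone-class memory budget on a desktop GPU by combining 4-bit quantization with a capped `max_memory`, using Hugging Face Transformers and bitsandbytes; the model id is just an example, not a requirement of this issue:

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig

model_id = "Qwen/Qwen1.5-1.8B-Chat"   # example small chat model, not prescribed by the issue
bnb_config = BitsAndBytesConfig(load_in_4bit=True,
                                bnb_4bit_compute_dtype=torch.float16)

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    quantization_config=bnb_config,
    device_map="auto",
    # Cap GPU 0 at 8 GiB to mimic a smartphone-class memory budget;
    # weights that do not fit are offloaded to CPU (or the load fails with OOM).
    max_memory={0: "8GiB", "cpu": "16GiB"},
)

prompt = "Explain edge computing in one sentence."
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
torch.cuda.reset_peak_memory_stats()
outputs = model.generate(**inputs, max_new_tokens=64)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
print(f"peak VRAM: {torch.cuda.max_memory_allocated() / 2**30:.2f} GiB")
```

The actual memory limits and quantization scheme would of course be part of the benchmark design rather than fixed as above.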
Thanks for introducing the project; it is a great reference. In my opinion, the metrics that project focuses on are accuracy-oriented ones such as Accuracy, BLEU, etc. However, we would like to contribute latency, resource usage, and other metrics that edge devices care more about to edge LLM inference. Considering the time, I think it is a safer solution to add the metrics we care about and combine them with Ianvs, referencing the existing framework. Of course, if time permits, we welcome any feasible solution. A rough sketch of the kind of resource-usage measurement we have in mind is given below.
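As an illustration of the edge-side metrics mentioned above (latency and resource usage), the sketch below samples the current process's CPU and memory around a model call using `psutil`; the class name, sampling interval, and the `generate` call are illustrative assumptions, not an existing Ianvs or OpenCompass interface:

```python
import time
import threading
import psutil  # pip install psutil


class ResourceMonitor:
    """Sample process CPU% and RSS memory in a background thread while
    a model call runs, and record end-to-end latency."""

    def __init__(self, interval: float = 0.1):
        self.interval = interval
        self.samples = []                 # (cpu_percent, rss_bytes) tuples
        self._stop = threading.Event()
        self._proc = psutil.Process()

    def _sample(self):
        while not self._stop.is_set():
            self.samples.append((self._proc.cpu_percent(interval=None),
                                 self._proc.memory_info().rss))
            time.sleep(self.interval)

    def __enter__(self):
        self._start = time.perf_counter()
        self._thread = threading.Thread(target=self._sample, daemon=True)
        self._thread.start()
        return self

    def __exit__(self, *exc):
        self._stop.set()
        self._thread.join()
        self.latency_s = time.perf_counter() - self._start
        self.peak_rss_mb = (max(s[1] for s in self.samples) / 2**20
                            if self.samples else 0.0)

# Usage (generate is a hypothetical callable wrapping the model under test):
# with ResourceMonitor() as mon:
#     reply = generate("What is edge computing?")
# print(mon.latency_s, mon.peak_rss_mb)
```

Metrics collected this way could then be reported alongside ACC/Recall/F1 in the Ianvs test report.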