ColossalAI
List of feature ideas
Dear Colossal-AI team,

I have a few feature ideas that I think would be helpful to the project, and I wanted to ask which of them would be most useful so that I can start implementing them.

1. Loki-Promtail is a tool for monitoring distributed logs with Grafana. Connecting the Distributed Logger to it and extracting labels from the log structure would give users a friendly, well-performing way to check cluster logs.
2. I can also create a Prometheus Exporter to collect node status and program statistics. This information can help administrators and developers keep track of all nodes and monitor the detailed performance of the system during training.
3. Since there are already multiple ways to launch a program, we could add PySpark as another option for running the code on the cluster.
4. There are also a few options (such as RayServe and TorchServe) that we could add for deploying models after training, but I wasn't sure whether that would be aligned with the project's goals.
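To make the label-extraction idea more concrete, here is a minimal sketch using only the Python standard library. The log line layout, the `LOG_PATTERN` regex, and the `extract_labels` helper are all hypothetical and only for illustration; the actual Distributed Logger format may differ, and in a real setup this parsing would more likely live in the Promtail pipeline configuration than in Python.

```python
import json
import re

# Assumed (hypothetical) log layout: "[timestamp] LEVEL rank=N message".
# The real Distributed Logger output may look different.
LOG_PATTERN = re.compile(
    r"^\[(?P<timestamp>[^\]]+)\]\s+(?P<level>[A-Z]+)\s+rank=(?P<rank>\d+)\s+(?P<message>.*)$"
)


def extract_labels(line):
    """Split one log line into low-cardinality labels plus the message.

    Returns None when the line does not match the assumed layout.
    """
    match = LOG_PATTERN.match(line)
    if match is None:
        return None
    fields = match.groupdict()
    return {
        # Labels are the fields one would filter on in Grafana.
        "labels": {"level": fields["level"].lower(), "rank": fields["rank"]},
        "timestamp": fields["timestamp"],
        "message": fields["message"],
    }


if __name__ == "__main__":
    sample = "[2022-03-01 12:00:00] INFO rank=3 epoch 1 finished"
    print(json.dumps(extract_labels(sample), indent=2))
```

Only the log level and the rank are kept as labels here, since those are the fields one would typically filter on; the free-form message stays in the log body.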
Thanks in advance.
Hi @SMesForoush, thanks for your great ideas! We are doing something similar to your first idea: we profile GPU memory usage and communication bandwidth in our program. We currently log this to Tensorboard, but I would love to support Loki-Promtail as well. As for the PySpark launch option, that sounds like a good feature to me. For model deployment, we are currently developing an inference system separately.
Should I create 3 new issues for these and start working on them?
Sure, you can post the issues separately and describe your proposed solution in each. Our engineers are willing to join the discussion.
Okay, sure. Thanks!
Great, looking forward to it, @SMesForoush! 🔥🔥🔥
We have updated a lot. This issue was closed due to inactivity. Thanks.