List of feature ideas

Open SMesForoush opened this issue 3 years ago • 5 comments

Dear Colossal-AI team,

I have a few feature ideas that I think would help the project, and I'd like to know which of them would be most useful so I can start implementing them.

Loki-Promtail is a tool for monitoring distributed logs with Grafana. Connecting the Distributed Logger to it and extracting labels from the log structure would give a user-friendly, well-performing way to check cluster logs.

I can also create a Prometheus exporter to collect node status and program statistics. This information can help administrators and developers keep track of all nodes and monitor the detailed performance of the system during training.

Since there are multiple ways to launch a program, we can also add PySpark as another option for running the code on the cluster.

There are also a few options (such as RayServe and TorchServe) we could add for deploying models after training, but I wasn't sure whether that would be aligned with the project's goals.
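To make the first idea more concrete, here is a rough sketch of what "extracting labels from the log structure" could look like on the logging side. It uses only Python's standard `logging` module and emits one JSON object per line, so a log shipper that can parse JSON (Promtail's pipeline stages are one candidate) could promote fields such as `rank` or `node` to labels. The names `JsonLogFormatter`, `build_logger`, and the field names are hypothetical and are not part of Colossal-AI's Distributed Logger.

```python
import json
import logging


class JsonLogFormatter(logging.Formatter):
    """Render each log record as a single JSON line (hypothetical helper)."""

    def __init__(self, rank: int, node: str):
        super().__init__()
        # Assumed identifying fields; a real setup would read these from
        # the distributed launch environment.
        self.rank = rank
        self.node = node

    def format(self, record: logging.LogRecord) -> str:
        payload = {
            "time": self.formatTime(record),
            "level": record.levelname,
            "logger": record.name,
            "rank": self.rank,
            "node": self.node,
            "message": record.getMessage(),
        }
        return json.dumps(payload)


def build_logger(rank: int, node: str) -> logging.Logger:
    """Create a per-rank logger that writes JSON lines to stderr."""
    logger = logging.getLogger(f"sketch.rank{rank}")
    logger.setLevel(logging.INFO)
    logger.propagate = False  # avoid duplicate output via the root logger
    if not logger.handlers:
        handler = logging.StreamHandler()
        handler.setFormatter(JsonLogFormatter(rank, node))
        logger.addHandler(handler)
    return logger
```

Calling `build_logger(0, "node-a").info("epoch %d done", 1)` would write one JSON line containing the rank, node, level, and message. Which fields become labels would then be decided in the shipper's configuration rather than in the training code.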

Thanks in advance.

SMesForoush avatar Mar 12 '22 06:03 SMesForoush

Hi @SMesForoush, thanks for your great ideas! We are doing something similar to your first idea. We profile the GPU memory usage and communication bandwidth in our program. We currently log this to TensorBoard, but I would love to support Loki-Promtail as well. As for the PySpark launch option, it sounds like a good feature to me. For model deployment, we are currently developing an inference system separately.

FrankLeeeee avatar Mar 12 '22 10:03 FrankLeeeee

Should I create 3 new issues for these and start working on them?

SMesForoush avatar Mar 13 '22 05:03 SMesForoush

Sure, you can post the issues separately and describe your solution for each. Our engineers are willing to join the discussion.

feifeibear avatar Mar 13 '22 08:03 feifeibear

Okay, sure. Thanks.

SMesForoush avatar Mar 19 '22 15:03 SMesForoush

Great, looking forward to it! @SMesForoush 🔥🔥🔥

FrankLeeeee avatar Mar 19 '22 15:03 FrankLeeeee

We have updated a lot. This issue was closed due to inactivity. Thanks.

binmakeswell avatar Apr 13 '23 03:04 binmakeswell