
Multi GPU with PEFT on LLM

Open avishniakov opened this issue 10 months ago • 6 comments

This PR brings the multi-GPU DDP showcase to PEFT training. Some routine steps could be automated in ZenML core; we will create separate follow-up tickets for that.

This is a companion PR to https://github.com/zenml-io/zenml-projects/pull/99 and merges into it, so the diff should be evaluated from that PR's point of view.

Companion PR: https://github.com/zenml-io/zenml/pull/2677

avishniakov avatar Apr 17 '24 14:04 avishniakov

@schustmi @htahir1 you are optional reviewers, in case you are interested 🙂

avishniakov avatar May 03 '24 14:05 avishniakov

I haven't given this a fair shake, but in general, what is a better way of doing this than using subprocess? :-D Any ideas?

I'm not sure this answers your question, but I plan to extend ZenML core with automated capabilities for creating wrappers and making the calls: https://zenml.atlassian.net/jira/software/c/projects/OSSK/boards/13?selectedIssue=OSSK-535 What, in general, is off with subprocesses in your opinion?

avishniakov avatar May 06 '24 10:05 avishniakov

@avishniakov Personally, I feel they are quite unstable and unreliable... A better way would be to use the internal library and do this in code, right?

htahir1 avatar May 06 '24 11:05 htahir1

@avishniakov Personally, I feel they are quite unstable and unreliable... A better way would be to use the internal library and do this in code, right?

In theory, we can hack around the accelerate.commands.launch module, but that module still calls subprocess.Popen for you, so you cannot get away from subprocesses anyway. I will explore how we can use the module directly.
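
For readers following along, here is a minimal sketch of the subprocess approach being discussed. It is not the code from this PR. The helper names and the `train.py` script are hypothetical; the only real interface assumed is the `accelerate launch` CLI with its `--num_processes` and `--multi_gpu` options.

```python
import subprocess
from typing import Dict, List


def build_launch_command(
    script: str, num_processes: int, script_args: Dict[str, str]
) -> List[str]:
    """Build an `accelerate launch` command line for a training script.

    Hypothetical helper: launcher options go before the script path,
    script arguments go after it.
    """
    command = ["accelerate", "launch", "--num_processes", str(num_processes)]
    if num_processes > 1:
        # Request multi-GPU (DDP) mode only when more than one process is used.
        command.append("--multi_gpu")
    command.append(script)
    for name, value in script_args.items():
        command += [f"--{name}", str(value)]
    return command


def run_launch_command(command: List[str]) -> None:
    """Run the command and raise CalledProcessError on a non-zero exit code."""
    subprocess.run(command, check=True)
```

Calling `run_launch_command(build_launch_command("train.py", 2, {"epochs": "3"}))` from inside a step would start two training processes, provided `accelerate` is installed and the script exists. Building the argument list separately from running it keeps the command easy to log and to test without any GPUs.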

avishniakov avatar May 06 '24 11:05 avishniakov

I reworked, fairly heavily, how the functions are prepared in this project. This is tightly coupled with the changes on the ZenML side.

Looking forward to some conceptual feedback. There are definitely a few weak points:

  • The cache invalidation mechanism is far from perfect because of the use of the external function.
  • The calls are handled via the function from inside the step. Making the step itself a "script function" is not straightforward. It is doable, but would need more effort and a shake-up of how we work with steps in core.
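
To illustrate the first weak point, here is one possible way to detect that an external function has changed, so that a cache keyed on it can be invalidated. This is only a sketch under stated assumptions, not the mechanism used in this PR or in ZenML: `function_fingerprint`, `train_v1`, and `train_v2` are hypothetical names, and hashing compiled bytecode ignores changes in anything the function imports or calls.

```python
import hashlib
from typing import Callable


def function_fingerprint(func: Callable) -> str:
    """Return a hash of a function's compiled body (hypothetical helper)."""
    code = func.__code__
    digest = hashlib.sha256()
    digest.update(code.co_code)
    digest.update(repr(code.co_names).encode())
    # Skip nested code objects: their repr embeds a memory address,
    # which would make the fingerprint unstable across runs.
    constants = [c for c in code.co_consts if not hasattr(c, "co_code")]
    digest.update(repr(constants).encode())
    return digest.hexdigest()


def train_v1(lr: float) -> float:
    return lr * 2


def train_v2(lr: float) -> float:
    return lr * 3
```

Comparing `function_fingerprint(train_v1)` with `function_fingerprint(train_v2)` yields different values, while fingerprinting the same function twice yields the same value. The limitation noted in the lead-in is why a scheme like this stays imperfect: a change in a helper that the function calls goes unnoticed.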

avishniakov avatar May 07 '24 11:05 avishniakov

To be merged after 0.58.3/0.59.0 is released

avishniakov avatar Jun 18 '24 15:06 avishniakov