zenml-projects
Multi GPU with PEFT on LLM
This PR brings the multi-GPU DDP showcase to the PEFT training. There are some routine steps that can be automated in ZenML core; we will create follow-up tickets for that separately.
This is a companion PR to https://github.com/zenml-io/zenml-projects/pull/99 and is merged into it, so the diff should be evaluated from that PR's point of view.
Companion PR: https://github.com/zenml-io/zenml/pull/2677
@schustmi @htahir1 you are optional reviewers, just in case you have interest 🙂
I haven't given this a fair shake, but in general, what is a better way of doing this vs. using subprocess? :-D Any ideas?
Not sure if this answers your question, but I plan to extend the ZenML core to provide automated capabilities for creating wrappers + making the calls. https://zenml.atlassian.net/jira/software/c/projects/OSSK/boards/13?selectedIssue=OSSK-535 What, in your opinion, is off with subprocessing in general?
@avishniakov Personally I feel they are quite unstable and unreliable... a better way would to use the internal library and do this in code right?
In theory, we can hack around the accelerate.commands.launch module, but that module will still call subprocess.Popen for you, so you cannot get away from subprocessing anyway. I will explore how we can use the module directly.
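For context, the subprocess-based approach amounts to something like the sketch below. The helper names (`build_launch_cmd`, `launch_training`) are hypothetical; the flags mirror the documented `accelerate launch` CLI:

```python
import subprocess


def build_launch_cmd(script: str, num_processes: int, *script_args: str) -> list[str]:
    # Build an `accelerate launch` invocation for a multi-GPU DDP run.
    # `--multi_gpu` and `--num_processes` follow the accelerate CLI.
    return [
        "accelerate", "launch",
        "--multi_gpu",
        f"--num_processes={num_processes}",
        script,
        *script_args,
    ]


def launch_training(script: str, num_processes: int, *script_args: str) -> None:
    # check=True turns a non-zero exit code of the training script into
    # a CalledProcessError, so a failed run surfaces as a Python exception.
    subprocess.run(build_launch_cmd(script, num_processes, *script_args), check=True)
```

Hacking around the launch module directly would remove the `accelerate` CLI indirection, but as noted, the process spawning itself still happens via `subprocess.Popen` underneath.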
I reworked fairly heavily how the preparation of the functions is done in this project. This is tightly coupled with the changes on the ZenML side.
Looking forward to some conceptual feedback. There are definitely a few weak points:
- The cache-invalidation mechanism is far from perfect, due to the use of an external function.
- The calls are handled via a function inside the step. It is not straightforward to make the step a "script-function" by itself; this is doable but would need more effort and a rethinking of how we work with steps in the core.
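On the first weak point, one possible direction is to fold a fingerprint of the external training function into the step's cache key, so that editing the function invalidates the cache. A minimal sketch of that idea (not ZenML's actual mechanism; `code_fingerprint` is a hypothetical helper):

```python
import hashlib


def code_fingerprint(fn) -> str:
    """Hash a function's compiled bytecode and constants so that changes
    to an external training function can feed into a cache key.

    Hashing bytecode (rather than source text) also works when the
    source file is not available at runtime.
    """
    payload = fn.__code__.co_code + repr(fn.__code__.co_consts).encode()
    return hashlib.sha256(payload).hexdigest()
```

This would still miss changes in transitively called code, which is part of why the current invalidation story is imperfect.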
To be merged after 0.58.3/0.59.0 is released