llama-stack
[draft] huggingface-ilab full-precision model fine-tuning
What does this PR do?
Closes #1427
NOTE: This PR description will be updated when the PR moves from DRAFT to OPEN.
Adds inline::huggingface-ilab to /post_training providers.
This provider makes some (somewhat temporary) assumptions about dataset shape and style, and about the model / tokenizer source (e.g. all calls have to be compliant with Hugging Face APIs).
The current impl includes some starlette "hacks" to make synchronous background tasks work correctly. It also does no queuing: new requests are blocked while a request is running.
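The "no queuing" behaviour can be pictured as a single-slot guard. This is a minimal sketch, not the code in this PR: the names SingleJobSlot, try_start, and finish are hypothetical and chosen only for illustration.

```python
import threading
from typing import Optional


class SingleJobSlot:
    """Hypothetical guard that admits one job at a time and rejects the rest."""

    def __init__(self) -> None:
        self._lock = threading.Lock()
        self._current_job_id: Optional[str] = None

    def try_start(self, job_id: str) -> bool:
        # Return False instead of waiting, so a second request is rejected
        # immediately rather than queued.
        with self._lock:
            if self._current_job_id is not None:
                return False
            self._current_job_id = job_id
            return True

    def finish(self, job_id: str) -> None:
        # Free the slot only if the caller actually owns it.
        with self._lock:
            if self._current_job_id == job_id:
                self._current_job_id = None
```

A request handler would call try_start before scheduling the background task and return an error response when it gets False. The slot stays occupied until finish is called.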
Next step is to implement the training loop. This will be invoked by running torchrun <args> train.py <args> as a subprocess to implement FSDP training. The parent-most subprocess handle will live in the current_job object and will be referenced to get the running status of the job. psutil can be used in conjunction with the Process.pid to do some more involved monitoring. Training logs will be written to cache rather than stdout.
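A rough sketch of that plan, using only the standard library. The helper names launch_training and job_status and the status labels are assumptions made for illustration. The demo at the bottom runs a trivial Python command as a stand-in for the real torchrun <args> train.py <args> invocation, and psutil-based monitoring is left out.

```python
import subprocess
import sys
import tempfile
from pathlib import Path
from typing import List, Optional


def launch_training(cmd: List[str], log_path: Path) -> subprocess.Popen:
    """Start the training command, sending stdout and stderr to a log file."""
    log_path.parent.mkdir(parents=True, exist_ok=True)
    with open(log_path, "ab") as log_file:
        # The child gets its own copy of the file descriptor, so the parent
        # can close its handle as soon as Popen returns.
        return subprocess.Popen(cmd, stdout=log_file, stderr=subprocess.STDOUT)


def job_status(proc: subprocess.Popen) -> str:
    """Map the process handle to a coarse status string (hypothetical labels)."""
    return_code: Optional[int] = proc.poll()  # None while still running
    if return_code is None:
        return "running"
    return "completed" if return_code == 0 else "failed"


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as cache_dir:
        log = Path(cache_dir) / "logs" / "train.log"
        # Stand-in command; the real job would be the torchrun invocation.
        proc = launch_training([sys.executable, "-c", "print('step 0')"], log)
        proc.wait()
        print(job_status(proc), log.read_text().strip())
```

Holding on to the Popen handle and calling poll() means the status check never looks a process up by PID, so it cannot be confused by a PID that was reused.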
KNOWN BUGS:
- If a failure happens while current_job is in a non-finalized state and the background tasks fail, the state has no TTL after which it would be cleared, so the endpoint will effectively be broken until the server is restarted.
- Lots of fields in incoming user configuration are ignored.
- Cache may be irrevocably destroyed at the end of a run if 'storage_dir' isn't set (because of tempdirs).
- PIDs can be recycled, so monitoring the training job by PID is an unsustainable pattern.
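One possible way to address the cache bug above, sketched under the assumption that the loss comes from a temporary directory that cleans itself up. The function resolve_storage_dir and the ilab-cache- prefix are hypothetical. Only the storage_dir setting comes from this PR.

```python
import tempfile
from pathlib import Path
from typing import Optional, Tuple


def resolve_storage_dir(storage_dir: Optional[str]) -> Tuple[Path, bool]:
    """Return (path, is_ephemeral) for the run's cache directory."""
    if storage_dir is not None:
        path = Path(storage_dir).expanduser()
        path.mkdir(parents=True, exist_ok=True)
        return path, False
    # mkdtemp creates a directory that is NOT removed automatically, unlike
    # tempfile.TemporaryDirectory, so the caller decides when to delete it.
    return Path(tempfile.mkdtemp(prefix="ilab-cache-")), True
```

Returning the is_ephemeral flag lets the caller warn the user, or copy checkpoints elsewhere, before deleting a fallback directory.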
Test Plan
(to be implemented)
@bbrowning @SLR722 @cdoern @franciscojavierarceo Here's the impl I was working on for multi-card tuning w/ backgroundable tasks. This is a WIP but I'm sharing kinda early.
The datasetio API call is broken right now because that got rebased while this PR was open; will go fix.
@JamesKunstle @cdoern I believe this PR is superseded by #2132, right? Is there anything not covered in the other PR that is included here?
This pull request has been automatically marked as stale because it has not had activity within 60 days. It will be automatically closed if no further activity occurs within 30 days.