[draft] huggingface-ilab full-precision model fine-tuning

Open JamesKunstle opened this issue 8 months ago • 3 comments

What does this PR do?

Closes #1427

NOTE: This PR description will be updated when the PR moves from DRAFT to OPEN.

Adds inline::huggingface-ilab to the /post_training providers. This provider makes some (somewhat temporary) assumptions about dataset shape and style and about the model / tokenizer source (e.g. all calls have to be compliant with huggingface APIs). The current impl includes some starlette "hacks" to make synchronous background tasks work correctly. It also does no queuing: new requests are blocked while a request is running.

Next step is to implement the training loop. This will be invoked by running torchrun <args> train.py <args> in a subprocess to implement FSDP training. The parent-most subprocess handle will live in the current_job object and will be referenced to get the running status of the job. psutil can be used in conjunction with the Process.pid to do some more involved monitoring. Training logs will be written to cache rather than stdout.
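
The launch-and-poll flow described above could look roughly like the sketch below. It is an illustration under assumptions, not this PR's code: `launch_training`, `job_status`, the status strings, and the log path are hypothetical names, and the command list stands in for the real `torchrun <args> train.py <args>` invocation.

```python
import subprocess
import sys
from pathlib import Path


def launch_training(cmd: list[str], log_path: Path) -> subprocess.Popen:
    """Start the training command, sending stdout and stderr to a log file.

    Hypothetical helper: in the PR the command would be the torchrun call.
    """
    log_path.parent.mkdir(parents=True, exist_ok=True)
    log_file = open(log_path, "w")
    try:
        # The child gets its own copy of the file handle at spawn time,
        # so the parent's copy can be closed once Popen returns.
        return subprocess.Popen(cmd, stdout=log_file, stderr=subprocess.STDOUT)
    finally:
        log_file.close()


def job_status(proc: subprocess.Popen) -> str:
    """Map the handle's exit code onto a coarse job status (assumed labels)."""
    code = proc.poll()  # None while the process is still running
    if code is None:
        return "running"
    return "completed" if code == 0 else "failed"
```

Keeping the `Popen` handle itself (rather than only its PID) lets the server ask the operating system about exactly the process it started, which is why the handle is the thing stored on the job object in this sketch.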

KNOWN BUGS:

  1. If a failure happens while the current_job is in a non-finalized state and the background tasks fail, the state has no TTL after which it can be scheduled over, so the endpoint will effectively be broken until the server is restarted.
  2. Lots of fields in incoming user configuration are ignored.
  3. Cache may be irrevocably destroyed at the end of a run if 'storage_dir' isn't set (because of tempdirs).
  4. PIDs can be recycled so monitoring the training job by PID is an unsustainable pattern.
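
One possible way to address the first bug is to give the job state a heartbeat and a TTL, so that a job stuck in a non-finalized state can be scheduled over once it goes stale. The sketch below assumes a single job slot, matching the no-queuing behaviour described earlier; `JobSlot`, its fields, and the status strings are hypothetical and do not come from this PR. A real server would also need a lock around these methods.

```python
import time
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class JobSlot:
    """Single-slot job tracker whose 'running' state expires after ttl_seconds."""

    ttl_seconds: float
    status: str = "idle"
    last_heartbeat: float = field(default_factory=time.monotonic)

    def heartbeat(self, now: Optional[float] = None) -> None:
        """Called periodically by the background task to prove it is alive."""
        self.last_heartbeat = time.monotonic() if now is None else now

    def try_acquire(self, now: Optional[float] = None) -> bool:
        """Claim the slot unless a job is running and has a fresh heartbeat."""
        now = time.monotonic() if now is None else now
        stale = (now - self.last_heartbeat) > self.ttl_seconds
        if self.status == "running" and not stale:
            return False  # a live job holds the slot; reject the new request
        self.status = "running"
        self.last_heartbeat = now
        return True

    def finish(self, ok: bool) -> None:
        """Move the job into a finalized state, freeing the slot."""
        self.status = "completed" if ok else "failed"
```

With this shape, a background task that dies without finalizing simply stops sending heartbeats, and the next request after the TTL elapses takes over the slot instead of being blocked until a restart.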

Test Plan

(to be implemented)

JamesKunstle avatar Mar 14 '25 07:03 JamesKunstle

@bbrowning @SLR722 @cdoern @franciscojavierarceo Here's the impl I was working on for multi-card tuning w/ backgroundable tasks. This is a WIP but I'm sharing kinda early.

JamesKunstle avatar Mar 14 '25 07:03 JamesKunstle

The datasetio API call is broken right now because that got rebased while this PR was open; will go fix.

JamesKunstle avatar Mar 21 '25 08:03 JamesKunstle

@JamesKunstle @cdoern I believe this PR is superseded by #2132 right? Is there anything not covered in the other PR that is included here?

booxter avatar May 14 '25 23:05 booxter

This pull request has been automatically marked as stale because it has not had activity within 60 days. It will be automatically closed if no further activity occurs within 30 days.

github-actions[bot] avatar Jul 14 '25 00:07 github-actions[bot]