[feat] Track Exact Fast-LLM Version in Training Outputs and wandb Logs

Open tscholak opened this issue 11 months ago • 2 comments

🧐 Problem Description

The version of Fast-LLM used for training is currently not easily accessible. While training job specs (e.g., Toolkit, Kubeflow) provide the image path/URL, references like ghcr.io/servicenow/fast-llm:latest don't indicate which commit or tagged version was used. This makes it difficult to trace back to the exact codebase version for a training run.

💡 Proposed Solution

Include a version string in the output directory of each training run and log it to wandb for visibility.

Details:

  • For tagged release commits, use the semantic version (e.g., v1.2.3).
  • For non-tagged commits, include the commit hash (e.g., abcdef1) and mark the build as "dirty" if uncommitted changes exist (e.g., abcdef1-dirty).
  • Example formats:
    • Tagged release: v1.2.3
    • Non-tagged commit: abcdef1
    • Modified tagged release: v1.2.3-dirty
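A minimal sketch of how such a string could be derived from a git checkout, assuming git is on the PATH and the code runs inside the repository. The helper names `format_version` and `git_version` are hypothetical and not part of Fast-LLM:

```python
import subprocess
from typing import Optional


def format_version(tag: Optional[str], commit: str, dirty: bool) -> str:
    """Build the version string from already-known git facts."""
    # Prefer the release tag; fall back to the short commit hash.
    base = tag if tag is not None else commit
    return f"{base}-dirty" if dirty else base


def git_version(repo_dir: str = ".") -> str:
    """Query git for the exact tag (if any), short commit hash, and dirty state."""

    def run(*args: str) -> subprocess.CompletedProcess:
        return subprocess.run(
            ["git", "-C", repo_dir, *args], capture_output=True, text=True
        )

    # `git describe --tags --exact-match` succeeds only when HEAD is a tagged commit.
    tagged = run("describe", "--tags", "--exact-match")
    tag = tagged.stdout.strip() if tagged.returncode == 0 else None
    commit = run("rev-parse", "--short", "HEAD").stdout.strip()
    # `git status --porcelain` prints one line per modified or untracked file,
    # so untracked files also count as "dirty" in this sketch.
    dirty = bool(run("status", "--porcelain").stdout.strip())
    return format_version(tag, commit, dirty)
```

Keeping the formatting separate from the git calls makes the format rules easy to check without a repository.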

This version string should:

  1. Be written to a file in the training output directory (e.g., fast_llm_version.txt).
  2. Be logged to wandb:
    • As part of the run configuration (wandb.init(config=...)).
    • As a standalone field (wandb.log).
    • Optionally, as a tag for easier filtering (wandb.init(tags=...)).
  3. Be shown in stdout logs.
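The three destinations above could be wired up roughly as follows. This is a sketch only: `record_version` and `wandb_init_kwargs` are hypothetical helpers, the config key `fast_llm_version` and the tag prefix are assumptions, and the wandb call is left as a comment because it needs the wandb package and a logged-in session.

```python
from pathlib import Path


def record_version(version: str, output_dir: str) -> Path:
    """Write the version string to the run directory and print it to stdout."""
    path = Path(output_dir) / "fast_llm_version.txt"
    path.parent.mkdir(parents=True, exist_ok=True)
    path.write_text(version + "\n")
    print(f"Fast-LLM version: {version}")
    return path


def wandb_init_kwargs(version: str, config: dict) -> dict:
    """Return keyword arguments for wandb.init carrying the version as config and tag."""
    return {
        # Merged into the run configuration (assumed key name).
        "config": {**config, "fast_llm_version": version},
        # Optional tag for filtering runs in the wandb UI (assumed prefix).
        "tags": [f"fast-llm-{version}"],
    }


# Usage (hypothetical project name):
#   import wandb
#   run = wandb.init(project="my-project", **wandb_init_kwargs(version, config))
```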

🔄 Alternatives Considered

  1. Using container image tags in job specs:
    • Problem: Tags like latest are ambiguous. Job descriptions may not persist (e.g., they could be garbage-collected or lost when a Kubernetes instance is decommissioned).

📈 Potential Benefits

  • Reproducibility: Trace models back to the exact version of Fast-LLM used.
  • Transparency: Facilitates auditing and debugging of training runs.
  • Usability: Avoids manual tracking of version information.

📝 Additional Context

This feature aligns with best practices for software versioning and reproducibility. Common formats like semantic versioning (semver) and commit hashes are widely supported and easy to interpret.

tscholak, Dec 31 '24 15:12

That's a good idea, but the git information is lost in the docker image. Do you have an idea on how to recover it?

I'd also show the version in stdout logs and make it match the version saved in the checkpoint. For non-release versions, I'd add the Fast-LLM version to the string, e.g. v1.2.3-abcdef1-dirty.

jlamypoirier, Jan 02 '25 20:01

> the git information is lost in the docker image. Do you have an idea on how to recover it?

We could modify the Docker build GitHub Action to tamper with fast_llm.version, so that the git information is written into the image at build time.
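On the Python side, reading back a value recorded at build time might look like the sketch below. The environment variable `FAST_LLM_GIT_VERSION`, the file path `/app/GIT_VERSION`, and the helper `baked_version` are all assumptions for illustration; how the build action would actually populate them is not settled here.

```python
import os
from pathlib import Path
from typing import Optional


def baked_version(
    env_var: str = "FAST_LLM_GIT_VERSION",  # hypothetical variable set by the image build
    version_file: str = "/app/GIT_VERSION",  # hypothetical file written by the image build
) -> Optional[str]:
    """Return the version recorded at image build time, or None if absent."""
    value = os.environ.get(env_var, "").strip()
    if value:
        return value
    path = Path(version_file)
    if path.is_file():
        # An empty file is treated the same as a missing one.
        return path.read_text().strip() or None
    return None
```

A caller could try this first and fall back to querying git directly when it returns None, which covers both containerized and local runs.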

tscholak, Jan 08 '25 14:01