🧐 Problem Description

The version of Fast-LLM used for training is currently not easily accessible. While training job specs (e.g., Toolkit, Kubeflow) provide the image path/URL, references like ghcr.io/servicenow/fast-llm:latest don't indicate which commit or tagged version was used. This makes it difficult to trace back to the exact codebase version for a training run.

💡 Proposed Solution

Include a version string in the output directory of each training run and log it to wandb for visibility.

Details:

- For tagged release commits, use the semantic version (e.g., v1.2.3).
- For non-tagged commits, use the abbreviated commit hash (e.g., abcdef1).
- In either case, append "-dirty" if the build has uncommitted changes (e.g., v1.2.3-dirty, abcdef1-dirty).

Example formats:
- Tagged release: v1.2.3
- Non-tagged commit: abcdef1
- Modified tagged release: v1.2.3-dirty
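A minimal Python sketch of how these formats could be derived from git, assuming the code runs inside a git checkout. The function names (format_version, get_git_version) are hypothetical and not part of Fast-LLM; the git commands used (rev-parse --short, describe --tags --exact-match, status --porcelain) are standard.

```python
import subprocess
from typing import Optional


def format_version(tag: Optional[str], commit: str, dirty: bool) -> str:
    """Combine git state into the proposed version string.

    tag:    tag pointing exactly at HEAD (e.g. "v1.2.3"), or None if untagged
    commit: abbreviated commit hash (e.g. "abcdef1")
    dirty:  True if tracked files have uncommitted changes
    """
    base = tag if tag is not None else commit
    return f"{base}-dirty" if dirty else base


def _git(*args: str, cwd: Optional[str] = None) -> str:
    # Run a git command and return its stripped stdout; raises CalledProcessError on failure.
    result = subprocess.run(
        ["git", *args], cwd=cwd, check=True, capture_output=True, text=True
    )
    return result.stdout.strip()


def get_git_version(repo_dir: Optional[str] = None) -> str:
    """Query the repository at repo_dir (default: the current directory)."""
    commit = _git("rev-parse", "--short", "HEAD", cwd=repo_dir)
    try:
        # Succeeds only if a tag points directly at HEAD (a release commit).
        tag: Optional[str] = _git("describe", "--tags", "--exact-match", "HEAD", cwd=repo_dir)
    except subprocess.CalledProcessError:
        tag = None
    # Untracked files are ignored here; drop --untracked-files=no to count them as "dirty" too.
    dirty = bool(_git("status", "--porcelain", "--untracked-files=no", cwd=repo_dir))
    return format_version(tag, commit, dirty)
```

git describe --tags --dirty --always is a close one-liner alternative, but for commits after a tag it prints something like v1.2.3-5-gabcdef1 rather than the bare hash proposed above, which is why the sketch checks for an exact tag match separately.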

This version string should:

- Be written to a file in the training output directory (e.g., fast_llm_version.txt).
- Be logged to wandb:
  - As part of the run configuration (wandb.init(config=...)).
  - As a standalone field (wandb.log).
  - Optionally, as a tag for easier filtering (wandb.init(tags=...)).
- Be shown in stdout logs.
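A rough sketch of the three destinations listed above, assuming the version string has already been computed and that run is the object returned by wandb.init(...) (or None when wandb is disabled). record_version is a hypothetical helper, not an existing Fast-LLM function; it only uses run.config.update, run.log and the run.tags property from the wandb run API.

```python
import logging
from pathlib import Path
from typing import Any, Optional

logger = logging.getLogger(__name__)

# File name taken from the proposal above.
VERSION_FILE_NAME = "fast_llm_version.txt"


def record_version(version: str, output_dir: Path, run: Optional[Any] = None) -> Path:
    """Log the version, write it to the output directory, and attach it to a wandb run."""
    # Whether this reaches stdout depends on the logging handlers configured for training.
    logger.info("Fast-LLM version: %s", version)

    output_dir.mkdir(parents=True, exist_ok=True)
    version_file = output_dir / VERSION_FILE_NAME
    version_file.write_text(version + "\n")

    if run is not None:
        run.config.update({"fast_llm_version": version})  # part of the run configuration
        run.log({"fast_llm_version": version})             # standalone logged field
        run.tags = tuple(run.tags or ()) + (version,)       # optional tag for filtering
    return version_file
```

If the version is known before the run starts, the config and tag can instead be passed directly as wandb.init(config={"fast_llm_version": version}, tags=[version]).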

🔄 Alternatives Considered

Using container image tags in job specs:
- Problem: Tags like latest are ambiguous. Job descriptions may not persist (e.g., they could be garbage-collected or lost when a Kubernetes instance is decommissioned).

📈 Potential Benefits

- Reproducibility: Trace models back to the exact version of Fast-LLM used.
- Transparency: Facilitates auditing and debugging of training runs.
- Usability: Avoids manual tracking of version information.

📝 Additional Context

This feature aligns with best practices for software versioning and reproducibility. Common formats like semantic versioning (semver) and commit hashes are widely supported and easy to interpret.

Relevant references:

- Semantic Versioning
- Git documentation on describing commits with tags (git describe)
- wandb documentation on logging custom metrics and tags

Dec 31 '24 15:12 tscholak

That's a good idea, but the git information is lost in the docker image. Do you have an idea on how to recover it?

I'd also show the version in stdout logs and make it match the version saved in the checkpoint. For non-release versions I'd add the Fast-LLM version to the string, e.g. v1.2.3-abcdef1-dirty.
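A small sketch of the format suggested here, where non-release builds keep the declared Fast-LLM version as a prefix. The function name and the is_release_commit flag are hypothetical; how the release version and the release-commit check are obtained is left open.

```python
def format_release_based_version(
    release: str, commit: str, is_release_commit: bool, dirty: bool
) -> str:
    """Build e.g. "v1.2.3", "v1.2.3-abcdef1" or "v1.2.3-abcdef1-dirty".

    release:           Fast-LLM version declared in the package (e.g. "v1.2.3")
    commit:            abbreviated commit hash (e.g. "abcdef1")
    is_release_commit: True if HEAD is the tagged commit for `release`
    dirty:             True if there are uncommitted changes
    """
    parts = [release] if is_release_commit else [release, commit]
    if dirty:
        parts.append("dirty")
    return "-".join(parts)
```

Keeping the release prefix means every string starts with the same version that is saved in the checkpoint, with the hash and "-dirty" only refining it.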

Jan 02 '25 20:01 jlamypoirier

> the git information is lost in the docker image. Do you have an idea on how to recover it?

We could modify the Docker build GitHub Action to set fast_llm.version at build time.
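One way this could look, sketched under assumptions: the GitHub Action computes the version with git while the checkout is still available (note that actions/checkout fetches a single commit without tags by default, so fetch-depth: 0 is needed for tags to be visible), writes it into the package before docker build, and the code reads it back at runtime. The stamp file name, the environment variable and both helper functions are hypothetical and do not exist in Fast-LLM.

```python
import os
from pathlib import Path
from typing import Mapping, Optional

# Hypothetical names, chosen for illustration only.
STAMP_FILE_NAME = "_build_version.txt"
VERSION_ENV_VAR = "FAST_LLM_BUILD_VERSION"


def stamp_build_version(package_dir: Path, version: str) -> Path:
    """CI step: write the version (computed while .git is available) into the package."""
    stamp = package_dir / STAMP_FILE_NAME
    stamp.write_text(version.strip() + "\n")
    return stamp


def read_build_version(
    package_dir: Path, env: Optional[Mapping[str, str]] = None, fallback: str = "unknown"
) -> str:
    """Runtime: prefer an explicit env override, then the stamp file, then `fallback`."""
    env = os.environ if env is None else env
    override = env.get(VERSION_ENV_VAR, "").strip()
    if override:
        return override
    try:
        stamped = (package_dir / STAMP_FILE_NAME).read_text().strip()
    except FileNotFoundError:
        # No stamp, e.g. a local checkout; the git-based lookup could be tried here instead.
        return fallback
    return stamped or fallback
```

The environment-variable override would let a Dockerfile set the version through a build argument without modifying the source tree; the stamp file keeps working even when the variable is not propagated.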

Jan 08 '25 14:01 tscholak


[feat] Track Exact Fast-LLM Version in Training Outputs and wandb Logs
