
Toolkit for fine-tuning, ablating and unit-testing open-source LLMs.

Results: 27 LLM-Finetuning-Toolkit issues

**Is your feature request related to a problem? Please describe.**
I'm working on a problem that requires me to split my data in a specific way (based on dates). Right...

enhancement
good first issue
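
One way such a date-based split could look, as a minimal sketch: it assumes the data sits in a pandas DataFrame, and the column name and cutoff date below are purely illustrative, not toolkit config keys.

```python
import pandas as pd

def split_by_date(df: pd.DataFrame, date_column: str, cutoff: str):
    """Split a DataFrame into train/test sets at a date cutoff:
    rows strictly before `cutoff` go to train, the rest to test."""
    dates = pd.to_datetime(df[date_column])
    cutoff_ts = pd.Timestamp(cutoff)
    return df[dates < cutoff_ts], df[dates >= cutoff_ts]

# Hypothetical usage -- column name and cutoff are placeholders:
# train_df, test_df = split_by_date(df, date_column="created_at", cutoff="2023-06-01")
```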

**Describe the bug**
I'm trying to run this toolkit in a Colab notebook with a T4 GPU and ran into errors. In order to get it working, I needed to turn bf16...
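
For context, the T4 (compute capability 7.5) has no bf16 support, which is why bf16 has to be turned off there. A minimal sketch of how the precision flag could be chosen automatically, assuming a PyTorch environment; the flag names below are illustrative, not the toolkit's actual config keys.

```python
import torch

# bf16 requires Ampere-or-newer GPUs; on a T4 this check returns False,
# so training would fall back to fp16 instead of erroring out.
use_bf16 = torch.cuda.is_available() and torch.cuda.is_bf16_supported()
use_fp16 = torch.cuda.is_available() and not use_bf16

print(f"bf16={use_bf16}, fp16={use_fp16}")
# These booleans could then be forwarded to the training arguments,
# e.g. bf16=use_bf16, fp16=use_fp16.
```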

Ensure all releases, style checks, and unit tests can be run via CI, blocking any PRs that fail CI.
- For Docker packages, use: https://github.com/orgs/georgian-io/packages
- For PyPI packages, use: https://pypi.org/

enhancement

It's much better to publish documentation on a dedicated static hosting solution, e.g. GitHub Pages (https://docs.github.com/en/pages/getting-started-with-github-pages) or MkDocs with Netlify (https://medium.com/swlh/publish-a-static-website-in-a-day-with-mkdocs-and-netlify-3cc076d0efaf).

documentation

Ensure that we include a Makefile containing all the necessary development commands, such as running tests, performing releases, and executing style checks, among others. For a great example,...

If I have:
- test_split: 0.1
- train_split: 0.8

Maybe we can derive `calc_val_split = 1 - 0.1 - 0.8 = 0.1` and use it as the validation split. Maybe also apply something like `max(calc_val_split, 0.05)` to prevent the val split from being...
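
A minimal sketch of that arithmetic; the function and argument names are hypothetical, only the leftover-fraction formula and the `max(..., 0.05)` floor come from the issue.

```python
def derive_val_split(train_split: float, test_split: float, min_val: float = 0.05) -> float:
    """Whatever remains after train and test becomes validation, floored at `min_val`."""
    calc_val_split = 1.0 - train_split - test_split
    return max(calc_val_split, min_val)

# With the values from the issue: 1 - 0.8 - 0.1 = 0.1
assert abs(derive_val_split(train_split=0.8, test_split=0.1) - 0.1) < 1e-9
```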

There is no good, easy-to-start, end-to-end distributed training example on the web. Plus, there are so many ways of doing this: via raw [PyTorch](https://github.com/pytorch/examples/tree/main/distributed/ddp-tutorial-series), via [Ray Train](https://docs.ray.io/en/latest/train/train.html), via [TorchX](https://github.com/pytorch/torchx),...

enhancement
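
As a point of reference for the raw-PyTorch route, here is a minimal DDP skeleton; the `torch.nn.Linear` model is a stand-in and the script layout is only a sketch, not the toolkit's planned design. It would be launched with `torchrun --nproc_per_node=<num_gpus> train_ddp.py`.

```python
import os
import torch
import torch.distributed as dist
from torch.nn.parallel import DistributedDataParallel as DDP

def main():
    # torchrun sets RANK, LOCAL_RANK and WORLD_SIZE for every worker process.
    dist.init_process_group(backend="nccl")
    local_rank = int(os.environ["LOCAL_RANK"])
    torch.cuda.set_device(local_rank)

    model = torch.nn.Linear(10, 10).cuda(local_rank)  # stand-in for the real model
    model = DDP(model, device_ids=[local_rank])

    # ... training loop: DistributedSampler for the dataloader, loss.backward(), optimizer.step() ...

    dist.destroy_process_group()

if __name__ == "__main__":
    main()
```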

Reference:
- https://coverage.readthedocs.io/en/7.4.4/
- https://pypi.org/project/pytest-cov/
- https://github.com/marketplace/actions/code-coverage-summary
- https://github.com/marketplace/actions/code-coverage-report-difference

**Is your feature request related to a problem? Please describe.**
- The dataset creation table display always displays all columns of the dataset, instead of only the ones needed by `prompt` and `prompt_stub`
- ...

enhancement
good first issue
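
One possible way to restrict the displayed columns, assuming the prompts use Python-style `{placeholder}` templates; the example templates below are made up for illustration, not taken from the toolkit's config.

```python
from string import Formatter

def columns_used_by_prompt(prompt: str, prompt_stub: str) -> set:
    """Collect the column names referenced as {placeholders} in the prompt templates."""
    fields = set()
    for template in (prompt, prompt_stub):
        for _, field_name, _, _ in Formatter().parse(template):
            if field_name:
                fields.add(field_name)
    return fields

# Hypothetical templates:
print(columns_used_by_prompt("Summarize the following text: {text}", "{summary}"))
# -> {'text', 'summary'}
# The dataset creation table could then show only these columns.
```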

**Describe the bug**
At dataset creation, the generated dataset always uses the cached version despite changes to the file.

**To Reproduce**
1. Run `toolkit.py`
2. Ctrl-C
3. Add a line...

bug
good first issue
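
One common way to avoid serving a stale cache after the file changes is to key the cache on the file's contents rather than its path; this is only a sketch of that idea, not the toolkit's current caching logic.

```python
import hashlib
from pathlib import Path

def dataset_cache_key(file_path: str) -> str:
    """Key the dataset cache on the file contents, so any edit to the file
    produces a new key and forces a rebuild instead of reusing the stale cache."""
    digest = hashlib.sha256(Path(file_path).read_bytes()).hexdigest()
    return f"{Path(file_path).name}-{digest[:12]}"
```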