
Data set caching does not seem to be implemented correctly.

Open PhilipMay opened this issue 1 year ago • 1 comment

Please check that this issue hasn't been reported before.

  • [X] I searched previous Bug Reports and didn't find any similar reports.

Expected Behavior

The dataset caching does not seem to be implemented correctly. I have set dataset_prepared_path. First I trained a tiny-llama model and got a loss of about 4. Then I trained a Mistral model on the same dataset, and the loss was extremely high (about 10). After I deleted the files in dataset_prepared_path, the loss returned to the normal range (about 4).

Current behaviour

A dataset cached with one model's tokenizer appears to be loaded for a different tokenizer. Manual cache cleaning is needed.

Steps to reproduce

1. Set dataset_prepared_path in the config (a minimal sketch follows this list).
2. Train a tiny-llama model; the loss is normal (about 4).
3. Train a Mistral model on the same dataset without clearing dataset_prepared_path; the loss is abnormally high (about 10).
4. Delete the files in dataset_prepared_path and train again; the loss returns to the normal range (about 4).
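
A minimal, hypothetical config excerpt illustrating the setup (placeholder model and dataset names, not the reporter's actual file; the point is only that dataset_prepared_path is reused unchanged between the two runs):

```yaml
# Hypothetical excerpt; run 1 uses a tiny-llama base model,
# run 2 swaps in a Mistral model but keeps everything else identical.
base_model: TinyLlama/TinyLlama-1.1B-Chat-v1.0
datasets:
  - path: my_dataset.jsonl   # placeholder dataset
    type: alpaca
dataset_prepared_path: ./last_run_prepared   # same cache dir for both runs
```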

Config yaml

No response

Possible solution

No response

Which Operating Systems are you using?

  • [X] Linux
  • [ ] macOS
  • [ ] Windows

Python Version

3.10

axolotl branch-commit

main

Acknowledgements

  • [X] My issue title is concise, descriptive, and in title casing.
  • [X] I have searched the existing issues to make sure this bug has not been reported yet.
  • [X] I am using the latest version of axolotl.
  • [X] I have provided enough information for the maintainers to reproduce and diagnose the issue.

PhilipMay avatar Feb 06 '24 20:02 PhilipMay

Hey! I think we're aware of this issue. It may be because the cache key is derived from the tokenizer class name instead of the tokenizer itself:

https://github.com/OpenAccess-AI-Collective/axolotl/blob/5a5d47458d9aaf7ead798d15291ba3d9bef785c5/src/axolotl/utils/data.py#L137-L158
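
For illustration, here is a minimal sketch of why that fails; this is not axolotl's actual code, and FakeLlamaTokenizer, cache_key_buggy, and cache_key_fixed are made-up names:

```python
# Sketch: keying the prepared-dataset cache on the tokenizer *class name*
# collides across models that share a tokenizer class.
from hashlib import md5


class FakeLlamaTokenizer:
    """Stand-in for a real tokenizer; only the fields used below."""

    def __init__(self, name_or_path: str, vocab_size: int):
        self.name_or_path = name_or_path
        self.vocab_size = vocab_size


def cache_key_buggy(dataset_name: str, tokenizer) -> str:
    # Only the class name goes into the hash, so two different models
    # sharing a tokenizer class map to the SAME cache entry.
    return md5(f"{dataset_name}|{type(tokenizer).__name__}".encode()).hexdigest()


def cache_key_fixed(dataset_name: str, tokenizer) -> str:
    # Including something that identifies the actual vocabulary (here the
    # model path and vocab size, as an illustration) separates the entries.
    ident = f"{tokenizer.name_or_path}|{tokenizer.vocab_size}"
    return md5(f"{dataset_name}|{ident}".encode()).hexdigest()


tiny = FakeLlamaTokenizer("TinyLlama/TinyLlama-1.1B", 32000)
mistral = FakeLlamaTokenizer("mistralai/Mistral-7B-v0.1", 32000)

assert cache_key_buggy("my_data", tiny) == cache_key_buggy("my_data", mistral)  # collision
assert cache_key_fixed("my_data", tiny) != cache_key_fixed("my_data", mistral)  # distinct
```

With a key like the buggy one, training tiny-llama and then Mistral with the same dataset_prepared_path silently reuses the tiny-llama tokenization for Mistral, which would explain the abnormal loss reported above.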

I've made PR #1298, which fixes this.

NanoCode012 avatar Feb 17 '24 03:02 NanoCode012