
[fix][1760] Added fix for the missing `context` key issue in dolly!

Open pytholic opened this issue 1 year ago • 1 comments

What is the type of this PR?

  • [ ] Refactor (refactored code that neither fixes a bug nor adds a feature)
  • [ ] Feature (non-breaking change which adds functionality)
  • [x] Bug fix (non-breaking change which fixes an issue)
  • [ ] Breaking change (fix or feature that would cause existing functionality to not work as expected)
  • [ ] Test (including new or correcting previous tests)
  • [ ] Style (changes to code formatting and styling)
  • [ ] Optimization (performance improvements)
  • [ ] Documentation Update (updates to the documentation, i.e. README, comments, docstrings, etc.)
  • [ ] Revert (reverts a previous commit)

Motivation and Context

To fix the issue where a `context` key error was thrown by the Dolly dataloader.

Modifications

The following changes were made in this PR:

  • Updated _transform in dolly.py so that samples without a context key no longer raise an error (a sketch follows this list)
  • Added a test for it
  • Updated the printed output slightly
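
For illustration, here is a minimal sketch of the kind of change, assuming the transform previously indexed the context field directly. This is a hypothetical sketch, not the exact dolly.py code:

# Hypothetical sketch: fall back to an empty string when a sample has no
# "context" key instead of raising a KeyError.
def _transform(item: dict) -> dict:
    item["input"] = item.get("context", "")  # previously item["context"], which raised KeyError
    item["output"] = item.get("response", "")
    return item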

How Has This Been Tested?

I added the following test to simulate the original issue (a rough sketch follows the list below). The code has been tested on macOS.

  • tests/data/test_dolly.py::test_dolly_missing_keys
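
Roughly, the test feeds the transform a sample without a context key and checks that it no longer raises. A hedged sketch, reusing the hypothetical _transform from the sketch above (the real test in tests/data/test_dolly.py may differ):

# Hypothetical sketch of the regression test; the actual test may build the
# full Dolly datamodule instead of calling the transform directly.
def test_dolly_missing_keys():
    sample = {"instruction": "Summarize the text.", "response": "A short summary."}
    transformed = _transform(sample)  # must not raise KeyError
    assert transformed["input"] == ""  # missing context falls back to an empty string
    assert transformed["output"] == "A short summary."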

Previous behavior: (screenshot)

After the fix: (screenshot)

Checklist:

  • [ ] My code follows the style guidelines of this project
  • [x] I have performed a self-review of my code
  • [x] I have commented on my code, particularly in hard-to-understand areas
  • [ ] I have made corresponding changes to the documentation
  • [x] My changes generate no new warnings
  • [x] I have added tests that prove my fix is effective or that my feature works
  • [ ] New and existing unit tests pass locally with my changes
  • [ ] Any dependent changes have been merged and published in downstream modules

Notes

  1. I noticed that some tests related to the tokenizer are failing. Note that these are not related to my updates (CC: @rasbt). (screenshot)

  2. I updated the printing logs a bit. The instruction was being printed twice, which wasn't ideal, and I have also added some newline characters. I applied it only to lora.py; if desirable, it can be applied to other scripts as well. A toy illustration follows the before/after images below.

Before: (screenshot)

After: (screenshot)
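
For reference, a toy illustration of the kind of formatting change described, hypothetical and not the actual lora.py code:

# Toy illustration: print the instruction once and separate the sections with
# blank lines instead of repeating the instruction inside the generated output.
instruction = "Recommend a movie to watch on the weekend."
response = "Sure! Here is a recommendation: ..."
print(f"Instruction:\n{instruction}\n")
print(f"Response:\n{response}\n")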

pytholic avatar Oct 02 '24 18:10 pytholic

Thanks for the PR!

I noticed that some tests related to the tokenizer are failing. Note that these are not related to my updates (CC: @rasbt ).

Do you mean the tests in the litGPT (+Thunder) workflow? You can safely ignore them for now. They are related to some dev version issues that are not directly related to LitGPT.

rasbt avatar Oct 03 '24 18:10 rasbt

Do you mean the tests in the litGPT (+Thunder) workflow? You can safely ignore them for now. They are related to some dev version issues that are not directly related to LitGPT.

@rasbt I was referring to the following tests in cpu-tests:

tests/test_tokenizer.py::test_tokenizer_against_hf[Llama-3.2-1B] XFAIL   [ 88%]
tests/test_tokenizer.py::test_tokenizer_against_hf[Llama-3.2-1B-Instruct] XFAIL [ 88%]
tests/test_tokenizer.py::test_tokenizer_against_hf[Llama-3.2-3B] XFAIL   [ 88%]
tests/test_tokenizer.py::test_tokenizer_against_hf[Llama-3.2-3B-Instruct] XFAIL [ 88%]

I just realized they are xfail in CI, but for some reason, they failed instead of xfailed on my setup. I am not sure why.

pytholic avatar Oct 04 '24 05:10 pytholic

@rasbt I am guessing in test_tokenizer.py, xfail is for cases where models are gated?

pytholic avatar Oct 04 '24 05:10 pytholic

I am guessing in test_tokenizer.py, xfail is for cases where models are gated?

Correct.

I just realized they are xfail in CI, but for some reason, they failed instead of xfailed on my setup. I am not sure why.

When you create a PR from a forked repo, LitGPT's HF_TOKEN is not shared, and thus these tests are just skipped. But when you run the tests on your machine and have exported this token (and you have access to all the gated repos), then these tests are executed.

If you merged the PR, then another round of CI actions would be executed, but this time with HF_TOKEN exported, and you'd get a failure.
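
To illustrate the pattern (a hypothetical sketch, not the actual test_tokenizer.py code): a test for a gated checkpoint can be conditionally marked as an expected failure when no token is available, while a run with HF_TOKEN exported executes it for real.

import os
import pytest

# Hypothetical illustration of conditionally expecting failure for gated
# checkpoints: without an exported HF_TOKEN the test is marked xfail; with a
# token (and granted access to the repo) it runs normally.
@pytest.mark.xfail(
    not os.getenv("HF_TOKEN"),
    reason="gated Hugging Face repo requires an exported HF_TOKEN",
)
def test_tokenizer_against_hf():
    ...  # would download the gated tokenizer and compare outputs here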

@rasbt It's important for every PR from the community that affects tokenizers (or gated models) in one way or another to:

  • run tests locally with changes from this branch
  • or create a copy-branch in the repo, not in the fork

Andrei-Aksionov avatar Oct 04 '24 08:10 Andrei-Aksionov

@Andrei-Aksionov Sure, good point.

Looks good!

(screenshot)

rasbt avatar Oct 04 '24 15:10 rasbt

Thanks for the PR. This looks great overall. (I would just undo the formatting changes please.)

@rasbt Thanks for checking it out. Sorry for the formatting part, my bad. I am taking an emergency flight atm. I will check the PR feedback and take care of it as soon as I find some time.

pytholic avatar Oct 04 '24 15:10 pytholic

No worries and no rush. I hope all is well, and safe travels.

rasbt avatar Oct 04 '24 15:10 rasbt

I took care of the style thing 😊

Thanks a lot @rasbt 🙏🏻

pytholic avatar Oct 04 '24 16:10 pytholic

@rasbt @Andrei-Aksionov I was wondering: if we have access and can run the gated tests locally, then ideally those tests should pass. I ran into issues with the Llama 3.2 tokenizer tests. Do you guys think we need to investigate this?

pytholic avatar Oct 04 '24 16:10 pytholic

@rasbt @Andrei-Aksionov I was wondering: if we have access and can run the gated tests locally, then ideally those tests should pass. I ran into issues with the Llama 3.2 tokenizer tests. Do you guys think we need to investigate this?

Thanks for bringing it up, I'm fixing it here: #1772

rasbt avatar Oct 04 '24 17:10 rasbt