
[RFC][fix test] missing .item() in frozen nf4 test

Open skcoirz opened this issue 10 months ago • 5 comments

Context

  • As titled: the test is missing a call to .item().
  • One other observation: the error from the frozen NF4 linear is larger than the error from the bnb linear (see the screenshot below).

[Screenshot: error comparison, 2024-03-28 2:28 PM]

Changelog

  • Updated the nf4 test to call .item() (a sketch of the change follows the test plan below).

Test plan

  • pytest tests/torchtune/modules/low_precision/test_nf4_linear.py
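
For reference, here is a minimal sketch of what the updated assertion amounts to. The variable names (err, err_bnb) follow the discussion in this thread and are assumptions; the exact code in tests/torchtune/modules/low_precision/test_nf4_linear.py may differ.

```python
import torch

# Hypothetical per-layer reconstruction errors, stored as 0-dim tensors
# (e.g. mean absolute difference vs. a bf16 reference linear).
err = torch.tensor(0.011)      # frozen NF4 linear
err_bnb = torch.tensor(0.012)  # bnb Linear4bit

# .item() converts a 0-dim tensor into a plain Python float, so the
# comparison below is an ordinary scalar comparison rather than a tensor op.
assert err.item() < err_bnb.item()
```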

skcoirz avatar Mar 28 '24 21:03 skcoirz

:link: Helpful Links

:test_tube: See artifacts and rendered test results at hud.pytorch.org/pr/pytorch/torchtune/609

Note: Links to docs will display an error until the docs builds have been completed.

This comment was automatically generated by Dr. CI and updates every 15 minutes.

pytorch-bot[bot] avatar Mar 28 '24 21:03 pytorch-bot[bot]

Thanks for your contribution @skcoirz! This was definitely an oversight on my part. That said, do we understand why the current test (err.item() < err_bnb) passes in CI?

rohan-varma avatar Mar 28 '24 21:03 rohan-varma

Good question. Let me dig into it.

skcoirz avatar Mar 28 '24 21:03 skcoirz

@rohan-varma, I found the cause.

  • The unit test does catch this issue (confirmed on my local server).
  • But in CI, the two accuracy tests are skipped because no CUDA device is available (confirmed in the log linked below; a sketch of this kind of gating follows).

It looks like integration tests run on AWS, where I would assume GPUs are available. Should we mark these as integration_tests?

[Screenshot of the CI log, 2024-03-28 2:54 PM] https://github.com/pytorch/torchtune/actions/runs/8472914261/job/23216077574
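
For context, this is the kind of CUDA gate that would cause the assertions to be reported as skipped on a CPU-only runner. Treat it as a sketch: the actual decorator and test name in test_nf4_linear.py may differ.

```python
import pytest
import torch

@pytest.mark.skipif(not torch.cuda.is_available(), reason="requires CUDA")
def test_nf4_reconstruction_vs_bnb():
    # On a CPU-only GitHub-hosted runner this test is collected but skipped,
    # so a broken assertion inside it is never actually executed.
    ...
```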

skcoirz avatar Mar 28 '24 21:03 skcoirz

@skcoirz thanks for debugging this. Yeah, this is a weird edge case. We can run this GPU test in CI, but right now our GPU CI jobs are meant more for E2E runs of fine-tuning scripts. So we can mark this test as an integration test via @pytest.mark.integration_test like you suggested, but it is a bit unintuitive (e.g. it'll be our only integration test outside of tests/recipes). That said, we do need a way to catch this. Out of curiosity, @rohan-varma, why is this a CUDA test anyway? Is NF4 only supported on CUDA? Or is this more of a bnb-side thing?

Edit: alternatively, we could just modify our multi-GPU CI job to run all unit tests and recipe tests. We do have some other distributed-type unit tests, so while it might be a bit of overkill, it's probably not a terrible idea to do this anyway.
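
A minimal sketch of the marker approach, assuming the marker is registered the standard pytest way; torchtune's actual conftest/CI wiring may differ.

```python
# conftest.py -- register the marker so pytest doesn't warn about an
# unknown mark (the repo may already do this elsewhere).
def pytest_configure(config):
    config.addinivalue_line(
        "markers",
        "integration_test: GPU/E2E tests excluded from the default unit-test run",
    )


# test_nf4_linear.py -- hypothetical test, gated both ways.
import pytest
import torch

@pytest.mark.integration_test
@pytest.mark.skipif(not torch.cuda.is_available(), reason="requires CUDA")
def test_nf4_reconstruction_vs_bnb():
    ...
```

The GPU job would then select these tests with pytest -m integration_test, while the CPU unit-test job could run pytest -m "not integration_test" (or whatever marker expression the repo's CI already uses).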

ebsmothers avatar Mar 28 '24 23:03 ebsmothers

We don't have NF4Linear in torchtune anymore, so this PR is no longer relevant.

ebsmothers avatar May 21 '24 14:05 ebsmothers