torchtune
[RFC][fix test] missing .item() in frozen nf4 test
Context
- As titled: the test is missing a `.item()` call.
- While fixing this, I also noticed that the error from the frozen NF4 linear is larger than the error from the bnb linear (see the sketch below).
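A minimal sketch of the kind of assertion the fix targets, with made-up values; the tensor names mirror those quoted in the review below, and the real assertions live in `tests/torchtune/modules/low_precision/test_nf4_linear.py`:

```python
import torch

err = torch.tensor(1e-3)      # error of the frozen NF4 linear vs. a reference output (made-up value)
err_bnb = torch.tensor(2e-3)  # error of the bnb linear vs. the same reference (made-up value)

# Compare plain Python scalars instead of mixing a float with a 0-dim tensor.
assert err.item() < err_bnb.item()
```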
Changelog
- updated nf4 test
Test plan
- pytest tests/torchtune/modules/low_precision/test_nf4_linear.py
Thanks for your contribution @skcoirz! This was definitely an oversight on my part. Though do we understand why the current test (`err.item() < err_bnb`) is passing in CI?
Good question. Let me dig into it.
@rohan-varma, I found the cause.
- The unit test is able to detect this bug (confirmed on my local server).
- But in CI, the two accuracy tests are skipped because there is no CUDA (confirmed in the logs below; see the sketch after the link).
It looks like the integration tests are run on AWS. I would assume we have GPUs there? Should we mark these as integration tests?
https://github.com/pytorch/torchtune/actions/runs/8472914261/job/23216077574
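A minimal sketch of why the accuracy tests never run on the CPU-only CI runners, assuming they are gated on CUDA availability (the test name here is illustrative, not the exact one in the file):

```python
import pytest
import torch

@pytest.mark.skipif(not torch.cuda.is_available(), reason="requires CUDA")
def test_frozen_nf4_linear_accuracy():
    # On a GitHub-hosted runner without a GPU, pytest reports this test as
    # skipped, so a broken assertion inside the body is never executed.
    ...
```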
@skcoirz thanks for debugging this. Yeah, this is a weird edge case. We can run this GPU test in CI, but right now those runs are meant more for E2E runs of the fine-tuning scripts. So we can mark this test as an integration test via `@pytest.mark.integration_test` like you suggested, but it is a bit unintuitive (e.g., it would be our only integration test outside of tests/recipes). That said, we do need a way to catch this. Out of curiosity, @rohan-varma, why is this a CUDA test anyway? Is NF4 only supported on CUDA? Or is this more of a bnb-side thing?
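A minimal sketch of the proposed marking, assuming an `integration_test` marker is registered in the project's pytest configuration as it is for the recipe tests (test name is illustrative):

```python
import pytest

@pytest.mark.integration_test
def test_frozen_nf4_linear_accuracy():
    # Tests carrying this marker would be picked up by the GPU-enabled
    # integration job instead of the default CPU-only unit-test job.
    ...
```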
Edit: Alternatively, we could just modify our multi-GPU CI job to run all unit tests and recipe tests. We do have some other distributed-type unit tests, so while it might be a bit of overkill, it's probably not a terrible idea to do this anyway.
We don't have NF4Linear in torchtune anymore, so this PR is no longer relevant.