[Doctests] Refactor doctests + add CI
What does this PR do?
Wow! The fix is crazy simple, but I racked my brain finding the cleanest way to include it.
We are keeping `pytest --doctest-modules` 🥳 Basically it just came down to:
- change the default regex compilation of `doctest.DocTestParser()`
- rewrite the private `_pytest.doctest` utilities to use this parser (a rough sketch of both pieces follows)!
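For context, here is a minimal sketch, assuming only the standard library, of what those two pieces could look like (this is not the PR's actual code, and the class name is made up): a `DocTestParser` subclass with a tweaked example regex, plus the `DocTestFinder` call the rewritten pytest collectors would construct.

```python
import doctest
import re


class MdxFriendlyDocTestParser(doctest.DocTestParser):
    # Same regex as the stdlib parser, with one extra negative lookahead so a
    # closing markdown code fence in an .mdx codeblock is not swallowed into
    # the `want` (expected output) of the example above it.
    _EXAMPLE_RE = re.compile(
        r"""
        # Source consists of a PS1 line followed by zero or more PS2 lines.
        (?P<source>
            (?:^(?P<indent> [ ]*) >>>    .*)    # PS1 line
            (?:\n           [ ]*  \.\.\. .*)*)  # PS2 lines
        \n?
        # Want consists of any non-blank lines that do not start with PS1
        # and are not a markdown fence.
        (?P<want> (?:(?![ ]*$)        # not a blank line
                     (?![ ]*>>>)      # not a line starting with PS1
                     (?![ ]*`{3})     # not a markdown code fence
                     .+$\n?           # but any other line
                  )*)
        """,
        re.MULTILINE | re.VERBOSE,
    )


# doctest.DocTestFinder accepts a custom parser, so the rewrite of the private
# _pytest.doctest collectors essentially boils down to building the finder
# with this parser instead of the default one.
finder = doctest.DocTestFinder(parser=MdxFriendlyDocTestParser())
```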
TODOs:
- [x] change the parser
- [x] add a CI job
- [x] filter out tests that can't run on CUDA! This is pretty important
- [ ] update the doc
- [x] find a way to default `--doctest-glob`, but that's a small nit
- [ ] add tests, mostly to test the parser
- [ ] add a check that a file is doctested if it has a docstring
Local tests run; I now get this strange error:
_____ ERROR collecting src/transformers/models/whisper/modeling_whisper.py _____
import file mismatch:
imported module 'transformers.models.whisper.modeling_whisper' has this __file__ attribute:
/home/circleci/.pyenv/versions/3.8.12/lib/python3.8/site-packages/transformers/models/whisper/modeling_whisper.py
which is not the same as the test file we want to collect:
/home/circleci/transformers/src/transformers/models/whisper/modeling_whisper.py
HINT: remove __pycache__ / .pyc files and/or use a unique basename for your test file modules
@ArthurZucker Regarding the error you mentioned in the above comment, could you provide the command you used to launch it?
Also, for this PR to be merged, two things we should check are:
- For a single modeling file, how long it will take to run the doctest against it on CircleCI, and whether it will fit in the available memory.
  - (we should probably run the doctest against a few existing models)
- There should NOT be multiple modeling files included in `test_to_run` for doctest. This PR currently checks if a file is in `utils/documentation_tests.txt`, but that file doesn't contain all existing modeling files, if I remember correctly.
Regarding:

> There should NOT be multiple modeling files included in `test_to_run` for doctest. This PR currently checks if a file is in `utils/documentation_tests.txt`, but that file doesn't contain all existing modeling files, if I remember correctly.

Actually, the CI runs the doctests for all files in the diff that end in `.py` or `.mdx`. This is prone to change! Fully open to recommendations.
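For illustration, a rough sketch of that diff-based selection (the helper name and base ref are assumptions, not the PR's actual utility):

```python
import subprocess
from typing import List


def files_to_doctest(base_ref: str = "origin/main") -> List[str]:
    """Return the .py and .mdx files touched in the diff against `base_ref`."""
    changed = subprocess.check_output(
        ["git", "diff", "--name-only", base_ref], text=True
    ).splitlines()
    return [path for path in changed if path.endswith((".py", ".mdx"))]
```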
For slow tests, I skip all codeblocks that include "cuda" in them; we can refine the filter.
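As a sketch of what such a filter could look like at the parser level (the class name is hypothetical, and the PR's real skip goes through pytest and prints the skip messages shown in the log below): when `SKIP_CUDA_DOCTEST=1`, any example whose source mentions "cuda" is marked with the doctest `SKIP` option, so only that codeblock is skipped while the rest of the file still runs.

```python
import doctest
import os


class CudaSkippingParser(doctest.DocTestParser):
    def parse(self, string, name="<string>"):
        chunks = super().parse(string, name)
        if os.environ.get("SKIP_CUDA_DOCTEST", "0") == "1":
            for chunk in chunks:
                # parse() returns interleaved text strings and doctest.Example objects
                if isinstance(chunk, doctest.Example) and "cuda" in chunk.source.lower():
                    # SKIP makes the doctest runner ignore this example only
                    chunk.options[doctest.SKIP] = True
        return chunks
```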
CUDA tests are properly skipped:
(11 durations < 0.005s hidden. Use -vv to show these durations.)
=========================== short test summary info ===========================
PASSED docs/source/en/testing.mdx::testing.mdx
PASSED docs/source/en/testing.mdx::testing.mdx
PASSED src/transformers/models/whisper/modeling_whisper.py::transformers.models.whisper.modeling_whisper.WhisperForAudioClassification.forward
PASSED src/transformers/models/whisper/modeling_whisper.py::transformers.models.whisper.modeling_whisper.WhisperForAudioClassification.forward
PASSED src/transformers/models/whisper/modeling_whisper.py::transformers.models.whisper.modeling_whisper.WhisperForAudioClassification.forward
PASSED src/transformers/models/whisper/modeling_whisper.py::transformers.models.whisper.modeling_whisper.WhisperForConditionalGeneration.forward
PASSED src/transformers/models/whisper/modeling_whisper.py::transformers.models.whisper.modeling_whisper.WhisperForConditionalGeneration.forward
PASSED src/transformers/models/whisper/modeling_whisper.py::transformers.models.whisper.modeling_whisper.WhisperForConditionalGeneration.forward
PASSED src/transformers/models/whisper/modeling_whisper.py::transformers.models.whisper.modeling_whisper.WhisperModel.forward
PASSED src/transformers/models/whisper/modeling_whisper.py::transformers.models.whisper.modeling_whisper.WhisperModel.forward
PASSED src/transformers/models/whisper/modeling_whisper.py::transformers.models.whisper.modeling_whisper.WhisperModel.forward
PASSED docs/source/en/model_doc/wav2vec2.mdx::wav2vec2.mdx
PASSED docs/source/en/model_doc/wav2vec2.mdx::wav2vec2.mdx
SKIPPED [1] <doctest testing.mdx[0]>:1: Codeblock line xxx in xxxx has cuda involved. Skipping
SKIPPED [1] <doctest wav2vec2.mdx[0]>:1: Codeblock line xxx in xxxx has cuda involved. Skipping
Results (19.65s):
3 passed
2 skipped
(py39) arthur_huggingface_co@arthur-gpu-2:~/transformers$
The warning message will be improved.
Looked at this PR and played a bit with it: so far so good 👍
One thing I found:
SKIP_CUDA_DOCTEST=1 pytest -v --doctest-modules --doctest-glob="*.mdx" docs/source/en/model_doc/longt5.mdx
The doctest runs, while I assumed it would be skipped since it has the `cuda` thing in it.
One thing worth spelling out: only the codeblocks (and not the entire file) should be skipped. This might be why longt5 is not skipped! I'll be off for a while, so I leave this in your hands! 🤗🤗
For info: I will take over this PR to try to get it merged sooner.
Converting to draft for now, as more changes to deal with `cuda` are required.
cc @amy @sgugger
I think the PR is ready (you can see the changes I made here), but a few points need to be considered before I merge it.
- No model/dataset is cached as we do for our daily CI (with our own GCP Firestore), so in each PR CI run they will always be re-downloaded.
- A timeout will give status code 124 and the job will fail (red). I am not sure this is really what we want to see on PRs.
  - ~probably there is some hacky way to avoid this. Not sure.~
- We haven't checked that all current files will pass the doctesting. For example, only a subset of modeling files and doc files are tested in our daily doctest run.
  - I assume we don't want to see surprising doctest failures on PR CI. Any suggestions? Like a list of (temporarily) ignored files, or trying to run all the files and fixing all the failures (..?) before this PR is merged?
All very good questions!
> No model/dataset is cached as we do for our daily CI (with our own GCP Firestore), so in each PR CI run they will always be re-downloaded.
That's okay since the test should only be triggered when someone modifies a guide/docstring. This is not in every PR.
> A timeout will give status code 124 and the job will fail (red). I am not sure this is really what we want to see on PRs.
Indeed, if you can filter that one out to be green instead, that would be better.
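One hypothetical way to do that, sketched as a small Python wrapper rather than the PR's actual CI step: run pytest under GNU `timeout` and translate exit code 124 into a green exit.

```python
import subprocess
import sys

# Run the doctest suite with a 20-minute budget (GNU timeout reports 124 when
# the limit is hit); any extra CLI arguments are forwarded to pytest.
result = subprocess.run(
    ["timeout", "1200", "python", "-m", "pytest", "--doctest-modules", "--doctest-glob=*.mdx"]
    + sys.argv[1:]
)
# Treat "ran out of time" as a soft pass so the PR check stays green.
sys.exit(0 if result.returncode == 124 else result.returncode)
```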
> We haven't checked that all current files will pass the doctesting.
The files that are tested on the CI should be present in the list of tests for doctests (that list will be removed one day when we get to 100% coverage, but we're not there yet).
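A tiny sketch of that gating, assuming `utils/documentation_tests.txt` lists one path per line (the helper name is made up):

```python
from typing import List


def keep_only_documented(
    files: List[str], allow_list: str = "utils/documentation_tests.txt"
) -> List[str]:
    """Keep only the files that are opted in to doctesting via the allow list."""
    with open(allow_list) as f:
        documented = {line.strip() for line in f if line.strip() and not line.startswith("#")}
    return [path for path in files if path in documented]
```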