Refactor autocorpus class
Description
This PR is refactoring the autocorpus.py so that the Autocorpus class is broken apart into smaller components. This should make it more flexible and easier to test.
Opening in draft for now. Not yet complete
Fixes #185
Type of change
- [ ] Documentation (non-breaking change that adds or improves the documentation)
- [ ] New feature (non-breaking change which adds functionality)
- [ ] Optimization (non-breaking, back-end change that speeds up the code)
- [ ] Bug fix (non-breaking change which fixes an issue)
- [ ] Breaking change (whatever its nature)
Key checklist
- [ ] All tests pass (eg.
pytest) - [ ] The documentation builds and looks OK (eg.
mkdocs) - [ ] Pre-commit hooks run successfully (eg.
pre-commit run --all-files)
Further checks
- [ ] Code is commented, particularly in hard-to-understand areas
- [ ] Tests added or an issue has been opened to tackle that in the future. (Indicate issue here: # (issue))
Codecov Report
Attention: Patch coverage is 71.85185% with 76 lines in your changes missing coverage. Please review.
| Files with missing lines | Coverage Δ | |
|---|---|---|
| autocorpus/abbreviation.py | 86.57% <ø> (ø) |
|
| autocorpus/autocorpus.py | 76.66% <100.00%> (+31.25%) |
:arrow_up: |
| autocorpus/bioc_formatter.py | 100.00% <100.00%> (ø) |
|
| autocorpus/data_structures.py | 100.00% <100.00%> (ø) |
|
| autocorpus/reference.py | 100.00% <100.00%> (ø) |
|
| autocorpus/section.py | 92.96% <100.00%> (-0.56%) |
:arrow_down: |
| autocorpus/table.py | 86.14% <100.00%> (+1.16%) |
:arrow_up: |
| autocorpus/utils.py | 73.49% <100.00%> (ø) |
|
| autocorpus/bioc_documents.py | 95.23% <83.33%> (ø) |
|
| autocorpus/run.py | 17.64% <50.00%> (+0.98%) |
:arrow_up: |
| ... and 6 more |
... and 2 files with indirect coverage changes
:rocket: New features to boost your workflow:
- :snowflake: Test Analytics: Detect flaky tests, report on failures, and find test suite problems.
This is looking good so far... Much cleaner. I've stuck some comments in here and there.
I've made changes or replied to each inline comment
I'm not sure what your plan is for
autocorpus.py, but I think you could extract some more of those methods (I'm thinking of the table-related ones in particular). Ultimately I think we could just have the output of AC be represented by a dataclass (or maybe multiple) instead of storing the results as fields in theAutocorpusclass (which I'd be in favour of dropping altogether) then retrieving them later.
This was pretty much my goal. But the table-related methods do all kinds of recursive updating of the fields and it was making my head spin trying to work out how to resolve it without breaking the code. Especially because that section of the code is not currently covered by the public regression test. It would be useful to get #265 merged for this reason.
@alexdewar I managed to do it. We now have all the file and html processing logins out in separate modules and the Autocorpus class is a dataclass!
PS - I've run the new regression tests and they're failing on four files. Interestingly, they're failing because of an uncaught exception, not because the outputs don't match.
The function the exception is emanating from is get_table_json (which you actually seem not to have changed in this PR):
for i, table in enumerate(soup_tables):
if "class" in table["node"].attrs:
if "table-group" in table["node"].attrs["class"]:
pop_list.append(i)
if table["node"].find_all("tbody") == []:
pop_list.append(i)
empty_tables.append(table)
soup_tables = [table for i, table in enumerate(soup_tables) if i not in pop_list]
for etable in empty_tables:
# has a table element, not empty
> if not etable["node"].find("table"):
E KeyError: 'node'
I've pushed my merged branch (called merge-for-adrian) if you want to take a closer look.
PS - I've run the new regression tests and they're failing on four files. Interestingly, they're failing because of an uncaught exception, not because the outputs don't match.
Sorted! Turns out it was because that loop would never possibly work. I've deleted it because it doesn't make any sense what it was meant to do.