Auto-CORPus icon indicating copy to clipboard operation
Auto-CORPus copied to clipboard

Refactor autocorpus class

Open AdrianDAlessandro opened this issue 7 months ago • 1 comments

Description

This PR is refactoring the autocorpus.py so that the Autocorpus class is broken apart into smaller components. This should make it more flexible and easier to test.

Opening in draft for now. Not yet complete

Fixes #185

Type of change

  • [ ] Documentation (non-breaking change that adds or improves the documentation)
  • [ ] New feature (non-breaking change which adds functionality)
  • [ ] Optimization (non-breaking, back-end change that speeds up the code)
  • [ ] Bug fix (non-breaking change which fixes an issue)
  • [ ] Breaking change (whatever its nature)

Key checklist

  • [ ] All tests pass (eg. pytest)
  • [ ] The documentation builds and looks OK (eg. mkdocs)
  • [ ] Pre-commit hooks run successfully (eg. pre-commit run --all-files)

Further checks

  • [ ] Code is commented, particularly in hard-to-understand areas
  • [ ] Tests added or an issue has been opened to tackle that in the future. (Indicate issue here: # (issue))

AdrianDAlessandro avatar May 22 '25 17:05 AdrianDAlessandro

Codecov Report

Attention: Patch coverage is 71.85185% with 76 lines in your changes missing coverage. Please review.

Files with missing lines Patch % Lines
autocorpus/html.py 64.95% 34 Missing and 7 partials :warning:
autocorpus/file_processing.py 53.84% 16 Missing and 2 partials :warning:
autocorpus/file_type.py 78.78% 5 Missing and 2 partials :warning:
autocorpus/config.py 57.14% 3 Missing :warning:
autocorpus/pdf.py 50.00% 3 Missing :warning:
autocorpus/__main__.py 0.00% 2 Missing :warning:
autocorpus/bioc_documents.py 83.33% 0 Missing and 1 partial :warning:
autocorpus/run.py 50.00% 1 Missing :warning:
Files with missing lines Coverage Δ
autocorpus/abbreviation.py 86.57% <ø> (ø)
autocorpus/autocorpus.py 76.66% <100.00%> (+31.25%) :arrow_up:
autocorpus/bioc_formatter.py 100.00% <100.00%> (ø)
autocorpus/data_structures.py 100.00% <100.00%> (ø)
autocorpus/reference.py 100.00% <100.00%> (ø)
autocorpus/section.py 92.96% <100.00%> (-0.56%) :arrow_down:
autocorpus/table.py 86.14% <100.00%> (+1.16%) :arrow_up:
autocorpus/utils.py 73.49% <100.00%> (ø)
autocorpus/bioc_documents.py 95.23% <83.33%> (ø)
autocorpus/run.py 17.64% <50.00%> (+0.98%) :arrow_up:
... and 6 more

... and 2 files with indirect coverage changes

:rocket: New features to boost your workflow:
  • :snowflake: Test Analytics: Detect flaky tests, report on failures, and find test suite problems.

codecov[bot] avatar May 22 '25 17:05 codecov[bot]

This is looking good so far... Much cleaner. I've stuck some comments in here and there.

I've made changes or replied to each inline comment

I'm not sure what your plan is for autocorpus.py, but I think you could extract some more of those methods (I'm thinking of the table-related ones in particular). Ultimately I think we could just have the output of AC be represented by a dataclass (or maybe multiple) instead of storing the results as fields in the Autocorpus class (which I'd be in favour of dropping altogether) then retrieving them later.

This was pretty much my goal. But the table-related methods do all kinds of recursive updating of the fields and it was making my head spin trying to work out how to resolve it without breaking the code. Especially because that section of the code is not currently covered by the public regression test. It would be useful to get #265 merged for this reason.

AdrianDAlessandro avatar May 29 '25 17:05 AdrianDAlessandro

@alexdewar I managed to do it. We now have all the file and html processing logins out in separate modules and the Autocorpus class is a dataclass!

AdrianDAlessandro avatar May 29 '25 21:05 AdrianDAlessandro

PS - I've run the new regression tests and they're failing on four files. Interestingly, they're failing because of an uncaught exception, not because the outputs don't match.

The function the exception is emanating from is get_table_json (which you actually seem not to have changed in this PR):

        for i, table in enumerate(soup_tables):
            if "class" in table["node"].attrs:
                if "table-group" in table["node"].attrs["class"]:
                    pop_list.append(i)
            if table["node"].find_all("tbody") == []:
                pop_list.append(i)
                empty_tables.append(table)
        soup_tables = [table for i, table in enumerate(soup_tables) if i not in pop_list]
    
        for etable in empty_tables:
            # has a table element, not empty
>           if not etable["node"].find("table"):
E           KeyError: 'node'

I've pushed my merged branch (called merge-for-adrian) if you want to take a closer look.

alexdewar avatar May 30 '25 19:05 alexdewar

PS - I've run the new regression tests and they're failing on four files. Interestingly, they're failing because of an uncaught exception, not because the outputs don't match.

Sorted! Turns out it was because that loop would never possibly work. I've deleted it because it doesn't make any sense what it was meant to do.

AdrianDAlessandro avatar Jun 02 '25 14:06 AdrianDAlessandro