Update XML script + add regression test for XML module

Open Antoinelfr opened this issue 7 months ago • 3 comments

Description

This PR includes:

  • Clean title text of newlines and stray formatting.
  • Fix the author formatting in references.
  • Manually replace \xa0 (non-breaking space) with a regular space.
  • Fix a bug where the section title was empty because the title tag was outside the section tag.
  • Fix the keywords section: exclude the title from the text, remove non-English keywords, collect all keyword lists instead of only the first one, and exclude the abbreviation list.
  • Remove tables and figures from the front and back tags.
  • Add IOA IDs for all passages and fix the "document part" allocation.
  • Fix a bug where chunks of text were not identified because they were not inside section tags.
  • Improve the filtering of text artefacts from unwanted sections.
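
For readers unfamiliar with the first few fixes, here is a minimal sketch of the kind of clean-up described above. It is not the code from this PR: the helper names `clean_text` and `section_title` are hypothetical, it uses the standard-library `xml.etree.ElementTree` rather than whatever parser Auto-CORPus uses, and it assumes a stray title sits immediately before its `<sec>` element.

```python
import re
import xml.etree.ElementTree as ET


def clean_text(text: str) -> str:
    """Replace non-breaking spaces and collapse whitespace runs (hypothetical helper)."""
    # Make the \xa0 replacement explicit, then collapse newlines/tabs/spaces.
    text = text.replace("\xa0", " ")
    return re.sub(r"\s+", " ", text).strip()


def section_title(section: ET.Element, parent: ET.Element) -> str:
    """Return a section's title, falling back to a <title> just before the <sec>.

    Assumption for illustration: when the title tag is outside the section
    tag, it is the sibling directly preceding the section in the parent.
    """
    title = section.find("title")
    if title is None:
        siblings = list(parent)
        idx = siblings.index(section)
        if idx > 0 and siblings[idx - 1].tag == "title":
            title = siblings[idx - 1]
    if title is None:
        return ""
    # itertext() also picks up text nested in inline markup such as <italic>.
    return clean_text("".join(title.itertext()))


# Example: the title is a sibling of <sec>, not a child of it.
body = ET.fromstring(
    "<body><title>Intro\n  text</title><sec><p>Hello\xa0world</p></sec></body>"
)
sec = body.find("sec")
print(section_title(sec, body))
print(clean_text(sec.find("p").text))
```

The fallback only looks one sibling back, so unrelated titles elsewhere in the document are not picked up by accident.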

Fixes # (issue)

Type of change

  • [ ] Documentation (non-breaking change that adds or improves the documentation)
  • [ ] New feature (non-breaking change which adds functionality)
  • [x] Optimization (non-breaking, back-end change that speeds up the code)
  • [x] Bug fix (non-breaking change which fixes an issue)
  • [ ] Breaking change (whatever its nature)

Key checklist

  • [ ] All tests pass (eg. pytest)
  • [ ] The documentation builds and looks OK (eg. mkdocs)
  • [x] Pre-commit hooks run successfully (eg. pre-commit run --all-files)

Further checks

  • [x] Code is commented, particularly in hard-to-understand areas
  • [x] Tests added or an issue has been opened to tackle that in the future. (Indicate issue here: # (issue))

Antoinelfr avatar May 22 '25 15:05 Antoinelfr

Codecov Report

Attention: Patch coverage is 33.33333% with 2 lines in your changes missing coverage. Please review.

| Files with missing lines | Patch % | Lines |
|---|---|---|
| autocorpus/file_processing.py | 33.33% | 2 Missing :warning: |

| Files with missing lines | Coverage Δ |
|---|---|
| autocorpus/parse_xml.py | 37.10% <ø> (ø) |
| autocorpus/file_processing.py | 50.90% <33.33%> (ø) |

... and 1 file with indirect coverage changes

codecov[bot] avatar May 22 '25 15:05 codecov[bot]

Before merging, let me look into the failing Windows test. I will also try to use the IOA implementation from the main AC to avoid redundancy. I also noticed some odd character encoding when parsing the AC paper, which does not appear with the HTML; I will take a look at that as well.

Antoinelfr avatar May 23 '25 07:05 Antoinelfr

Btw mypy is now happy with parse_xml.py, but we're still ignoring it in .pre-commit-config.yaml. Maybe we should enable mypy for it?

@AdrianDAlessandro

alexdewar avatar May 23 '25 14:05 alexdewar

@Antoinelfr I've brought this up to date with main and added one or two small changes. I'd still recommend addressing all of @alexdewar's suggestions before merging.

@Thomas-Rowlands I'll leave this with you now to decide when it's ready to merge.

AdrianDAlessandro avatar Jun 03 '25 18:06 AdrianDAlessandro

I have made some modifications as suggested by @alexdewar. I also created a summary of the next steps here: https://github.com/omicsNLP/Auto-CORPus/issues/294

Antoinelfr avatar Jun 10 '25 15:06 Antoinelfr